John Ioannidis on Statistical Significance, Economics, and Replication
Jan 22 2018

John Ioannidis of Stanford University talks with EconTalk host Russ Roberts about his research on the reliability of published research findings. They discuss Ioannidis's recent study on bias in economics research, meta-analysis, the challenge of small sample analysis, and the reliability of statistical significance as a measure of success in empirical research.

Andrew Gelman on Social Science, Small Samples, and the Garden of the Forking Paths
Statistician, blogger, and author Andrew Gelman of Columbia University talks with EconTalk host Russ Roberts about the challenges facing psychologists and economists when using small samples. On the surface, finding statistically significant results in a small sample would seem to...
Ed Yong on Science, Replication, and Journalism
Ed Yong, science writer and blogger at "Not Exactly Rocket Science" at Discover Magazine, talks with EconTalk host Russ Roberts about the challenges of science and science journalism. Yong was recently entangled in a controversy over the failure of researchers...
Explore audio transcript, further reading that will help you delve deeper into this week’s episode, and vigorous conversations in the form of our comments section below.


Luke J
Jan 22 2018 at 4:43pm

Didn’t Gelman say p-values by definition presume noise? In other words, when there is no effect, the probability that the data produces X findings is less than .05?

I think I am still unclear on this.

Dr golabki
Jan 22 2018 at 7:15pm

There’s an interesting tension in this episode between two of the major recurring themes of EconTalk:
1. The challenges of interpreting data when experiments/analysis are not clear and well planned ahead of time.
2. The risk of increasing government control over private enterprise creating negative consequences.

In my view the heavy involvement of the FDA and other regulatory bodies is the primary driver for the pharma industry using well-controlled trials with pre-specified endpoints as the primary method to demonstrate a drug’s effectiveness.

I wonder if Russ agrees and if so, if he has any thoughts why the FDA has been able to maintain this “gold standard” in the face of significant political pressure.

Dr Golabki
Jan 22 2018 at 7:22pm

@ Luke

p-values presume noise… if you didn’t presume noise you wouldn’t need statistical significance… you’d just be able to measure the magnitude of the effect.
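Luke's definition can be checked directly by simulation: when the null is true (pure noise), about 5% of experiments still clear p &lt; .05, by construction. A minimal sketch, assuming a simple two-sided z-test — the setup is illustrative, not from the episode:

```python
# When there is no effect at all, the p-value is uniformly distributed,
# so roughly 5% of null experiments still come out "significant" at .05.
import math
import random

random.seed(42)

def p_value(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

n_experiments = 20_000
false_positives = 0
for _ in range(n_experiments):
    z = random.gauss(0, 1)  # test statistic when the null is true
    if p_value(z) < 0.05:
        false_positives += 1

false_positive_rate = false_positives / n_experiments
print(f"significant at p<.05 with no real effect: {false_positive_rate:.3f}")
```

The printed rate hovers around 0.05, which is exactly the "probability of X findings when there is no effect" that Luke was asking about.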

Greg G
Jan 23 2018 at 7:55am

I agree with Dr. Golabki that there is an obvious tension between those two basic EconTalk themes he cites. And the point doesn’t depend on your view of the FDA.

As to the first theme, I think that EconTalk provides a very valuable service by showing us how difficult it is to pull cause and effect apart in complex systems as it did in this podcast. And how the resulting inability to disprove cause and effect in complex systems feeds confirmation bias of all types.

But many of the same people urging this caution often suddenly develop a surprising confidence in how much better any number of counterfactuals would work out in the absence of any government action.

Todd
Jan 23 2018 at 9:08am

Loved the episode. This is perhaps my favorite type of discussion that Russ Roberts does as part of the exploration of knowledge. In my career I work with statistics almost daily and continuously help engineers and quality professionals improve their thinking (hopefully I am helping – only tangentially with statistics). When it comes to statistical studies, I follow and teach two rules for determining validity: replication and a theoretical explanation for the result. If either of those is missing, any ‘significance’ is typically false.

And even in my small sample-size experience of maybe a couple hundred studies, there has been p-hacking, selective data elimination (beyond outliers or known false data), post hoc theoretical modelling (math gymnastics to manufacture a model to fit the data), even outright fake data. Humans are built to find the pattern in any data, and scientists/engineers are masters at it. Stay humble and keep digging, because there are always some really cool new things to learn. Really appreciated the episode.
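The p-hacking Todd mentions can be made concrete with a toy simulation: take many "looks" at pure noise (subgroups, alternative outcomes, alternative specifications) and keep only the best p-value. The 20-look setup below is a made-up illustration, not any real study's design:

```python
# Multiple looks at pure noise: keep the smallest p-value found.
import math
import random

random.seed(0)

def p_value(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def study(n_looks):
    # try n_looks independent analyses of pure noise, report the best one
    return min(p_value(random.gauss(0, 1)) for _ in range(n_looks))

trials = 5_000
hacked = sum(study(20) < 0.05 for _ in range(trials)) / trials
print(f"chance of a 'significant' finding from 20 looks at noise: {hacked:.2f}")
```

With 20 independent looks at noise, the chance of at least one "significant" result is 1 − 0.95^20 ≈ 64% — which is why the independent confirmation step Todd insists on matters so much.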

J Scheppers
Jan 23 2018 at 10:54am

Dr. Roberts:

I will attempt to explain my different expectation of 95% certainty and why only 8% to 9% replicate.

The conclusion of most standard statistical tests is that there is less than a 5% chance that a result this extreme would be produced if the true effect were zero.

Imputing weight to the magnitude of the result does not hold unless specifically tested, and again would need to be done with a range of certainty.

A test that replicated the experiment 100 times, and in 95 or more of those trials measured a result above zero, would support the conclusion of rejecting the null hypothesis in the original study. The expectation that the experiment could be rerun with the same order of magnitude of result and the same level of confidence is not the conclusion of a standard null hypothesis test.

This demonstrates the weakness of standard conclusions. My expectation is that, in the best case, only 50% would replicate. Note also that p-hacking can be done without bad intentions: 100 scientists study a popular theory in the same manner. 5 of the 100 find the effect that had been theorized and successfully publish their papers.

Todd’s rules of replication add great value. I see great value in searching for patterns in data, and it is science to find those patterns; I am not even opposed to interim p-hacking. But it is the independent test after you are done searching that can measure certainty. P-hacking cannot be eliminated, but replication can help police p-hacking’s proper role. In fact, the separation of the replication from the original finding and theory is highly powerful statistical evidence.

Effective action or certainty seems elusive after a simple null hypothesis test, which the academy frequently uses as a call to action.

Luke J
Jan 23 2018 at 9:33pm

Isn’t the magnitude of the effect what we want to measure? I thought that is the point of a study.

Jacob M
Jan 24 2018 at 12:24am

I might be missing something, but isn’t the most straightforward improvement in a lot of these cases to create training and testing sets for the data? You could even have a journal or other trusted third party hold onto the testing set, so that p-hacking wouldn’t generalize and the model would fail. I find it strange that a practice so common within data science doesn’t yet seem to have penetrated economics, even though the two fields are related.

The third party could even make the unpublished results available if the researchers are not able to publish, allowing a “journal of insignificance” that Russ has joked about in previous episodes.
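Jacob's holdout idea can be sketched roughly as follows. The sizes and the "analysis" here are made up — a simple best-correlation scan over pure-noise predictors — but they show how a held-out set catches an overfit finding:

```python
# Search for a "significant" predictor on a training set, then check
# whether it survives on a test set the searcher never touched.
# Every variable here is pure noise, so an honest holdout should show
# the training-set "find" evaporating.
import random

random.seed(1)

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

n_train, n_test, n_predictors = 50, 50, 200
outcome_train = [random.gauss(0, 1) for _ in range(n_train)]
outcome_test = [random.gauss(0, 1) for _ in range(n_test)]
predictors = [([random.gauss(0, 1) for _ in range(n_train)],
               [random.gauss(0, 1) for _ in range(n_test)])
              for _ in range(n_predictors)]

# "p-hack" on the training set: keep the predictor with the biggest correlation
best = max(predictors, key=lambda p: abs(corr(p[0], outcome_train)))
train_r = corr(best[0], outcome_train)
test_r = corr(best[1], outcome_test)
print(f"best training correlation: {train_r:+.2f}, on held-out data: {test_r:+.2f}")
```

The winning predictor looks impressive in-sample precisely because it was selected from 200 candidates, and the held-out correlation collapses back toward zero — the "model would fail" behavior Jacob describes.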

James Pass
Jan 24 2018 at 2:57pm

Excellent guest and discussion. It’s only January but I have a feeling this might be the episode I pick as “Best of 2018.”

I have to say, though, that I was shocked to hear Russ say that he knows “many economists” who don’t understand the problems with small samples. I assume that economists must be fairly proficient in statistics and data analysis, so how can this be?

Dr Golabki
Jan 24 2018 at 5:12pm

@ James Pass

Most economists know a fair amount about statistics. I think Russ and John made a few points here:
1. There are some non-obvious effects of sample size (e.g. small samples increase the magnitude of “significant” results) of which quite a lot of people are unaware. At least I hope so, since I was unaware of it!
2. Many economists use traditional statistical tests even when they are not appropriate, and justify it either by ignoring the problems with the method or by saying “it’s not strictly correct, but it’s much simpler and the problems are relatively small in this case”.

Ultimately I think this boils down to the journals. If academic journals don’t hold a high standard, researchers will always justify cutting corners to themselves.

Norlin
Jan 24 2018 at 8:37pm

Two questions about improving the current system:

1. P-values can be calculated, right? If so, why do we need a hard threshold of .05 or .005 for the p-value? Why not just say: “this test passed P=.05 and P=.01, but failed the null at P=.005”? Wouldn’t this be better than a test that passes P=.05 but fails the null at P=.01?

2. To encourage others to replicate the study, would it be better if the editor pre-commits to publishing at least one replicating study, whether it confirms or disputes the original findings? They should also bundle the replicating study with the original, so anyone who downloads one gets the pair by default.
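On Norlin's first question: yes, the exact p-value is computable, and reporting it carries strictly more information than a pass/fail verdict at some threshold. A minimal sketch, where the test statistic (z = 2.41) is invented purely for illustration:

```python
# Report the exact p-value instead of pass/fail at arbitrary thresholds.
import math

def p_value(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

z = 2.41  # hypothetical test statistic from some study
p = p_value(z)
print(f"exact p = {p:.4f}")
for threshold in (0.05, 0.01, 0.005):
    verdict = "rejects" if p < threshold else "fails to reject"
    print(f"  at alpha = {threshold}: {verdict} the null")
```

Here the single number p ≈ .016 conveys everything the three separate threshold verdicts do, which is exactly Norlin's point.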

Dr Golabki
Jan 25 2018 at 8:12am

@ Norlin

Getting replication studies published in top journals is part of the problem, but it’s also cultural. More important, though, is the funding problem. Who pays for replication work?

Jan 25 2018 at 11:54am

@ Dr Golabki

[Who pays for replication work?]


(sorry, couldn’t resist)

James Pass
Jan 25 2018 at 1:44pm

Dr. Golabki wrote: “small samples increase the magnitude of ‘significant’ results which quite a lot of people are unaware.”

I agree that quite a lot of “everyday, normal human beings” are unaware of this, but I expect economists (or any kind of researcher) to be very much aware of it. It’s an intuitive, basic concept that should be taught in an introductory statistics course. The concept can easily and consistently be demonstrated merely by flipping a coin ten times.
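James's coin-flip demonstration, written out: a fair coin flipped only 10 times produces a "suspicious" 70/30 (or worse) split surprisingly often, while 100 flips almost never do. A quick sketch:

```python
# Small samples make a fair coin look biased far more often than large ones.
import random

random.seed(7)

def extreme_share(n_flips, trials=20_000):
    """Share of fair-coin experiments landing at >=70% heads or tails."""
    hits = 0
    for _ in range(trials):
        heads = sum(random.random() < 0.5 for _ in range(n_flips))
        if heads >= 0.7 * n_flips or heads <= 0.3 * n_flips:
            hits += 1
    return hits / trials

small = extreme_share(10)
large = extreme_share(100)
print(f"fair coin looks 'biased' (>=70/30 split): n=10: {small:.2f}, n=100: {large:.4f}")
```

About a third of the 10-flip runs look lopsided purely by chance, while the 100-flip runs essentially never do — the intuitive, introductory-course fact James expects researchers to know.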

Maybe I’m expecting too much, but when you say it’s a “non-obvious effect,” I’m inclined to think it isn’t obvious only for people who have never studied statistics. I mean “obvious” in a certain sense, in the way something can seem obvious after making careful empirical observations, thinking about it, testing and verifying. For example, in the way that it became “obvious” that the Earth was round, even though it wasn’t immediately obvious. (My apologies to anyone here who is a member of the Flat Earth Society, who ironically have to go to extraordinary lengths to prove that the Earth is flat, as it “obviously” appears to be.)

Dr. Golabki wrote: “Many economists often use traditional statistical tests even when not appropriate”

I’m sorry to hear that. I agree with you that academic journals should have high standards.

Kevin
Jan 25 2018 at 1:56pm

I enjoyed the discussion. Epidemiologists have struggled with these problems. I am strongly on the side of the guest – most of what we do is garbage but this is true across the majority of science even if the methods are good. Most science does not accumulate new useful knowledge. The problem is how much wrong knowledge we have that distracts people.

For example, in a certain textbook I read that a certain cancer was associated with being in the laundry industry. I reviewed the source of that finding and determined that it was from a single study that was poorly powered and completely exploratory. The authors of the paper were appropriately modest and said that this was preliminary and should not be considered real. But there it was in a textbook, for people to memorize for exams. Too many results that the authors themselves discount enter the culture and the confusion of the media and internet.

One of the issues raised did not seem correct to me, though I may be misunderstanding. If you have a small sample of men and women and compare heights, it is possible to see a difference (if you select one woman and one man this is easy). However, if the sample is RANDOM, then the standard error will include the uncertainty contained within a small sample, and you will not cross a threshold of significance. A random sample of men and women can be very small and show a height difference in favor of men, because that is the real population result. The probability of a significant result in the wrong direction goes down dramatically as the sample size increases, and in a case like this, without running the numbers, I bet the probability of the women appearing taller than the men is 1/1000 with just a random sample of about 5-7. So smaller samples have a higher chance of random outcomes in the wrong direction or magnitude, but if they are random then the standard error should account for that. I don’t see that as a particular criticism of methods – it only becomes a problem with repeated samples, as you sift through the data, and with comparisons compromised by lots of regression analysis. However, even in this case there exist known methods to account for repeated sampling or testing, but they tend not to be employed by economists or epidemiologists because of the difficulty of deciding what counts as a trial. I think a bigger problem might be how rarely the data are random.
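Kevin's 1/1000 guess can be checked by simulation. The height figures below are rough assumed values (men ~175cm, women ~162cm, sd 7cm each), not data from any study:

```python
# How often does a truly random sample of 5 men and 5 women point the
# wrong way (sample mean of women taller than sample mean of men)?
import random

random.seed(3)

def sample_mean(mu, sd, n):
    return sum(random.gauss(mu, sd) for _ in range(n)) / n

trials = 100_000
wrong_direction = sum(
    sample_mean(162, 7, 5) > sample_mean(175, 7, 5) for _ in range(trials)
) / trials
print(f"P(sample says women taller, n=5 each): {wrong_direction:.4f}")
```

With ~13cm between the assumed population means and n = 5 per group, the wrong-direction probability comes out around 0.002 — in the ballpark of Kevin's back-of-envelope 1/1000.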

I think a new p-value threshold is an interesting idea, but decreasing the holy standing of the p-value would also be important. Robustness analysis, being clearer about how many analyses were done to get the results, etc., I view as ultimately more useful. In my own reading I tend to review risk-factor studies by assuming that the value in the confidence interval closest to parity represents the real strength of the association. So if someone says eggs increase risk of heart disease by 10 (1.05-30), the real association is 1.05 – within the range of noise and so meaningless.

James Pass
Jan 25 2018 at 2:23pm

Dr. Golabki wrote: “More importantly though, is the funding problem. Who pays for replication work?”

You beat me to it. It’s hard enough to get funding for the original study, let alone replication studies. Funding problems are also a reason for small sample studies.

Sometimes I wonder if we focus too much on economic studies, partly because many of the studies are contradictory and partly because they can distract our attention from developing workable solutions.

For example, consider all the contradictory studies and claims on minimum wage. How about we get some notable economists with different philosophies (liberal, conservative, libertarian) and hash out various approaches to our most pressing issues? How about we have a national discussion about important issues? A majority of Americans may be uninformed and apathetic, but there are plenty of Americans who are informed and care, more than enough to sustain a national discussion.

Debates on minimum wage studies are a proxy for the larger issue of being able to support oneself without any assistance from government, charity or family. What should society do for people who work full-time but still can’t afford the basics of food, shelter, medical care and transportation? Does arguing over minimum wage studies help us develop a comprehensive solution to problems like this?

James Pass
Jan 25 2018 at 2:48pm

Kevin wrote: “The authors of the paper were appropriately modest and said that this was preliminary and should not be considered real. But there it was in a textbook for people to memorize for exams.”

Wow, that’s unfortunate. Just the other day I heard yet another reference to the sad fact that many people still believe there is a link between child vaccines and autism, even though it was found that the study showing a link was not merely faulty, but fraudulent.

Kevin wrote: “if the sample is RANDOM then the standard error will include the uncertainty contained within a small sample and you will not cross a thresh hold of significance.”

I could be wrong, but I think the point being made was that some researchers do NOT include the standard error in small samples and therefore find significance even when there is none. Or, the results are reported in journals (or newspaper articles, magazines, textbooks) without also mentioning the standard error (which you could see only if you examine the original study). Statements like these are exactly why I found some of the discussion quite shocking.

Russ has always admitted that he’s very skeptical about studies, and sometimes he admits that he might be too skeptical. Mr. Ioannidis said he has found problems with many studies, but I don’t know if he is as skeptical as Russ.

Russ Roberts
Jan 25 2018 at 6:12pm

Dr. Golabki and Greg G,

I don’t consider the FDA the gold standard that makes the pharmaceutical industry successful. Economists have argued for a long time that the FDA is too conservative and that its caution has resulted in many needless deaths.

As for the general tension between the two principles, I reconcile that tension by conceding that my views on the efficacy of government are based on certain principles I believe to be true and that may be inaccurate in their magnitude or generality. They are not scientific. I have no pretense about their precision. I have a philosophical perspective based on first principles about the nature of human beings, power, competition, and so on. I try not to be overly optimistic about what kind of world would result if my principles governed how the State’s business is conducted and the scope of that business. I recognize that is easy for me because I have no power and there is no historical example of those principles being implemented. My confidence, such as it is, is based on a belief in the reliability of my underlying principles. I think that approach–recognizing that the economics of public policy is more like history than physics–is better than the approach of those who think it’s more like physics than history.

Jan 26 2018 at 9:37am

One general comment: this podcast focuses a lot on reasons to be skeptical, and it would be nice to have some episodes where the focus is less about poking holes and more about what the right way forward is. This episode started to get to that at the very end, but there wasn’t much time left.

And one specific area where I’m confused and the commenters might help: I don’t know what the upshot is for how to behave given the upward bias in statistically significant magnitudes. And is the answer to that different for consumers of research and the researchers themselves? I would think that knowing there’s an average overestimate of magnitudes in significant results means that if I read a study I should somehow apply a reduction to the results. On the other hand, if I’m a researcher who measures something and finds a significant effect, I wouldn’t think that’s the right approach; I would think my point estimate is the best possible estimate given the data. (Obviously assuming I’m doing everything else right.) And note that the bias comes not from overemphasizing the high results but from discounting the low ones, so reducing my estimate would be counterintuitive. Am I thinking about this right? And if so, how do I reconcile that difference?
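One way to see the puzzle concretely: simulate an underpowered literature in which every individual estimate is unbiased, then condition on significance. The numbers (a true effect of 1 standard error, two-sided 5% cutoff) are illustrative assumptions:

```python
# Every study's estimator is unbiased; only the significance filter
# creates the upward bias in published magnitudes.
import random

random.seed(5)

TRUE_EFFECT = 1.0  # in standard-error units; implies low power (~17%)
CUTOFF = 1.96      # two-sided 5% significance

estimates, significant = [], []
for _ in range(50_000):
    est = random.gauss(TRUE_EFFECT, 1)  # unbiased estimate, se = 1
    estimates.append(est)
    if abs(est) > CUTOFF:
        significant.append(est)

all_mean = sum(estimates) / len(estimates)
sig_mean = sum(significant) / len(significant)
print(f"true effect: {TRUE_EFFECT}")
print(f"mean of all estimates:    {all_mean:.2f}")  # close to the truth
print(f"mean of significant ones: {sig_mean:.2f}")  # well above the truth
```

The unconditional average is right, but the subset that clears the significance bar averages roughly two and a half times the truth. That reconciles the commenter's tension: the reader of published (i.e., filtered) results has reason to discount, even though no individual researcher's estimator is biased before the filter is applied.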

Greg G
Jan 26 2018 at 10:48am


Thanks for the reply. I agree with everything you said there about the tension between principles. One of the reasons I am such a big fan of EconTalk is that I think you are one of the very best and fairest interviewers in the business. I like the fact that you warn us constantly that confirmation bias can affect everyone and you do not claim any special exemption from it.

Of course, as economists are often the first to realize, there are always trade-offs. Most people are not skeptical enough about empirical studies claiming scientific authority, and you are doing important work to point that out. But at least an empirical study will sometimes turn out differently than we expect. An application of our philosophical principles is even less likely to challenge our beliefs.

In the end though, there is less certainty for all of us than we would like and we still all rely on our intuitions to bridge that gap.

Dr Golabki
Jan 26 2018 at 5:03pm


I greatly appreciate the reply.

One point of clarification. I didn’t intend to call the FDA “the gold standard.” I meant to say that “the gold standard” for clinical (or other) studies is well-controlled, appropriately powered, double-blinded trials with pre-specified endpoints. In late-stage clinical trials that’s really the norm (although not universal), I think because of the FDA.

I’d love to hear an EconTalk episode where you go deeper on this. One interesting example was the 2016 approval of EXONDYS 51 (a drug for a horrible genetic disease), where patient advocacy and political pressure seemed to trump the data (at least in the view of some).

James Pass
Jan 26 2018 at 6:00pm

[Comment removed. Please consult our comment policies and check your email for explanation.–Econlib Ed.]

Jon B
Jan 27 2018 at 6:38pm

It’s interesting to watch the “experts” in the neurosciences struggle with the issues of adequate power and replication. The standard responses are larger studies and more centralized data collection, so that it is easier to compare studies and aggregate data across sites.

Great idea but because the NIH/FDA/academic industrial complex to some degree is crowding out independent thought with layers and layers of regulatory review and incestuous peer-reviewed funding, this process simply accelerates a monoculture of research methodology.

To some degree, there is no such thing as noise. We live in a deterministic universe on macro scales. We need to invent better research tools so that it is not necessary to have large, expensive studies. We then measure precisely what we need to know.

Replication discussions are so often a waste of time. Sorry. Please spend our finite intellectual capital on inventing more powerful research tools where an N of 1-10 is adequate to drive scientific advances.

Let a thousand flowers bloom. Make mistakes. Produce terrible studies. This is how we have learned for centuries….

Jan 27 2018 at 8:06pm

@Golabki & all

Who pays for replication work?

Didn’t you listen? There’s a replication crisis going on – this means there’s too much funding that pays for junk studies.

Whoever funds all that (hopefully more the interested enterprises and less the taxpayer) should redirect some of those funds towards replication – the details can be worked out.

Jan 29 2018 at 11:17pm

Hello Russ,

I really enjoyed listening to this episode, and thumbs up to Professor Ioannidis for his work, which is way overdue in the fields of empirical finance and economics.
I would be interested whether John has actually found any meaningful difference in the power of the studies published by tenure-track candidates vs. tenured professors. If pressure to publish really drives some of these results, the former group should stand out in this respect.
I replicate a lot of academic papers in my role at a quantitative investment firm and have made the (non-quantitative) observation that younger academics (often on the tenure track), particularly at prestigious universities, produce non-replicable statistical work.

Comments are closed.





Podcast Episode Highlights

Intro. [Recording date: December 21, 2017.]

Russ Roberts: Before introducing today's guest, I want to encourage listeners to fill out our annual survey voting for your favorite episodes of 2017, as well as telling us about yourself and your listening experience at EconTalk. So, please go to and in the upper left-hand corner you'll find a link to the survey. Thank you for a great year. And I hope to make 2018 better.


Russ Roberts: My guest is John Ioannidis.... His 2005 paper from PLoS [Public Library of Science] Medicine, "Why Most Published Research Findings Are False," has been the most accessed article in the history of PLoS, with 2.5 million hits. And he claims, or perhaps concedes, that he loves to be constantly reminded that he knows next to nothing--an attitude I try to embrace as well. He is also the author, along with T. D. Stanley and Hristos Doucouliagos, of a recent paper in The Economic Journal, titled "The Power of Bias in Economics Research," which is going to be our main subject for today, although I'm sure we'll get into many other things. John, welcome to EconTalk.... Now, we are in a very interesting time for science and social science. It's been a subject--what is sometimes called the replication crisis--in psychology, and now spreading into other fields. It's been a frequent topic on this program. It comes down to a concern--going back to your 2005 paper--that indeed most published results are false. Let's talk about what you've been examining recently in the economics literature in your recent paper and what you found.

John Ioannidis: So, in the paper that we just published, we looked at all the meta-analyses on economics literature topics that we could identify, and we found 159 topics that had been subjected to such evidence synthesis of whatever data had been available. That included more than 6700 empirical studies, and about 64,000 estimates of economic parameters. And, we basically tried to use these evidence syntheses--the meta-analyses--as a tool to understand, first of all, how big are the studies done? How well-powered are they to detect average, typical effects that might be circulating out there? And, also, what would that mean in terms of estimating the potential for bias that could be generated? So, how different would the results be if one were to focus on well-powered studies as compared to the full mix of all studies that were available? So, we saw a pattern that we have seen in other fields, in the sense that the footprint of the economics literature that we analyzed was pretty similar to the footprint of the neuroscience literature. Even though economics has very little to do with neuroscience, apparently, they both have the same pattern of using mostly small studies and underpowered studies, and pretty similar patterns of bias. So, we estimated a median statistical power of about 18%. And we also found that it is plausible that the large majority of the reported effect sizes in these topics are substantially exaggerated. It's very common to see an exaggeration of 2-fold. And about 30%, or a third, would be inflated by 4-fold. So--
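The headline numbers here (median power ~18%, typical 2-fold exaggeration) can be roughly reproduced under a simple normal model of an estimator with a two-sided 5% test. This is a sketch under that assumption, not the paper's actual methodology:

```python
# Analytic link between statistical power and the expected exaggeration
# of estimates that happen to reach significance, for a normal estimator.
import math

def Q(x):
    """Upper-tail probability of the standard normal."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def power_and_exaggeration(effect, cutoff=1.96):
    """effect = true effect measured in standard-error units."""
    power = Q(cutoff - effect) + Q(cutoff + effect)
    # expected magnitude of the estimate, conditional on significance
    exp_mag = (effect * (Q(cutoff - effect) - Q(cutoff + effect))
               + phi(cutoff - effect) + phi(cutoff + effect)) / power
    return power, exp_mag / effect

for effect in (0.5, 1.0, 1.5, 2.0, 2.8):
    power, exaggeration = power_and_exaggeration(effect)
    print(f"power {power:5.1%} -> significant estimates exaggerate ~{exaggeration:.1f}x")
```

Under this model, power near 18% implies significant estimates overshoot the true effect by roughly 2- to 2.5-fold, and power below 10% implies 4-fold and worse — the same ballpark as the figures Ioannidis quotes, while well-powered studies (80%) exaggerate only slightly.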


Russ Roberts: Well, let's back up a little bit for listeners who aren't as versed in these kinds of issues as some might be. I want to talk about two issues, one very basic, and then one that's also basic but quite challenging. So, I want to start with the meta-analysis. So, when you looked at 64,000 estimates, you were not personally combing through hundreds and hundreds and maybe thousands of papers and looking at, say, to take one example, the impact of the minimum wage on employment in 132 different studies. What you did is you took one study that looked at all these studies, and used that as your basis. So, meta-analysis is often justified as a way to avoid the problem of: 'Well, it's just one study. So, we're going to not just look at one study and not claim it's the best one or the one that confirms my biases.' But rather, 'I'm not going to just look at this one. I'm going to look across the whole literature and look at an average effect.' And you were doing this across an incredibly diverse set of areas of economic research, right?

John Ioannidis: That's true. So, these are topics which someone else had already decided that a meta-analysis was worthwhile doing, and had done one. You know, at the same time, this gives the advantage of having some information that has already been collected. But we also need to reanalyze it in a standardized fashion, so that the calculations would be compatible across all these 159 topics. So, you know, you can use these data to apply the same methods for synthesis and for understanding the weights and the heterogeneity and perhaps some tests for publication bias, using the same tools and the same methods across all these topics.

Russ Roberts: And--

John Ioannidis: And the fact that there is a meta-analysis already means that we are talking about a literature that may not be fully representative of everything that exists in economics. So, these meta-analyses predominantly cover observational designs, not necessarily experimental designs. But this may not necessarily be so far off from the overall economics literature, where observational nonrandomized designs are far more common compared to experimental designs. And a second issue is that these meta-analyses pertain to topics that probably on average have more studies than the average topic that has been studied in the literature. There are probably a lot of topics where you have a single paper and nobody wants to do a second one.

Russ Roberts: Correct; and these are--

John Ioannidis: And it's not going to be the typical situation that you will see in a meta-analysis, where you have 130 studies. Even though these 130 studies will not be exactly the same.


Russ Roberts: I just want to digress for a minute, and then we'll go deeper into the issue of power and statistical significance. But to digress for a minute: Is there--what's your feeling about the--you notice I said 'feeling.' It's a somewhat subjective question. What's your feeling about the use of meta-analysis as a way to overcome this issue of, just, "it's just one study"? Isn't it--if everyone is using the same methodology and has made the same mistake, a meta-analysis isn't any more comforting than a single study. So, what do we know about--do we know anything about, say, the meta-analysis of some area of economics? Is there any reason to think that a meta-analysis is more reliable than a random draw from the individual studies?

John Ioannidis: So, I need to extrapolate probably from economics and look at the meta-analysis literature and theoretical comparisons of meta-analysis against single studies, large studies, or different types of designs in other areas of science. Because, in a way, meta-analysis has probably been under-used in economics compared to other fields. So, we could find 159 topics, while if you go to medicine, there are about 100,000 meta-analyses that have been published to date, or done. That's actually 3 orders of magnitude more, compared to what has been done in economics. So, we have far more experience from meta-analysis in some fields compared to others. And we know, also, more about the caveats and the strengths and the weaknesses of meta-analysis. I cannot answer in a black and white fashion that meta-analysis is always better than a single study. I mean, obviously, if you have an extremely well done study and everything has been very thoughtfully taken care of, and the way that it's run and analyzed is perfect, and on the other hand you have a meta-analysis of small, messy, horrible studies--I cannot really claim that the meta-analysis is going to be more reliable, just because you have more studies on board. But, in principle, science depends on cumulative knowledge. And that's a very basic premise for science--that we are looking at the totality of evidence. And the totality of evidence is going to tell us something more compared to any single study. Now, a single study may be the best among the lot. But even then, these other studies can still tell us something, because they map a universe; and even their deficiencies are interesting to note. So, what can a meta-analysis do? One school of thought treats it as a way to give you the definitive answer--which I think is untenable, because there's hardly ever a definitive answer; we're just trying to approximate closer to the truth, whatever the truth is. 
The second approach, second school of thought, is that it is a tool to look at the cumulative evidence and be able to compare studies and see patterns--patterns of data, patterns of bias, or footprints of bias. And lead to some interesting hypotheses about why this pattern of data is seen. And what does it mean, and how could we fix the problem if that seems to be the footprint of some problem that is causing this pattern of bias?


Russ Roberts: I bring it up because there are so many areas now in social psychology where a result that someone might have questioned in the past and was told, 'Are you kidding? The evidence is overwhelming. There have been dozens of studies that show blah, blah, blah.' Or blah, blah, blah could be priming, or whatever it is. And then, it turns out all of those dozens of studies have small sample sizes, and it turns out none of them, perhaps, replicate with a large sample. And, I bring that up because there is a growing use of meta-analysis in economics. The issue that's been brought up recently that I think is extremely important is this issue--it's not literally economics but it's work that's been done by economists--is whether deworming a population in a very poor country is going to help their economic future. You take the children, you deworm them; and a study was done by some economists, Michael Kremer and others, who found that it was fantastic. This generated a lot of deworming, and a lot of money in the Effective Altruism movement to be donated toward deworming. And then a meta-analysis was done, and it found no effect. Now, the people who were in favor of deworming responded by saying, 'Oh, those are bad meta-analyses.' So, it's complicated; it's hard to figure out how the world works, as you and I, I think, both know.

John Ioannidis: Yeah. I think that a meta-analysis has some validity and some problems. It has to be seen on a case-by-case basis, in terms of whether the validity is more than the problems. And meta-analysis is not going to fix a literature that is flawed. If every single study is flawed, you will get a flawed result from the meta-analysis. But, you can still get a sense of what is the impact of these flaws and what is the impact of these problems in design, and how they comparatively affect the results of different studies. So, it's a wider picture. And in that way, I think it is useful: even if the result is not accurate and it's not credible, it is useful to see what that universe of studies looks like. Sometimes I see meta-analyses where it's very obvious that all the studies are completely flawed, but just by looking at that universe of studies you can really get a better understanding of what is going on. While, looking at a single study or a single observation, it's not so easy to decide.

Russ Roberts: Yeah; I brought it up more as a digression, not so much as an indictment of your survey. Because the fact that you use meta-analysis--you're not claiming you found the truth, here. We're interested in using these existing meta-analyses to understand broad patterns in the empirical economics literature.

John Ioannidis: Right. And you are looking at comparative patterns. So, you are basically asking: Larger studies, how do they compare to smaller studies? That's a very basic pattern that you can address pretty much across any topic. And it's not dependent on what is the exact question being asked.


Russ Roberts: Now let's look at the empirical finding in your work, that you mentioned: that the average, I think you said, the average level of power in these studies was 18%. Most listeners won't know what that means. I only know what that means because I've been getting ready for this interview; and I confessed to you before we started the interview that, though I was trained as a Ph.D.--got a Ph.D. in economics at the U. of Chicago--I never heard that phrase, 'power,' applied to a statistical analysis. What we did--and I think what most economists, many economists, still do--is: we had a data set; we had something we wanted to discover and test or examine or explore, depending on the nature of the problem. And our goal was to find a t-statistic that was greater than 2. Which is technically a measure of what's called statistical significance. And statistical significance--meaning a p-value of 0.05 or smaller--most, if not all--not all, but most--published results in many fields using econometric or statistical analysis in a multivariate way, meaning multiple variables trying to explain the pattern in a dependent variable, have to get across that hurdle. You have to get a p-value of .05 or less. It has to be statistically significant. And when you do that, it's golden; and you can publish it, in theory. Not every time, but you've got a shot. If you don't find it, you're not likely to be able to publish it. And so, I think most economists today know a lot about that--though we might not define it exactly correctly; I struggle with it sometimes myself. So, that's on the one hand. I'm going to let you describe it. So, describe statistical significance at the .05 level. What does that mean?

John Ioannidis: So, I think that we have to be a little careful here, because we didn't really make assumptions about statistical significance at the .05 level here, for these meta-analyses. What we tried to ask is: What is the power of a study to be able to get a result that would cross that level of statistical significance at the 0.05 level, if the true effect out there is x? And now the question is: How do you know the true effect? I mean, nobody really knows the true effect. There are different ways to approximate it, and one way to approximate it plausibly is to say that: Well, maybe if you consider all the evidence, then the true effect is best represented, or best approximated, by all the evidence. That's the best shot that we can have. A second approach would be to look at what are the effect-sizes in the largest studies, and then the question is to define what exactly do we mean by the 'largest' studies? And one approach is to look at, for example, the top 10%--the 10% of the reported estimates from the most precise studies, the ones that have the least uncertainty in their estimation. The other is to take the top one, which is the most precise of all--so, the largest study, in a sense, the one that has the least uncertainty. And the third is a more sophisticated approach, which we call PET-PEESE--Precision Effect Test-Precision Effect Estimate with Standard Error. Which basically is a regression; and it tries to, in a way, estimate what would have been the effect if you go towards an infinite-sized study. So, it's extrapolating from what we have to the ideal, very large study: What would it look like? So, there's different ways to approach what might be plausible effect sizes. And then, you ask what is the power to detect these plausible effect sizes. 
Power, meaning: If that effect is there, how likely is it that, with the type of sample size that I have in a given investigation, in a given design, I will be able to get a statistically significant result--a p-value less than 0.05? And this is what the power is, practically speaking.
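The power calculation Ioannidis describes can be made concrete with a small simulation: fix a true effect, draw many hypothetical studies of a given size, and count how often they cross the 0.05 threshold. This is an editorial illustration, not anything from the episode or the paper; the two-group design, unit variances, z-test, and the effect size of 0.2 standard deviations are all assumed for concreteness.

```python
import math
import random

random.seed(0)

def normal_cdf(x):
    # Standard normal CDF via the complementary error function.
    return 0.5 * math.erfc(-x / math.sqrt(2))

def estimated_power(true_effect, n_per_group, alpha=0.05, sims=4000):
    """Fraction of simulated two-group studies (unit-variance z-test)
    that reach p < alpha when the true mean difference is `true_effect`."""
    se = math.sqrt(2 / n_per_group)
    hits = 0
    for _ in range(sims):
        # Each group mean is drawn directly from its sampling distribution.
        mean_a = random.gauss(0.0, 1 / math.sqrt(n_per_group))
        mean_b = random.gauss(true_effect, 1 / math.sqrt(n_per_group))
        z = (mean_b - mean_a) / se
        p = 2 * (1 - normal_cdf(abs(z)))
        hits += p < alpha
    return hits / sims

print(estimated_power(0.2, n_per_group=30))   # roughly 0.12: badly underpowered
print(estimated_power(0.2, n_per_group=500))  # roughly 0.88
```

The same modest effect that a 30-per-group study detects only about one time in eight becomes reliably detectable at 500 per group: power is a property of the design, not of the truth.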

Russ Roberts: Let's do that again. Let's say that again. So, let's try to put it in the context of an actual empirical question that might be examined in economics. One of the ones you mentioned in the paper is the impact of a minimum wage on employment. And a caveat: Of course, there are many other aspects and impacts of the minimum wage besides whether you have a job or not. It can affect the number of hours; it can affect the training you receive; it can affect the way you are treated on the job. And it bothers me that economists only look at this one thing--this 1-0 variable, job-or-not. Number of jobs. Without looking at the quality, outside of the monetary, financial aspect. But, that's what we look at, often. And it is the central question in the area of minimum wage policy: Does it reduce--or even potentially expand, which I think is crazy, but okay, a lot of people don't agree--the number of jobs? Now, in such an empirical analysis of the minimum wage, how would you describe the power of that test? Meaning, there's some effect that we don't know of that impact. The power is--fill in the blank--the probability that?

John Ioannidis: Right. So, for that particular question, the median power that we estimated, if I recall, was something like 8 or 9%.

Russ Roberts: It is. I looked at it; I've got it right here. It is 8.5%.

John Ioannidis: There you go.

Russ Roberts: That means--so, what does 8.5% mean, in that context?

John Ioannidis: It means that, if you estimate for each one of these studies that have been done, what are the chances that they would have found that effect? That they would have found a statistically significant signal, if the effect is what is suggested by the largest studies, for example? Their median chance would be 8.5%. So, 50% of the studies would have 8.5% chances or less to be able to detect that signal. Which is amazing. I mean, if you think of that--

Russ Roberts: It's depressing--

John Ioannidis: Or depressing, actually. I mean, they basically have no chance of finding that. Even if it is there.

Russ Roberts: So, does this work on both sides of the question?

John Ioannidis: It is very, very difficult for them to pick it up.

Russ Roberts: Does this work on both sides of the question? Meaning: It obviously depends on your null hypothesis. So, if your null hypothesis is: Minimum wages have no effect, and I'm going to test whether they have an effect, you are going to say: Does that mean I'm going to find that I only have an 8% chance of finding that effect?

John Ioannidis: Yeah. It would mean that even if that effect is there, you would have an 8.5% chance of detecting it.

Russ Roberts: So, most of the time, I would not find it.

John Ioannidis: So, most of the time you would find a non-significant result. Called a null result. Or, seemingly null result. Even though there is some effect there.

Russ Roberts: But it could go the other way, too. Because your null hypothesis could be that the minimum wage has an effect; and I'm testing whether there is no effect. And I might not be able to find no effect. Is that correct to go in that opposite direction?

John Ioannidis: So, what happens in the opposite direction is that when you are operating in an underpowered environment, you have two problems. One is the obvious: you have a very high chance of a false negative. Because this is exactly what power means. It means that, if you have 8% power, 92% of the time you will not be able to pick up the signal, even though it is there. So, it's a false negative. At the same time, you have the problem of having a very high risk of a false positive when you do see something that has a statistically significant p-value attached to it. And, it could be entirely a false positive, or it could be a gross exaggeration of the effect size. The smaller the power that you are operating with, if you do detect something, even if it is real, the magnitude of the effect size will be substantially inflated. So, the smaller the power, the greater the average inflation of the effect that you would see when you do detect it. So, two major problems with low power: lots of false negatives; and, second problem, lots of false positives and gross exaggeration of the effect sizes.
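The effect-size inflation Ioannidis describes can be shown in a few lines of simulation: if only the draws that clear the significance bar get reported, the reported effects are systematically too big. A minimal sketch with assumed numbers--a true effect of 0.2 standard deviations and 30 observations per group, roughly the underpowered regime under discussion:

```python
import math
import random

random.seed(1)

def mean_significant_estimate(true_effect, n_per_group, sims=20000):
    """Average reported effect among only the 'significant' simulated studies.
    Each study's estimate is one draw from the sampling distribution of the
    two-group mean difference (unit variances assumed)."""
    se = math.sqrt(2 / n_per_group)
    crit = 1.96 * se  # two-sided 0.05 cutoff on the estimate
    significant = [est for est in
                   (random.gauss(true_effect, se) for _ in range(sims))
                   if abs(est) > crit]
    return sum(significant) / len(significant)

# True effect is 0.2, but the estimates that reach significance average ~0.6:
print(mean_significant_estimate(0.2, n_per_group=30))
```

With these numbers the significant studies average roughly three times the true effect--the inflation Gelman calls a Type M error, which comes up a few exchanges later.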

Russ Roberts: Yeah, I think--

John Ioannidis: And you add a touch of bias to that, and obviously there are many different biases. But many of the biases that operate, have their common denominator that people are trying to find something rather than trying not to find something. It makes sense--

Russ Roberts: Well said.

John Ioannidis: So, someone is trying to maybe sometimes change the analysis a little bit or try another analytical mode, add some more observations or do a few more experiments or keep trying until they get the statistically significant p-value, somehow. So, if you add this sort of bias, which, based on what we have seen across multiple fields seems highly prevalent, then the rate of the false positives and the exaggeration really escalate further. And they can really skyrocket pretty quickly--

Russ Roberts: and as a result--

John Ioannidis: unless these biases are contained pretty thoroughly.

Russ Roberts: As a result, you get these dramatic papers with these huge impacts, some variable, some policy. And they are not reliable. I think Andrew Gelman calls this a Type M error, where M is magnitude.

John Ioannidis: Magnitude.


Russ Roberts: So, here's the part that's confusing for me, and I think I have some understanding of it, but I find many economists literally do not understand this at all. And certainly everyday normal human beings are going to struggle with it. So, here's the question: Say, you have a "small sample"--and of course, 'small' depends on the size of the magnitude I'm trying to measure, and all kinds of things as well. But I'm going to use that phrase. A better way to say it is that the sample is going to be underpowered. But let's just say it's small to start with, so that people can understand what I'm talking about. So, I have a small sample. Let's say I want to figure out whether men are taller than women. And so, I go out and I sample 10 men and 10 women. And, you know, I could find lots of different things in that sample. I could happen to have chosen 10 relatively short men and 10 relatively tall women. And it would look like women are taller than men. But that result--there would have to be a very big difference given the size of the sample--by definition, statistical significance is going to take account of the size of the sample. So, I might find that women are taller, but it's unlikely in a small sample that it's going to be statistically significant. Another example people use sometimes is a fair coin: If I flip a coin 100 times, I might get 55 heads. In fact, I'm going to get 55 heads fairly often out of 100 tosses. Doesn't mean the coin is biased. It's just that the sample is not large enough to measure whether the coin is fair or not. So, a lot of times then, what economists do--and psychologists as well, and other folks--when they get a small-sample statistically significant result--in other words, they find it's statistically significant, that it's unlikely these data were the result of just chance--they then say, 'Wow. 
If I found it with a small sample, just think how statistically significant it would be with a large sample.' So, when economists find statistically significant results in small samples--and the definition of small here is going to be essentially underpowered--they are going to say, without looking at the power, they are going to say, 'Hey, look how great this result is. You can't deny it because it's even true in a small sample.' And then you come along, and Andrew Gelman, and others, and say, 'Actually, it's the opposite. With a small sample, the more likely it is that what you found literally isn't true.' So, can you try to explain that intuition? Sorry for the length of the question.
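Russ's coin example can be checked exactly with a binomial tail sum. This is an editorial illustration, not from the episode; it assumes an even number of flips and uses the symmetry of the fair-coin distribution:

```python
from math import comb

def two_sided_binom_p(heads, n=100):
    """Exact two-sided p-value under a fair-coin null (n even),
    doubling the upper tail by symmetry around n/2."""
    k = abs(heads - n // 2)
    upper = sum(comb(n, i) for i in range(n // 2 + k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper)

print(two_sided_binom_p(55))  # about 0.37: 55 heads in 100 flips is unremarkable
print(two_sided_binom_p(65))  # about 0.0035: now the fairness null is in trouble
```

So 55 heads happens far too often under a fair coin to count as evidence of bias--exactly the "sample not large enough" point being made.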

John Ioannidis: Yeah. So, this is what we call the Winner's Curse. And it's pretty much the same phenomenon that I was describing earlier: if there is a signal, a true signal, to be detected, and you are running in an underpowered environment with very small studies, with very few observations, like the 10-and-10 sample that you described, then if you find it, you will find it in a way that presents itself in a much bigger magnitude compared to what it really is. Because, if it presents the way that it really is, it will not be significant. So, you will not detect it; you will not say, 'Eureka!'; you will not open a champagne bottle. But if you are lucky--or unlucky, in a way, if you have this Winner's Curse--to chance upon a configuration of the data where this is very prominent, then you will say, 'Wow. Look at that. This is fantastic. This is amazing. This is huge.' But, you know that the true effect is going to be much smaller. Now, it could be much smaller or it could be nothing at all.

Russ Roberts: So, that's the question. I understand it could be smaller. The hard part, I think, is the intuition--and I guess, just to back up for a second: I understand why in a small sample I could have a false negative. I could say, 'Yup, there's nothing there.' But, come on, you only have 10 women and 10 men; let's say they came out to be exactly the same height. You say, 'Well, I guess women and men are the same height.' That would be silly, because your sample was too small to find it; it's underpowered, and you are likely to have a false negative. But why, when I do get a significant result in that setting, is it likely to be a false positive?

John Ioannidis: So, I think that it could be either a false positive or an exaggerated--sometimes grossly exaggerated--effect, depending on how small the sample is that you were working on. It depends on what is the pattern of effects circulating across the field at large. So, if someone is working in a field where, let's say, there's a lot of prior evidence and very strong theory and other types of insights that have really guided us to pose questions where the answers to many of those are likely to be non-null effects, then you are likely to fall into the pattern of just finding an exaggerated magnitude of the effect size rather than a complete false positive. If you are working in a field where you are just completely agnostic--black box, just searching in the dark, and actually in a field where there's not much to be discovered, just tons of noise--then, practically, if it's all noise, no matter what significant results you get, they will be false positives. So, there is a continuum here. There is a continuum of different fields and different priors of how many out of 100 or 1,000 or 10,000 hypotheses that we are testing are likely to be hiding something that is genuinely non-null. And, there is a lot of variability in that regard. I think that economics is mostly operating in, let's say, middle ground. But there is a lot of variability. I think that people, for example, who go to do a very large, randomized trial that is very expensive--most of the time, I would argue, they have thought very carefully that that's not going to be a waste of money. And they have a decent sense of showing something--

Russ Roberts: --that's real.

John Ioannidis: I don't think that someone would do a randomized trial--

Russ Roberts: --that's real. They are going to [?serve?] themselves, then, trust me. But, you are saying--

John Ioannidis: Yeah. Yeah. I think that if they had a chance of 1 in a million of finding something, and they say, 'I'm going to do a trial that is going to require $50 million to run'--I don't think that that would be a good investment.

Russ Roberts: Correct.

John Ioannidis: Conversely, there's other fields where we're in a completely agnostic mode and we just ask hypotheses like crazy. And we ask millions of such hypotheses. And this is very common in big data science. And we know that the yield is going to be very low. It's like looking through a haystack where there's a few needles in there. So, these needles are few. And most of what we are going to detect is likely to be a false positive, unless we find ways to further document that what we have found is really true. Which means, typically: doing more such studies; having very stringent statistical significance thresholds; requiring very stringent replication to see it again and again. And then we can say, 'Well, now, that's true.' So, there's a continuum. And each field is operating at a different point within that continuum. Most of economics research, I would dare say, is operating somewhere in middle values of that continuum--so, not completely agnostic, and not very high prior. But there is a range; and different studies may be at higher or lower levels within that range.
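The continuum of priors Ioannidis describes has a simple Bayesian arithmetic behind it, familiar from his 2005 paper on why most published findings are false: the chance that a significant result reflects a real effect depends on the prior, the power, and the significance threshold. A minimal sketch that ignores bias entirely--the particular prior and power values below are assumed for illustration:

```python
def post_study_probability(prior, power, alpha=0.05):
    """P(effect is real | statistically significant result), ignoring bias:
    expected true positives over all expected positives."""
    true_pos = power * prior
    false_pos = alpha * (1 - prior)
    return true_pos / (true_pos + false_pos)

# An 8.5%-power study of a long-shot (1-in-100) hypothesis:
print(post_study_probability(prior=0.01, power=0.085))  # ~0.017: almost surely a false positive
# A well-powered study of a coin-flip-plausible hypothesis:
print(post_study_probability(prior=0.5, power=0.8))     # ~0.94
```

The same significant p-value means very different things at the two ends of the continuum, which is why a stringent threshold and replication matter most in the agnostic, haystack-searching fields.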

Russ Roberts: To come back to this question of this intuition of discovering a result that's probably spurious--a false positive or a large false positive: The way I would read your perspective on this is that there are two sources of that mistake. One is just noise. Sometimes you are just going to draw from the urn of life a particularly unrepresentative result. But the other is publication bias--that 'I'm going to keep changing my specification, adding variables, changing the sample,' etc., to make sure that I can get a published result; and I'll strangle the data until it screams. In which case I would get that statistical significance. And I assume it's both of those working together. It's not just one or the other.

John Ioannidis: Absolutely. And, there can be different terms about what you just called publication bias. I tend to use the term, 'significance chasing,' or 'significance chasing bias,' or 'excess significance bias.' But, there's so many terms that have been coined in different fields. Trying to describe pretty much the same phenomenon--

Russ Roberts: P-hacking--

John Ioannidis: People have seen that this is--p-hacking is a very popular term in psychology and other social sciences. But, it's just a fact that people have seen that this is a major problem, and have coined these different terms to try to describe it.


Russ Roberts: So, when you used the metaphor of a needle in a haystack--that there might only be a couple in a big data set--actually, I think, maybe, a different metaphor is that there's an infinite number of needles: There's all these correlations that can look significant in a data set of large size. And most of them are not meaningful--that is, they are not going to replicate; they are just the product of randomness. Is that--would that summarize--that summarizes my worry about big data. What do you think about it?

John Ioannidis: So, yes. I mean, I probably wouldn't use the term 'needles' to describe this, because needles would mean that they are true. But in a universe of big data, you are entering an environment that has the opposite problem of what we were describing in these meta-analyses, which belong mostly to the past--well, they belong entirely to the past. Most of the studies in the past were small studies. They were underpowered; they were at risk of these false positives, and false negatives, and exaggerated results. Now, we have more and more big data studies, which are over-powered, and where, again, just testing with the typical statistical tools that we have, nominal significance means close to nothing. It's likely that any analysis will be statistically significant one way or another. And then you don't really know. Then, statistical significance has very little discriminating ability to tell you which ones are the real needles and which are just flukes.
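The over-powered side of the problem is easy to see by holding a practically negligible effect fixed and letting the sample size grow. This is an editorial sketch, not from the episode; the 0.01-standard-deviation effect, the two-group unit-variance design, and the sample sizes are assumed for illustration:

```python
import math
import random

random.seed(2)

def simulated_p_value(true_effect, n_per_group):
    """Two-sided p-value from one simulated two-group comparison
    (unit variances), drawing each group mean from its sampling distribution."""
    se = math.sqrt(2 / n_per_group)
    mean_a = random.gauss(0.0, 1 / math.sqrt(n_per_group))
    mean_b = random.gauss(true_effect, 1 / math.sqrt(n_per_group))
    z = (mean_b - mean_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # = 2*(1 - Phi(|z|))

# A 0.01-sd effect is invisible in a small study but 'highly significant' at scale:
print(simulated_p_value(0.01, n_per_group=50))
print(simulated_p_value(0.01, n_per_group=2_000_000))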

Russ Roberts: So, for all graduate students and professors listening to this in economics, and any other field: When I now go to an empirical presentation, or a presentation of an empirical paper, I ask with a straight face, 'How many regressions did you run?' You know, at the end, I get a table. And the table has got all these asterisks. And the asterisks are all significant at the .005 level--you know, it's just full of significant results. And I say, that's lovely. But, how many regressions did you run? And it's such a startling question, the couple of times I've had a chance to ask it. They don't answer it. It's not because they are embarrassed. It just never crossed their mind. It's not even a question. So, the problem, I think, in our field and others--epidemiology being another example--is that there are so many opportunities in the kitchen to do, whether it's p-hacking or what Gelman's called the Garden of Forking Paths: I have so many decision nodes to try different things. And unless you watch the videotape of how the food was prepared, you have no idea if it's safe or not.

John Ioannidis: Exactly. And, much of the time you cannot even count them. So, there are some situations where at least you can count them. Like genetics, for example. You can count how many genetic variants you are testing. You know--if you are honest to yourself, and to others--you know that you are testing 10 million variants, and you know what their correlation structure is. And you can use a formal correction for that: either just a multiplicity correction, or some other way with a false discovery rate, or something equivalent, that will take care of the exact multiplicity burden that you have. In many other situations we don't really know exactly how much multiplicity we are dealing with. I mean, we are probably fooling ourselves, because we are going down that garden of forking paths, and we lose count down the path: how many nodes did we meet, and how many options were there at each node? And how many choices did we make? And, many of these choices could be even subconscious. Or a mild, modest modification of one analysis versus the original one. So, it's very difficult to estimate the exact multiplicity burden in that case. You know it's there. But, you can't really put a number on it. You can't really use some direct method to correct for that multiplicity.
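The formal corrections Ioannidis mentions for the countable case can be sketched in a few lines--a plain Bonferroni bound and the Benjamini-Hochberg false-discovery-rate procedure. The example p-values are invented for illustration:

```python
def bonferroni(pvals, alpha=0.05):
    """Reject hypothesis i when p_i < alpha / m; controls the probability
    of even one false positive across all m tests."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; controls the expected fraction
    of false positives among the rejections (the false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears its stepped threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected

pvals = [0.001, 0.008, 0.012, 0.041, 0.2, 0.6]
print(bonferroni(pvals))          # [True, True, False, False, False, False]
print(benjamini_hochberg(pvals))  # [True, True, True, False, False, False]
```

With 10 million variants instead of six p-values, the Bonferroni threshold drops to 0.05/10^7 = 5e-9, which is the arithmetic behind the very stringent genome-wide significance thresholds he alludes to.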


Russ Roberts: So, if you're giving advice to a young scholar in any of the fields we are talking about--and I guess one could argue it's every scientific field, in a certain dimension. But, let's talk about observational studies, as opposed to randomized control trials; they have their own separate sets of issues. People who are doing what has been, for, I don't know, 80 or 90 years now, classical statistics. And I'm a skeptic. Right? I've carved out that niche. And it's a dangerous niche. Because, if you're not careful, you just reject everything. You say, 'Oh, we can't know any of this stuff.' And that's obviously not true. And I don't really believe that. But I am highly skeptical of these observational studies. Should I be? When someone presents me with a result? What should I, as a practicing economist or practicing epidemiologist--what advice would you give us for trying to figure out what's true?

John Ioannidis: Well, I would probably go back and ask: Does an observational design have any real chance of giving us some reasonably decent, reliable answer here? And there may be many situations where it could give an answer that is fairly reliable. I mean, it's unlikely that it will be conclusive and definitive--in a way, nothing is 100% definitive. But at least high enough on that scale of being definitive that you can take it to the next step. There are some situations where, when you just think about what are the odds of getting it right, maybe some designs are just not to be used. You should not use them. You should just abandon them. For some types of questions. To give you one example, we have performed hundreds of thousands of studies trying to look at whether single nutrients are associated with specific types of disease outcomes. And, you know, you see all these thousands of studies about coffee, and tea, and all kind of--

Russ Roberts: broccoli, red meat, wine--

John Ioannidis: things that you eat. And they are all over the place, and they are always in the news. And I think it is a complete waste. We should just decide that we are talking about very small effects. The noise is many orders of magnitude more than the signal--if there is a signal. Maybe there is no signal at all. So, why do we keep doing this? We should just pause, and abandon this type of design for this type of question.

Russ Roberts: We'd like to know. And that desire to know is so strong.

John Ioannidis: Of course. Of course. But, to know, we need to use the right design. So, I would argue, for this type of question, where the error is 50 times bigger than the signal, we need to find designs that minimize the error. And our best chances in these cases, if we still believe that it's important to know, would be randomized trials--or at least experimental designs that minimize confounding and minimize error as much as possible. Even those may not be able to get us an entirely definitive answer. I'm not saying that they are a panacea. But, at least we know that we are not starting completely off base. Even knowing that we will get a [?drunk?drone?drown?] no matter what. There's other cases where observational designs may be very useful, and very illuminating. There's sometimes effect sizes that are big, and situations where we can have a pretty good understanding of what the confounders might be, and what is really influencing what. And, in that case, they definitely have a role. So, we never got a randomized trial to prove that smoking causes cancer. But, smoking increases the risk of cancer 20-fold, as opposed to the 1.001-fold that many of these nutrients do. So, I would never argue that we need a randomized trial to prove that smoking is a bad thing for us. It has to be seen on a case-by-case basis. But, there is a lot of observational research that is really going beyond the performance characteristics that are being used. And I'm not sure that this is a good investment. One could always say that I do this for exploratory purposes and just to get a preliminary insight. But, I worry that much of the time we just don't get any preliminary insight, and even, these data that emerge are just biasing our thought.

Russ Roberts: Yeah, I agree.


Russ Roberts: I want to go back to Big Data for a minute, and just a general question in how one should think about empirical work. A lot of younger economists have told me that, 'Theory is over-rated. We just need to look at the data and see what the data say.' And, 'The data will speak.' What's your thought on that? And that's part of--by the way--the appeal of machine learning and Big Data, is that, 'Our theories are imperfect, so we'll just see what the reality is,' is the way they, I think, think about it. What's your thought on that?

John Ioannidis: Well, I'm not saying not to look at big data. But looking at big data, you see the patterns in the big data. This is not the same as saying that you see the truth, or that you see causal effects, or that you see the answer to important questions. You see patterns. I am very eager to do that; and I do waste a lot of my time looking at patterns in big data. But I want to be honest with myself that I am just looking at patterns. I'm not looking at the final frontier. And these patterns are sometimes very difficult to interpret; and based on different theory, they would be interpreted very differently. So, I don't think that we have the end of theory; I don't think we have the end of statistical testing, by any means, either. But, big data have to be seen with a lot of caution. I think that we really need proofs of principle that these sorts of analyses eventually do help and are useful. So, it's not just an issue of, is it true or not, but also an issue of: Does it help, and can you build, for example, policy and decision-making on it? And, to be honest, I have seen very few examples where you can build reliable policy and decision-making based on Big Data. I mean, you can probably mislead your policy very easily with Big Data; and you can mislead in any of a gazillion ways that you may want. But, I would like to see more concrete examples where that would really be helpful. For the time being, I see it more as exploring an interesting space: learning about the data, learning about the patterns, learning about their errors, their biases; how we can fix some of these errors. So, it's like a machine that is still to be probed, to try to see what we can make out of it.


Russ Roberts: So, given your skepticism about many research designs and the nature of the complexity of the world, one of the issues that I struggle with is that people then assume I'm against science. I know, you are laughing out loud. But they say it about me all the time. And I also make the argument that very few--maybe zero--questions in economics have been settled by a single great study. And I think that's true of science generally, by the way--it's not an economics problem. Empirical work tends to build up over time. But, even in economics, there's always a loophole. There's always a way to say, 'Oh, yeah, but that was after the war. You see, after the war...'--there's always something. We don't typically do what I would call real science with experimental, randomized control trials; and even the ones that we call 'randomized control trials'--they are subject to the location. They are subject to the context. They are subject to the way the instructions were given. So, I'm just--I'm overly skeptical, which, again, I concede may be a flaw. But I don't believe that evidence or facts are irrelevant. I do believe I've changed my mind about lots of things. It's just not when I open up a study in Econometrica and go, 'Well, I guess I was wrong.' How do you handle that? Do you get a lot of that, or not?

John Ioannidis: So, I think that there is a risk that you may get pushback from people saying that if you disseminate a picture of science getting it wrong, and having so many problems and so many biases and so many difficulties, then you may offer ammunition to people who say that science is not worth it. And, of course, this is a risk. But, at the same time, in a way, this is the way that science works. I mean, science is not working with dogma. It is not working with absolute truth. It's working with some healthy skepticism. It's working with the desire to reproduce and replicate what we see, and to document it very carefully, to diminish biases, to improve methods. So, this rational and to some extent skeptical thinking is at the core of the scientific method. I don't think that we should abandon the scientific method, or distort the scientific method so as to send the message that science is perfect, because that's not what it is about. It's a very difficult endeavor. It's fighting and struggling with errors and biases on a daily basis, trying to do our best and get as close to the truth as possible. I think also that if we go along with the narrative of 'Science is perfect,' then whenever you have these debates and contradictory data and big promises that are not fulfilled, science becomes a very easy target for the wrong reason. And people say, 'You promised me that,' or, 'You told me that, and now this is not so.' And we have not really made any cautious announcement ahead of time that, 'Well, we don't know that with perfect certainty'; 'We know that this is maybe 60% likely to be true, but there's a 40% chance of error.' Maybe there's a 70% chance of error. Unless we are accurate about our level of uncertainty, I think we will run into trouble. And I think we are running into trouble. And, in medicine, we see that all the time.
You can have just a single paper that got it wrong--like The Lancet publishing a paper that the MMR [Measles, Mumps, Rubella] vaccine causes autism. And then you have hundreds of millions of people who don't want to vaccinate their children. And we're heading back to the Middle Ages. And the problem started from getting it wrong, and from not having conveyed the message that we could get it wrong--that, you know, some of our papers in our top journals could be wrong. And that one was not just wrong; it was more than that: it was actually fraud--which is not so common. So, how do we give an accurate picture of what science is? Which, to me, is the best thing that has happened to Homo sapiens sapiens. But it's difficult. And it does have errors and biases; and that's what we're struggling with every day.

Russ Roberts: Well, I interviewed Adam Cifu, who is, you know, co-author with Vinayak Prasad of the book Ending Medical Reversal. And what 'medical reversal' is, is this: a study comes out saying, 'This is good,' or 'This is bad,' and people take it--'Well, it's peer-reviewed, so therefore it must be true'--but that's an observational study; and when they go and do the randomized control trial, they find out that the result is the opposite: you shouldn't do that technique, or you should do something else. And I think it's in fact an extraordinary thing, given our powers of reasoning, our love of science, and our statistical sophistication, that we have so many false positives and false negatives. It seems like a big challenge for us to overcome.

John Ioannidis: Mmhmm. Mmhmm. Yep.


Russ Roberts: Now, a lot of people suggest that we should change the level of statistical significance. It's funny--there's no law--there is a law; there's no legislation, as we make that distinction here. It's a norm that 0.05 is the right amount. What do you think of that as a way forward--'We should be more demanding. We should just set a higher hurdle for people to get statistical significance.' And then we have people like Andrew Gelman who have said we should just stop talking about it completely. What's your thought on that?

John Ioannidis: So, I was one of the authors of the paper that suggested moving the traditional threshold from 0.05 to 0.005--so, adding an extra zero. And I see that as a temporizing measure. I don't see it as a perfect fix. I think that in many--most--circumstances, actually, using statistical significance with p-values is not the best way to approach scientific questions. In a few cases it is--maybe, I would say, in the fields that I am working in, which are mostly biomedical but not necessarily so, about 20% of the time, null hypothesis significance testing would indeed be the way to go. The other 80%, not at all--or a very distant second or third choice. Why did I co-author that paper? The reason is that we are living in a situation where we have a flood of significance. So, that extra zero is like placing a dam to avoid death by significance--you know, drowning in significance. It's a temporizing measure. Would it solve all the problems? No. But, probably, from what we have seen across different fields, on average about 30% of those false positives would no longer be false positives, because they would fall in that borderland between 0.05 and 0.005.
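[A rough way to see what the extra zero buys, under the simplest possible assumptions: simulate many studies of a purely null effect and count how many of the 0.05 false positives would fail the 0.005 bar. This is an illustrative sketch, not anything from the episode or the paper; the sample size, study count, and z-test are all assumed for demonstration.]

```python
import math
import random

def two_sample_p(n, rng):
    # Two groups drawn from the SAME distribution: the true effect is zero,
    # so every "significant" result here is a false positive.
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    z = diff / math.sqrt(2 / n)          # known unit variance in each group
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

rng = random.Random(0)
pvals = [two_sample_p(30, rng) for _ in range(10_000)]
fp_05 = sum(p < 0.05 for p in pvals)    # false positives under the old bar
fp_005 = sum(p < 0.005 for p in pvals)  # survivors under the stricter bar
print(fp_05, fp_005)  # roughly 5% and 0.5% of 10,000
```

[By construction, about nine in ten of the null "findings" that clear 0.05 land in the 0.005-0.05 borderland and would be stopped by the stricter threshold--the dam Ioannidis describes.]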

Russ Roberts: But you are assuming that the authors wouldn't have tried harder.

John Ioannidis: Well, but then the question becomes: once you have that dam in place, authors will be p-hacking around the new standard. So, instead of trying to pass 0.05, they will be doing their best to pass the 0.005 threshold. But this becomes a bit more difficult for them; and with the sample sizes currently circulating in most scientific fields, it's not going to be easy. When they do make it, then the bias will be worse: the average inflation--the exaggeration of the results--will be even greater. But there will be fewer such results. So, I see it as having some advantages and some disadvantages--probably, on average, substantially more advantages at the moment than disadvantages. But it's not the perfect fix. It's not the end of the story. I think that we need to think more broadly about replacing our statistical inference tools with more fit-for-purpose [?] tools, and also moving to the design phase of research: designing studies that have a higher chance of getting us close to the truth, with less uncertainty.
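[The inflation point--fewer findings survive a stricter bar, but the survivors exaggerate more--can be sketched the same way. Again, this is an illustration of the general phenomenon (sometimes called the "winner's curse"), not the authors' analysis; the true effect of 0.2 SD and n = 30 per group are assumptions chosen to mimic an underpowered literature.]

```python
import math
import random

TRUE_EFFECT = 0.2  # assumed true mean difference, in SD units
N = 30             # assumed per-group sample size (underpowered for 0.2)

def one_study(rng):
    # One small two-group study of a real but modest effect.
    a = [rng.gauss(TRUE_EFFECT, 1) for _ in range(N)]
    b = [rng.gauss(0, 1) for _ in range(N)]
    est = sum(a) / N - sum(b) / N
    z = est / math.sqrt(2 / N)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return est, p

rng = random.Random(1)
studies = [one_study(rng) for _ in range(20_000)]
survivors = {}
for alpha in (0.05, 0.005):
    winners = [est for est, p in studies if p < alpha]
    # (count of studies passing the bar, mean published effect estimate)
    survivors[alpha] = (len(winners), sum(winners) / len(winners))
print(survivors)
```

[Both surviving means sit well above the true 0.2, and the 0.005 survivors are fewer but more inflated--the trade-off described above.]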

Russ Roberts: What do you think about pre-registration, where a scholar would put down in writing, somewhere public, what they are going to be looking at--to reduce the p-hacking that goes on in the kitchen?

John Ioannidis: I'm very much in favor of pre-registration; and I have supported it for many years--over a decade--and in various fields. I think that it can help. I think that it has helped in some domains, like clinical trials in medicine. Is it perfect? No. About 50% of trials are registered; and of those, about 50% are properly registered; and of those that are properly registered, about 50% report their outcomes; and of those, maybe 50% are well done in the other dimensions of their design. So, eventually it trickles down to smaller and smaller numbers that would be protected from various biases. But, at least, it's a step in the right direction. Can we apply it to any type of research? I don't think that this is easy to do; and I would be very happy for lots of research that is exploratory just to acknowledge that. So, if something has been obtained through a garden of forking paths and zillions of analyses and extremely complex meandering paths of thinking, saying that this is pre-registered is just trying to fool others and fool ourselves. What should be conveyed about this research is that it was entirely exploratory--extreme data-dredging at its best; and that's fine, provided that we know that this is what it was. And then, at a second stage, someone could pre-register a study that follows that exact same meandering recipe that emerged from that exploration.

Russ Roberts: Different data set, time period.

John Ioannidis: Different data set, different study. Now that you have this very peculiar combination of choices and design and analysis, okay, 'That's what you got; let's try to repeat it and see whether it works.'


Russ Roberts: Anything else you'd like to recommend to editors or young academics for how to make this problem get better? Any policy changes you're in favor of?

John Ioannidis: I think that there's not one solution that would fit all. There are over a dozen families of solutions that are being discussed, and some of those I have reviewed in some of my recent papers. In a way, some of these solutions could be complementary, or they could co-exist. And one may help another. So: creating a replication culture, pre-registration, data sharing, protocol availability, better statistical methods, fit-for-purpose statistical methods, stronger and more stringent thresholds, different types of peer review, more openness in peer review, more transparency--all of these have a lot to share. So, sharing data can facilitate peer review. It can facilitate replication. It can facilitate team science. It may make pre-registration more plausible. There is a very high correlation between these ideas; and eventually these ideas will work if we have multiple stakeholders who believe that they are worth adopting. It's very difficult for a single scientist to just go out there and say, 'I'm going to do it differently than all of you.' It's very difficult for a single journal to do that. It's very difficult for a single institution to change its practices. But, if people recognize that this is a good idea, and you have multiple journals, multiple institutions, multiple funders, multiple scientists who believe that this is the way to go, then we do see change. So, for example, registration of clinical trials had been out there as a possibility for 30 years; but it was not really happening until all the major medical journals said, 'We're not going to publish your trial unless you have pre-registered it.' And then funders also joined. And then everybody wanted to do it, because they wanted to have their papers published in the best journals. And the same applies to other fields. Economics has made tremendous progress over the years in terms of some of these transparency practices.
Especially the best journals have adopted several of these practices.

Russ Roberts: So, I want to apologize to you. I think I first heard about your paper, "Why Most Published Research Findings Are False," from Nassim Taleb--I'm guessing; I'd have to go back and check. And I thought, 'Well, that's ridiculous. That's just silly. What kind of a paper is that?' And it was a theoretical paper. It wasn't like you went around and re-measured things and showed that they had been mismeasured. It's a very interesting paper, actually, obviously, and a very provocative one. So, my apology is that there is a lot more to it than I had thought from the title. But my question for you is this: You say you want to be constantly reminded that you know next to nothing. You write a paper like that; and then Brian Nosek and his team in psychology find that only about 40% of the top papers in psychology in the last 10 years replicate. You must feel pretty smart. So, how do you keep your humility?

John Ioannidis: Oh, goodness--

Russ Roberts: It's a trick question. Sorry.

John Ioannidis: There is so much potential for making mistakes and errors--and, you know, for finding biases, or not knowing about biases that you have in your own work--that some humility is indispensable. I think that this is what's really interesting and nice about science--that there's no end to revealing how many mistakes you can detect and fix. And saying that I have detected the final mistake and now I am doing perfect research--that's very presumptuous. So, I'm trying not to forget that. And I'm trying to keep reminding myself that maybe all of my work is wrong. Who knows?

Russ Roberts: Well, what are you working on? You took on economics lately. What else are you working on?

John Ioannidis: So, as part of the work that we are doing at the Meta-Research Innovation Center at Stanford, the big privilege is that we can work across very different types of domains. And I'm surprised and excited to see that many of the problems that we have seen in biomedical fields are not confined to those fields. They occur in very different areas. So, we have a great network of collaborators, and I really enjoy working with people who are not in my core fields, because they can really teach me about what is going on in their field and what the issues are. So, my collaboration with Tom [T. D. Stanley?] and with Hristos Doucouliagos on that paper was really fascinating for me, because obviously I'm not an economist. And getting to know that literature from an insider's view was really fascinating. At the moment I'm working on appraising biases and trying to test out solutions in very different fields. And there's really no end to it. I think that there's a lot of exciting work happening in psychology and the social sciences. Economics as well--it has some very exciting leads at the moment. There are a lot of questions on big data; on registration of different types of studies; on new designs for randomized trials; on the advantages and disadvantages of experimental design versus observational data; on pragmatism; on how you differentiate between credibility and utility in research; on implementation issues of research practices; on reward systems and incentives; on trying to network different universities and the leadership of universities and funding agencies, and re-discussing how they prioritize rewarding and promoting and funding scientists. So, I feel a little bit like a kid in a candy shop. There are so many things going on. And all of that is just so exciting.

Russ Roberts: Well, as an economist--all of what I would call the nuts and bolts of good science: transparency, ideas of registration, survey and research design, experimental design--these are all really, really important, and it's important to try to get them right. I would just suggest that it's hard to get them right in a world where we as academics can now make a large sum of money and get on the front page of the New York Times--which is still a lot of fun--and where the institutions we work for really like that. So, as long as that's there, your big challenge--and I salute you for taking it on--is: How do you fight against that fundamental incentive? We have this romance about our task, that we are just truth-seekers. But we are also human. And those financial incentives have changed so much over the last 50 years for mainstream members of economics and other fields.

John Ioannidis: Mmm-hmm. Well, there are clearly some incentives that are misaligned. But the question is: How can you really realign them? And I don't think there's anything wrong, necessarily, with financial incentives. It's just an issue of: How do you get them to work for you and for better science, rather than for more short-term gains?