Intro. [Recording date: December 21, 2017.]
Russ Roberts: Before introducing today's guest, I want to encourage listeners to fill out our annual survey voting for your favorite episodes of 2017, as well as telling us about yourself and your listening experience at EconTalk. So, please go to econtalk.org and in the upper left-hand corner you'll find a link to the survey. Thank you for a great year. And I hope to make 2018 better.
Russ Roberts: My guest is John Ioannidis.... His 2005 paper from PloS [Public Library of Science] Medicine, "Why Most Published Research Findings Are False," has been the most accessed article in the history of PloS, with 2.5 million hits. And he claims, or perhaps concedes, that he loves to be constantly reminded that he knows next to nothing. An attitude that I try to embrace as well. He is also the author, along with T. D. Stanley, and Hristos Doucouliagos, of a recent paper in The Economic Journal, titled "The Power of Bias in Economics Research," which is going to be our main subject for today, although I'm sure we'll get into many other things. John, welcome to EconTalk.... Now, we are in a very interesting time for science and social science. And it's been a subject, what sometimes is called the replication crisis in Psychology, and now spreading into other fields. It's been a frequent topic on this program. It comes down to the fact that--going back to your 2005 paper--a concern that indeed most published results are false. Let's talk about what you've been examining recently in the economics literature in your recent paper and what you found.
John Ioannidis: So, in the paper that we just published, we looked at all the meta-analyses on economics literature topics that we could identify, and we found 159 topics that had been subjected to such evidence synthesis of whatever data had been available. That included more than 6700 empirical studies, and about 64,000 estimates of economic parameters. And, we basically tried to use these evidence syntheses--the meta-analyses--as a tool to understand, first of all, how big are the studies done? How well-powered are they to detect, kind of, average, typical effects that might be circulating out there? And, also, what would that mean in terms of estimating the potential for bias that could be generated? So, how different would the results be if one were to focus on well-powered studies as compared to the full mix of all studies that were available? So, we saw a pattern that we have seen in other fields, in a sense that the footprint of the economics literature that we analyzed was pretty similar to the footprint of neuroscience literature. Even though economics has very little to do with neuroscience, apparently. They both have the same pattern of using mostly small studies and underpowered studies. And having pretty similar patterns of bias. So, we estimated a median statistical power of about 18%. And we also found that it is plausible that the large majority of the reported effect sizes in these topics are substantially exaggerated. It's very common to see an exaggeration of 2-fold. And about 30% or a third would be inflated by 4-fold. So--
Russ Roberts: Well, let's back up a little bit for listeners who aren't as versed in these kind of issues as some might be. I want to talk about two issues, one very basic, and then, one quite--it's also basic but it's quite challenging. So, I want to start with the meta-analysis. So, when you looked at 64,000 estimates, you were not personally combing through hundreds and hundreds and hundreds and maybe thousands of papers and looking at, say, to take one example, the impact of the minimum wage on employment in 132 different studies. What you did is you took one study that looked at all these studies, and used that as your basis. So, meta-analysis is often justified as a way to avoid the problem of: 'Well, it's just one study. So, we're going to not just look at one study and not claim it's the best one or the one that confirms my biases.' But rather, 'I'm not going to just look at this one. I'm going to look across the whole literature and look at an average effect.' And you were doing this across--you were looking at these kind of meta-analyses across all kinds of--an incredibly diverse set of economic, areas of economic research, right?
John Ioannidis: That's true. So, these are topics which someone else had already decided that a meta-analysis was worthwhile doing, and had done one. You know, at the same time, this gives the advantage of having some information that has already been collected. But we also need to reanalyze it in a standardized fashion, so that the calculations would be compatible across all these 159 topics. So, you know, you can use these data to apply the same methods for synthesis and for understanding the weights and the heterogeneity and perhaps some tests for publication bias, using the same tools and the same methods across all these topics.
Russ Roberts: And--
John Ioannidis: And the fact that there is a meta-analysis already means that we are talking about a literature that may not be fully representative of everything that exists in economics. So, these meta-analyses are predominantly of observational designs, not necessarily experimental designs. But this may not necessarily be so far off from the overall economics literature, where observational nonrandomized designs are far more common compared to experimental designs. And a second issue is that these meta-analyses pertain to topics that probably on average have more studies than the average topic that has been studied in the literature. There's probably a lot of topics where you have a single paper and nobody wants to do a second one.
Russ Roberts: Correct; and these are--
John Ioannidis: And it's not going to be the typical situation that you will see in a meta-analysis, where you have 130 studies. Even though these 130 studies will not be exactly the same.
Russ Roberts: I just want to digress for a minute, and then we'll go deeper into the issue of power and statistical significance. But to digress for a minute: Is there--what's your feeling about the--you notice I said 'feeling.' It's a somewhat subjective question. What's your feeling about the use of meta-analysis as a way to overcome this issue of, just, "it's just one study"? Isn't it--if everyone is using the same methodology and has made the same mistake, a meta-analysis isn't any more comforting than a single study. So, what do we know about--do we know anything about, say, the meta-analysis of some area of economics? Is there any reason to think that a meta-analysis is more reliable than a random draw from the individual studies?
John Ioannidis: So, I need to extrapolate probably from economics and look at the meta-analysis literature and theoretical comparisons of meta-analysis against single studies, large studies, or different types of designs in other areas of science. Because, in a way, meta-analysis has probably been under-used in economics compared to other fields. So, we could find 159 topics, while if you go to medicine, there are about 100,000 meta-analyses that have been published to date, or done. And it's a magnitude, or actually 3 orders of magnitude, more, compared to what has been done in economics. So, we have far more experience from meta-analysis in some fields compared to others. And, we know, also, more about the caveats and the strengths and the weaknesses of meta-analysis. I cannot answer in a black and white fashion that meta-analysis is always better than a single study. I mean, obviously, if you have an extremely well done study and everything has been very thoughtfully taken care of, and the way that it's run and analyzed is perfect; and on the other hand you have a meta-analysis of small, messy, horrible studies--I cannot really claim that the meta-analysis is going to be more reliable, just because you have more studies on board. But, in principle, science depends on cumulative knowledge. And that's a very basic premise for science--that we are looking at the totality of evidence. And the totality of evidence is going to tell us something more compared to any single study. Now, a single study may be the best among the lot. But even then, these other studies can still tell us something, because they map a universe; and even their deficiencies are interesting to note. So, what a meta-analysis can do is, one school of thought tries to give you the definitive answer--which I think is untenable because there's hardly ever a definitive answer; we're just trying to approximate closer to the truth, whatever the truth is.
The second approach, second school of thought, is that it is a tool to look at the cumulative evidence and be able to compare studies and see patterns--patterns of data, patterns of bias, or footprints of bias. And lead to some interesting hypotheses about why this pattern of data is seen. And what does it mean, and how could we fix the problem if that seems to be the footprint of some problem that is causing this pattern of bias?
Russ Roberts: I bring it up because there are so many areas now in social psychology where someone might have questioned a result in the past and was told, 'Are you kidding? The evidence is overwhelming. There have been dozens of studies that show blah, blah, blah.' Or blah, blah, blah could be priming, or whatever it is. And then, it turns out all of those dozens of studies have small sample sizes, and it turns out none of them, perhaps, replicate with a large sample. And, I bring that up because there is a growing use of meta-analysis in economics. The issue that's been brought up recently that I think is extremely important--it's not literally economics but it's work that's been done by economists--is whether deworming a population in a very poor country is going to help their economic future. You take the children, you deworm them; and a study was done by some economists, Michael Kremer and others, who found that it was fantastic. This generated a lot of money, including from the Effective Altruism movement, being donated toward deworming. And then a meta-analysis was done and it found no effect. Now, the people who were in favor of deworming responded by saying, 'Oh, that's--those are bad meta-analyses.' So, it's complicated; it's hard to figure out how the world works, as you and I, I think, both know.
John Ioannidis: Yeah. I think that a meta-analysis has some validity and some problems. It has to be seen on a case-by-case basis, in terms of whether the validity is more than the problems. And meta-analysis is not going to fix a literature that is flawed. If every single study is flawed, you will get a flawed result from the meta-analysis. But, you can still get a sense of what is the impact of these flaws and what is the impact of these problems in design, and how do they comparatively affect the results of different studies. So, it's a wider picture. And in that way, I think it is useful, even if the result is not accurate and it's not credible, it is useful to see what that universe of studies looks like. Sometimes I see meta-analyses where it's very obvious that all the studies are completely flawed, but just by looking at that universe of studies you can really get a better understanding of what is going on here. While, looking at a single study or a single observation, it's not so easy to decide.
Russ Roberts: Yeah; I brought it up more as a digression, not so much as an indictment of your survey. Because the fact that you use meta-analysis--you're not claiming you found the truth, here. We're interested in using these existing meta-analyses to understand broad patterns in the empirical economics literature.
John Ioannidis: Right. And you are looking at comparative patterns. So, you are basically asking: Larger studies, how do they compare to smaller studies? That's a very basic pattern that you can address pretty much across any topic. And it's not dependent on what is the exact question being asked.
Russ Roberts: Now let's look at the empirical finding in your work, that you mentioned: that the average, I think you said, the average level of power in these studies was 18%. Most listeners won't know what that means. I only know what that means because I've been getting ready for this interview; and I confessed to you before we started the interview that, though I was trained as a Ph.D., got a Ph.D. in economics at the U. of Chicago, I never heard that phrase, 'power,' applied to a statistical analysis. What we did--and I think what most economists, many economists, still do, is: we had a data set; we had something we wanted to discover and test or examine or explore, depending on the nature of the problem. And our goal was to find a t-statistic that was greater than 2. Which is technically a measure of what's called statistical significance. And, most statistical significance--meaning a p-value of 0.05 or smaller--and most, if not all--not all, but most published results in many fields using econometric or statistical analysis in a multivariate way, meaning multiple variables trying to explain the pattern in a dependent variable have to get across that hurdle. You have to get a p-value of .05 or less. There has to--it has to be statistically significant. And when you do that, it's golden; and you can publish it, in theory. Not every time, but you've got a shot. If you don't find it, you're not likely to be able to publish it. And so, that, I think most economists today know a lot about that--though we might not define it exactly correctly; I struggle with it sometimes myself. So, that's on the one hand. On the one hand, we're saying--I'm going to let you describe it. So, describe statistical significance at the .05 level. What does that mean?
John Ioannidis: So, I think that we have to be a little careful here, because we didn't really make assumptions about statistical significance at the .05 level here, for these meta-analyses. What we tried to ask is: What is the power of a study to be able to get a result that would cross that level of statistical significance at the 0.05 level, if the true effect out there is x? And now the question is: How do you know the true effect? I mean, nobody really knows the true effect. There are different ways to approximate it, and one way to approximate it plausibly is to say that: Well, maybe if you consider all the evidence, then the true effect is best represented, or best approximated, by all the evidence. That's the best shot that we can have. A second approach would be to look at what are the effect-sizes in the largest studies, and then the question is to define what exactly do we mean by the 'largest' studies? And one approach is to look at, for example, the top 10%--the 10% of the reported estimates from the most precise studies, the ones that have the least uncertainty in their estimation. The other is to take the top one, which is the most precise of all--so, the largest study, in a sense, the one that has the least uncertainty. And the third is a more sophisticated approach, which we call PET-PEESE--Precision Effect Test-Precision Effect Estimate with Standard Error. Which basically is a regression; and it tries to, in a way, estimate what would have been the effect if you go towards an infinite-sized study. So, it's extrapolating from what we have to the ideal, very large study: What would it look like? So, there's different ways to approach what might be plausible effect sizes. And then, you ask what is the power to detect these plausible effect sizes. 
Power, meaning: If that effect is there, how likely is it that with the type of sample size that I have in a given investigation, in a given design, I will be able to get a statistically significant result, with a p-value that is less than 0.05. And this is what the power is, practically [?]
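The definition of power given here can be made concrete with a small calculation. What follows is only a sketch under stated assumptions, not the method used in the paper: it assumes the study's estimate is approximately normally distributed around the true effect, uses a two-sided test at the 0.05 level, and the function name `power_two_sided` and the example standard errors are hypothetical.

```python
from statistics import NormalDist

def power_two_sided(true_effect, std_error, alpha=0.05):
    """Chance that a two-sided z-test rejects 'no effect' when the true
    effect is `true_effect` and the estimate has standard error `std_error`.
    Assumes a normally distributed estimate (an illustrative simplification)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)        # about 1.96 for alpha = 0.05
    shift = abs(true_effect) / std_error      # true effect in standard-error units
    # The test rejects when the estimate lands beyond +/- z_crit standard errors.
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)

# Hypothetical studies of the same true effect (1.0) with shrinking precision.
for se in (0.25, 0.5, 1.0, 2.0):
    print(f"standard error {se}: power = {power_two_sided(1.0, se):.3f}")
```

In this sketch, power depends only on the ratio of the true effect to the standard error, which is why a study that is too small (large standard error) for the effect it is chasing ends up with low power.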
Russ Roberts: Let's do that again. Let's say that again. So, let's try to put it in the context of an actual empirical question that might be examined in economics. One of the ones you mentioned in the paper is the impact of a minimum wage on employment. And a caveat: Of course, there are many other aspects and impacts of the minimum wage besides whether you have a job or not. It can affect the number of hours; it can affect the training you receive; it can affect the way you are treated on the job. And it bothers me that economists only look at this one thing--this 1-0 variable, job-or-not. Number of jobs. Without looking at the quality, outside of the monetary, financial aspect. But, that's what we look at, often. And it is the central question in the area of minimum wage policy: Does it reduce or even expand potentially--which I think is crazy, but okay, a lot of people don't agree--whether it expands or reduces the number of jobs. Now, in such an empirical analysis of the minimum wage, how would you describe the power of that test? Meaning, there's some effect that we don't know of that impact. The power is--fill in the blank--the probability that?
John Ioannidis: Right. So, for that particular question, the median power if I recall that we estimated was something like 8 or 9%.
Russ Roberts: It is. I looked at it; I've got it right here. It is 8.5%.
John Ioannidis: There you go.
Russ Roberts: That means--so, what does 8.5% mean, in that context?
John Ioannidis: It means that, if you estimate for each one of these studies that have been done, what are the chances that they would have found that effect? That they would have found a statistically significant signal, if the effect is what is suggested by the largest studies, for example? Their median chance would be 8.5%. So, 50% of the studies would have 8.5% chances or less to be able to detect that signal. Which is amazing. I mean, if you think of that--
Russ Roberts: It's depressing--
John Ioannidis: Or depressing, actually. I mean, they basically have no chance of finding that. Even if it is there.
Russ Roberts: So, does this work on both sides of the question?
John Ioannidis: It is very, very difficult for them to pick it up.
Russ Roberts: Does this work on both sides of the question? Meaning: It obviously depends on your null hypothesis. So, if your null hypothesis is: Minimum wages have no effect, and I'm going to test whether they have an effect, you are going to say: Does that mean I'm going to find that I only have an 8% chance of finding that effect?
John Ioannidis: Yeah. It would mean that even if that effect is there, you would have an 8.5% chance of detecting it.
Russ Roberts: So, most of the time, I would not find it.
John Ioannidis: So, most of the time you would find a non-significant result. Called a null result. Or, seemingly null result. Even though there is some effect there.
Russ Roberts: But it could go the other way, too. Because your null hypothesis could be that the minimum wage has an effect; and I'm testing whether there is no effect. And I might not be able to find no effect. Is that correct to go in that opposite direction?
John Ioannidis: So, what happens in the opposite direction is that when you are operating in an underpowered environment, you have two problems. One is the obvious: That you have a very high chance of a false negative. Because this is exactly what power means. It means that 92%, if you have an 8% power--92% of the time, you will not be able to pick up the signal. Even though it is there. So, it's a false negative. At the same time, you have the problem of having a very high risk of a false positive when you do see something that has a statistically significant p-value attached to it. And, it could be entirely a false positive, or it could be a gross exaggeration of the effect size. And it could be that the smaller the power that you are operating with, if you do detect something, even if it is real, the magnitude of the effect size will be substantially inflated. So, the smaller the power, the greater the average inflation of the effect that you would see, when you do detect it. So, two major problems. With low power: lots of false negatives. Second problem: lots of false positives and gross exaggeration of the effect sizes.
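The claim that lower power means greater inflation of the effects that do reach significance can be illustrated with a simulation. This is a minimal sketch, not the analysis from the paper: it assumes each study's estimate is drawn from a normal distribution centered on a real, non-zero effect, and the function name `simulate` and the effect sizes (0.55 and 4.0 standard errors) are hypothetical choices, the first picked to give power of roughly 8 to 9%.

```python
import random

def simulate(true_effect, std_error, n_studies=200_000, z_crit=1.96, seed=0):
    """Simulate many studies of one real (non-zero) effect and summarize
    only the ones that come out statistically significant."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) > z_crit * std_error:   # two-sided test at about 0.05
            significant.append(estimate)
    power = len(significant) / n_studies
    # Average size of a significant estimate relative to the truth.
    exaggeration = (sum(abs(e) for e in significant) / len(significant)) / abs(true_effect)
    # Share of significant estimates that point in the wrong direction.
    wrong_sign = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    return power, exaggeration, wrong_sign

for label, effect in (("low power", 0.55), ("high power", 4.0)):
    power, exaggeration, wrong_sign = simulate(effect, 1.0)
    print(f"{label}: power={power:.3f}, "
          f"average exaggeration={exaggeration:.2f}x, wrong sign={wrong_sign:.3f}")
```

Under these assumptions, the significant results in the low-power setting overstate the true effect several-fold on average, while in the high-power setting they sit close to the truth. The sketch contains no selective reporting at all, so any such bias would come on top of this.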
Russ Roberts: Yeah, I think--
John Ioannidis: And you add a touch of bias to that, and obviously there are many different biases. But many of the biases that operate, have their common denominator that people are trying to find something rather than trying not to find something. It makes sense--
Russ Roberts: Well said.
John Ioannidis: So, someone is trying to maybe sometimes change the analysis a little bit or try another analytical mode, add some more observations or do a few more experiments or keep trying until they get the statistically significant p-value, somehow. So, if you add this sort of bias, which, based on what we have seen across multiple fields seems highly prevalent, then the rate of the false positives and the exaggeration really escalate further. And they can really skyrocket pretty quickly--
Russ Roberts: and as a result--
John Ioannidis: unless these biases are contained pretty thoroughly.
Russ Roberts: As a result, you get these dramatic papers with these huge impacts, some variable, some policy. And they are not reliable. I think Andrew Gelman calls this a Type M error, where M is magnitude.
John Ioannidis: Magnitude.
Russ Roberts: So, here's the part that's confusing for me, and I think I have some understanding of it, but I find many economists literally do not understand this at all. And certainly everyday normal human beings are going to struggle with it. So, here's the question: Say, you have a "small sample"--and of course, 'small' is--it depends on the size of the magnitude I'm trying to measure. And all kinds of things as well. But I'm going to use that phrase. I have a sample that--a better way to say it, is it is going to be underpowered. But let's just say it's small to start with so that people can understand what I'm talking about. So, I have a small sample. I take a sample of--let's say I want to figure out whether men are taller than women. And so, I go out and I sample 10 men and 10 women. And, you know, I could find lots of different things in that sample. I could happen to have chosen 10 relatively short men and 10 relatively tall women. And it would look like women are taller than men. But that result, given that there are only--there would have to be a very big difference given the size of the sample--by definition, statistical significance is going to take account of the size of the sample. So, I might find that women are taller, but it's unlikely in a small sample it's going to be statistically significant. Another example people use sometimes is a fair coin: If I flip a coin 100 times, I might get 55 heads. In fact, I'm going to get 55 heads fairly often out of 100 tosses. Doesn't mean the coin is biased. It's just the sample is not large enough to measure whether the coin is fair or not. So, a lot of times then, what economists do--and psychologists as well, and other folks--when they get a small-sample statistically significant result--in other words, they find it's statistically different--it's unlikely that these data were the result of just chance, they then say, 'Wow.
If I found it with a small sample, just think how statistically significant it would be with a large sample.' So, when economists find statistically significant results in small samples--and the definition of small here is going to be essentially underpowered--they are going to say, without looking at the power, they are going to say, 'Hey, look how great this result is. You can't deny it because it's even true in a small sample.' And then you come along, and Andrew Gelman, and others, and say, 'Actually, it's the opposite. With a small sample, the more likely it is that what you found literally isn't true.' So, can you try to explain that intuition? Sorry for the length of the question.
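The fair-coin example in the question can be checked directly with the binomial distribution. The sketch below uses only the numbers mentioned (55 heads in 100 flips); the function name `prob_at_least` and the comparison case of 550 heads in 1,000 flips are hypothetical additions for illustration.

```python
from math import comb

def prob_at_least(heads, flips, p_heads=0.5):
    """Exact binomial probability of seeing `heads` or more in `flips` tosses."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )

# Same 55% share of heads, two different sample sizes.
print(f"P(55 or more heads in 100 fair flips)   = {prob_at_least(55, 100):.3f}")
print(f"P(550 or more heads in 1000 fair flips) = {prob_at_least(550, 1000):.5f}")
```

With 100 flips, a fair coin gives 55 or more heads a bit under one time in five, so that outcome says little about bias. The same 55% share over 1,000 flips would be far less likely under a fair coin.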
John Ioannidis: Yeah. So, this is what we call the Winner's Curse. And it's pretty much the same phenomenon that I was describing earlier, that, if there is a signal, a true signal, to be detected, and you are running in an underpowered environment with very small studies, with very few observations like the 10 and 10 sample that you described, if you find it, you will find it in a way that will present itself in a much bigger magnitude compared to what it really is. Because, if it presents the way that it really is, it will not be significant. So, you will not detect it; you will not say, 'Eureka!'; you will not open a champagne bottle. But if you are lucky, or unlucky in a way--if you have this Winner's Curse to chance upon a configuration of the data where this is very prominent, then you will say, 'Wow. Look at that. This is fantastic. This is amazing. This is huge.' But, you know that the true effect is going to be much smaller. Now, it could be much smaller or it could be nothing at all.
Russ Roberts: So, that's the question. I understand it could be smaller. The hard part, I think the intuition is--and I guess, just to back up for a second: I understand why in a small sample I could have a false negative. I could say, 'Uuup, there's nothing there.' But, come on, you only have 10 women and 10 men; let's say they came out to be exactly the same height. You say, 'Well, I guess women and men are the same height.' That would be silly, because your sample was too small to find it; and it's underpowered. And you are likely to have a false negative. Why am I likely to get that significant result in that finding, and that it's a false positive?
John Ioannidis: So, I think that it could be either a false positive or an exaggerated--sometimes grossly exaggerated--effect, depending on how small the sample is that you were working on. It depends on what is the pattern of effects circulating across the field at large. So, if someone is working in a field that, let's say, there's a lot of prior evidence and very strong theory and other types of insights that have really guided us to create questions where the answers to many of those are likely to be non-null effects, then you are likely to fall into the pattern of just finding an exaggerated magnitude of the effect size rather than a complete false positive. If you are working in a field where you are just completely agnostic--black box, just searching in the dark, and actually in a field where there's not much to be discovered, just tons of noise--practically, then, if it's all noise, no matter what significant results you get it will be a false positive. So, there is a continuum here. There is a continuum of different fields and different priors of how many out of 100 or 1000 or 10,000 hypotheses that we are testing are likely to be hiding something that is genuinely not known [?no-no?]. And, there is a lot of variability in that regard. I think that economics is mostly operating in, let's say, middle ground. But there is a lot of variability. I think that people, for example, who go to do a very large, randomized trial that is very expensive, most of the time I would argue they have thought very carefully that that's not going to be a waste of money. And they have a decent sense of showing something--
Russ Roberts: --that's real.
John Ioannidis: I don't think that someone would do a randomized trial--
Russ Roberts: --that's real. They are going to [?serve?] themselves, then, trust me. But, you are saying--
John Ioannidis: Yeah. Yeah. I think that if they had a chance of 1 in a million of finding something, and then they say, 'I'm going to do a trial that is going to require $50 million to run,'--I don't think that that would be a good investment.
Russ Roberts: Correct.
John Ioannidis: Conversely, there's other fields where we're in a completely agnostic mode and we just ask hypotheses like crazy. And we ask millions of such hypotheses. And this is very common in big data science. And we know that the yield is going to be very low. Which is looking through a haystack and there's a few needles in there. So, these needles are few. And most of what we are going to detect is likely to be a false positive, unless we find ways to further document that what we have found is really true. Which means, typically, doing more such studies; having very stringent statistical significance thresholds; requiring very stringent replication to see it again and again. And then we can say, 'Well, no; that's true.' So, there's a continuum. And each field is operating in a different point within that continuum. I think most of economics research, I would dare say is operating somewhere in middle values of that continuum--so, not completely agnostic, and not very high prior. But there is a range; and different studies may be at higher/lower levels within that range.
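The continuum described here, from fields with strong priors to agnostic searches through millions of hypotheses, can be expressed with Bayes' rule. This is a simplified sketch: it ignores bias entirely, the function name `positive_predictive_value` is hypothetical, and the prior probabilities are illustrative assumptions rather than estimates for any real field. The 18% power figure is the median quoted earlier in the conversation.

```python
def positive_predictive_value(prior, power, alpha=0.05):
    """Probability that a statistically significant finding reflects a real
    effect, given the share of tested hypotheses that are true (`prior`),
    the power of the study, and the significance threshold. Ignores bias."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# Hypothetical priors: strong theory, middle ground, agnostic data mining.
for prior in (0.5, 0.1, 0.001):
    for power in (0.18, 0.80):
        ppv = positive_predictive_value(prior, power)
        print(f"prior={prior:<6} power={power:.2f} -> "
              f"share of significant findings that are real: {ppv:.3f}")
```

Under these assumptions, when very few tested hypotheses are true, almost every significant result is a false positive regardless of power, which matches the point about needing stricter thresholds and replication in that setting.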
Russ Roberts: To come back to this question of this intuition of discovering a result that's probably spurious--a false positive or a large false positive: The way I would read your perspective on this is that there are two sources of that mistake. One is just noise. Sometimes you are just going to draw from the urn of life a particularly unrepresentative result. But the other is publication bias--that 'I'm going to keep changing my specification, adding variables, changing the sample,' etc., to make sure that I can get a published result; and I'll strangle the data until it screams. In which case I would get that statistical significance. And I assume it's both of those working together. It's not just one or the other.
John Ioannidis: Absolutely. And, there can be different terms about what you just called publication bias. I tend to use the term, 'significance chasing,' or 'significance chasing bias,' or 'excess significance bias.' But, there's so many terms that have been coined in different fields. Trying to describe pretty much the same phenomenon--
Russ Roberts: P-hacking--
John Ioannidis: People have seen that this is--p-hacking is a very popular term in psychology and other sociological sciences. But, it's just a fact that people have seen that this is a major problem, and have coined these different terms to try to describe it.
Russ Roberts: So, when you used the metaphor of a needle in a haystack--that there might only be a couple in a big data set, actually, I think, maybe, a different metaphor is that there's an infinite number of needles: There's all these correlations that can look significant in a data set of large size. And most of them are not meaningful--that is, they are not replicable, they are not going to replicate; they are just the product of randomness. Is that--would that summarize--that summarizes my worry about big data. What do you think about it?
John Ioannidis: So, yes. I mean, I probably wouldn't use the term 'needles' to describe this, because needles would mean that they are true. But in a universe of big data, you are entering an environment that has the opposite problem of what we are describing in these meta-analyses that belong mostly to the past--well, they belong entirely to the past. But, they are meta-analyses. Most of the analyses in the past were of small studies. They were underpowered; they were at risk of these false positives, and false negatives, and exaggerated results. Now, we have more and more big data studies, which are over-powered, and where, again, just testing with the typical statistical tools that we have, nominal significance means close to nothing. It's likely that any analysis will be statistically significant one way or another. And, then you don't really know. Then, statistical significance has very little discriminating ability to tell you which ones are the real needles and which are just flukes.
Russ Roberts: So, for all graduate students and professors listening to this in economics, and any other field: When I now go to an empirical presentation, or a presentation of an empirical paper, I ask with a straight face, 'How many regressions did you run?' You know, the table--at the end, I get a table. And the table has got all these asterisks. And the asterisks are all significant at the .005--significant at the point 0-0-5--significant at the point--.005--you know, it's just full of significant results. And I say, that's lovely. But, how many regressions did you run? And it's such a startling question, the couple of times I've had a chance to answer--ask it. They don't answer it. It's not because they are embarrassed. It just never crossed their mind. It's not even a question. So, the problem, I think, in our field, and others--epidemiology being another example--is that there are so many opportunities in the kitchen, to do, whether it's p-hacking or what Gelman's called the Garden of Forking Paths. I have so many decision nodes to try different things. And unless you watch the videotape of how the food was prepared, you have no idea if it's safe or not.
John Ioannidis: Exactly. And, much of the time you cannot even count them. So, there are some situations where at least you can count them. Like, genetics, for example. You can count how many genetic variants you are testing. You know--if you are honest to yourself, and to others, you know that I am testing 10 million variants and you know what their correlation structure is. And you can use a formal correction for that. Either just a multiplicity correction, or some other way with a false discovery rate, or something equivalent, that will take care of the exact multiplicity burden that you have. In many other situations we don't really know exactly how much multiplicity we are dealing with. I mean, we are probably fooling ourselves, because we are going down that garden of forking paths, and we lose count--down the path of how many nodes did we need and how many options were there in each node? And how many choices did we make? And, many of these choices could be even subconscious. Or, mild, modest modification of one analysis, versus the original one. So, it's very difficult to estimate the exact multiplicity burden in that case. It's--you know it's there. But, you can't really put a number. You can't really use some direct method to correct for that multiplicity.
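For the case where the number of tests can be counted, the two kinds of correction mentioned above can be sketched as follows. This is a minimal illustration, not the procedure used in any particular genetics study: the Bonferroni rule stands in for "just a multiplicity correction," the Benjamini-Hochberg step-up rule stands in for a false discovery rate approach (it assumes independent or positively dependent tests), and the ten p-values are made up for the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up rule: find the largest rank k whose sorted
    p-value satisfies p_(k) <= (k / m) * alpha, then reject ranks 1..k."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected

# Hypothetical p-values from ten counted tests (illustrative only).
example_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                    0.060, 0.074, 0.205, 0.212, 0.216]

print("uncorrected :", sum(p <= 0.05 for p in example_p_values))
print("Bonferroni  :", sum(bonferroni(example_p_values)))
print("BH (FDR)    :", sum(benjamini_hochberg(example_p_values)))
```

Both functions need the number of tests `m` as an input. In the garden of forking paths that number is unknown, which is why neither correction can be applied directly there.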
Russ Roberts: So, if you're giving advice to a young scholar in any of the fields we are talking about--and I guess one could argue it's every scientific field in a certain dimension. But, let's talk about observational studies, as opposed to randomized control trials. So, they have their own separate sets of issues. But, people who are doing what for now, for, I don't know, 80 or 90 years has been classical statistics. And I'm a skeptic. Right? I've carved out that niche. And it's a dangerous niche. Because, if you're not careful, you just reject everything. You say, 'Oh, we can't know any of this stuff.' And that's obviously not true. And I don't really believe that. But I am highly skeptical of these observational studies. Should I be? When someone presents me with a result? What should I, as a practicing economist or practicing epidemiologist--what advice would you give us for trying to figure out what's true?
John Ioannidis: Well, I would probably go back and ask: Is an observational design having any real chance of giving us some reasonably decent realizable answer here? And there may be many situations where they could give an answer that is fairly reliable. I mean, it's unlikely that it will be conclusive and definitive--in a way, nothing is 100% definitive. But at least high enough on that scale of being definitive that you can take it to the next step. There are some situations where, when you just think about what are the odds of getting it right, maybe some designs are just not to be used. You should not use them. You should just abandon them. For some types of questions. To give you one example, we have performed hundreds of thousands of studies trying to look whether single nutrients are associated with specific types of disease outcomes. And, you know, you see all these thousands of studies about coffee, and tea, and all kinds of--
Russ Roberts: broccoli, red meat, wine--
John Ioannidis: things that you eat. And they are all over the place. And they are all over the place, and they are always in the news. And I think it is a complete waste. We should just decide that we are talking about very small effects. The noise is many orders of magnitude more than the signal. If there is a signal. Maybe there is no signal at all. So, why do we keep doing this? We should just pause, and abandon this type of design for this type of question.
Russ Roberts: We'd like to know. And that desire to know is so strong.
John Ioannidis: Of course. Of course. But, to know, we need to use the right design. So, I would argue for this type of question, where the error is 50 times bigger than the signal, we need to find designs that minimize the error. And, our best chances in these cases, if we still believe that it's important to know, they would be randomized trials. Or at least experimental trials that minimize confounding and minimize error as much as possible. Even those may not be able to get us an entirely definitive answer. I'm not saying that they are a panacea. But, at least we know that we are not starting completely off base. Even knowing that we will get a [?drunk?drone?drown?] no matter what. There's other cases where observational designs may be very useful, and very illuminating. There's sometimes effect sizes that are big and situations where we can have a pretty good understanding of what the confounders might be, and what is really influencing what. And, in that case, they are definitely having a role. So, we never got a randomized trial to prove that smoking causes cancer. But, smoking increases the risk of cancer 20-fold, as opposed to the 1.001-fold that many of these nutrients do. So, I would never argue that we need a randomized trial for proving that smoking is a bad thing for us. It has to be seen on a case by case basis. But, there is a lot of observational research that is really going beyond the performance characteristics that are being used. And I'm not sure that this is a good investment. One could always say that I do this for exploratory purposes and just to get a preliminary insight. But, I worry that much of the time we just don't get any preliminary insight, and even, these data that emerge are just biasing our thought.
Russ Roberts: Yeh, I agree.
Russ Roberts: I want to go back to Big Data for a minute, and just a general question in how one should think about empirical work. A lot of younger economists have told me that, 'Theory is over-rated. We just need to look at the data and see what the data say.' And, 'The data will speak.' What's your thought on that? And that's part of--by the way--the appeal of machine learning and Big Data, is that, 'Our theories are imperfect, so we'll just see what the reality is,' is the way they, I think, think about it. What's your thought on that?
John Ioannidis: Well, I'm not saying not to look at big data, but looking at big data, you see the patterns in the big data. This is not the same as saying that you see the truth or that you see causal effects or that you see the answer to important questions. You see patterns. I am very eager to do that; and I do waste a lot of my time looking at patterns in big data. But I want to be honest to myself that I am just looking at patterns. I'm not looking at the final frontier. And, these patterns are sometimes very difficult to interpret, and based on different theory, they would be interpreted very differently. So, I don't think that we have the end of theory; I don't think we have the end of statistical testing in any means, as well. But, big data have to be seen with a lot of caution. I think that we really need proofs of principle that these sorts of analyses eventually do help and are useful. So, it's not just an issue of, is it true or not, but also an issue of: Does it help, and can you build, for example, policy and decision-making on them? And, to be honest, I have seen very few examples where you can build reliable policy and decision-making based on Big Data. I mean, you can probably mislead your policy very easily with Big Data; and you can mislead in any of a gazillion ways that you may want. But, I would like to see more concrete examples where that would really be helpful. For the time being, I see it more as exploring an interesting space: learning about the data, learning about the patterns, learning about their errors, their biases; how we can fix some of these errors. So, it's like a machine that is still to be probed, and try to see what can we make out of it.
Russ Roberts: So, given your skepticism about many research designs and the nature of the complexity of the world, one of the issues that I struggle with is people then assume I'm against science. I know, you are laughing out loud. But they say it about me all the time. And I also say, I also make the argument that very few--maybe zero--questions in economics have been settled by a single great study. And I think that's true of science, generally, by the way--it's not an economics problem. That, empirical work tends to build up over time. But, even in economics, there's always a loophole. There's always a way to say, 'Oh, yeah, but that was after the war. You see, after the war...,' there's always--we don't typically do what I would call real science with experimental, real control trials, even in the ones that we call 'real control'--'randomized control trials.' They are subject to the location. They are subject to the context. They are subject to the way the instructions were given. So, I'm just--I'm overly skeptical, which is again I concede maybe a flaw. But I don't believe that evidence or facts are irrelevant. I do believe I've changed my mind about lots of things. It's just not when I open up a study of econometrics [?Econometrica?] and go, 'Well, I guess I was wrong.' How do you handle that? Do you get a lot of that, or not?
John Ioannidis: So, I think that there is a risk that you may get pushback by people saying that if you kind of disseminate a picture of science getting it wrong, and having so many problems and so many biases and so many difficulties, then you may offer ammunition to people who say that science is not worth it. And, of course, this is a risk. But, at the same time, in a way, this is the way that science works. I mean, science is not working with dogma. It is not working with absolute truth. It's working with some healthy skepticism. It's working with the desire to reproduce and replicate what we see, and document it very carefully to diminish biases, to improve methods. So, this rational and to some extent skeptical thinking is at the core of the scientific method. I don't think that we should abandon the scientific method or distort the scientific method so as to give it a message that science is perfect, because that's not what it is about. It's a very difficult endeavor. It's fighting and struggling with errors and biases on a daily basis, and trying to do our best, and getting as close to the truth as possible. I think also that if we go along the narrative of 'Science is perfect,' whenever you have these debates and contradictory data and big promises that are not fulfilled, then science becomes a very easy target for the wrong reason. And people say, 'You promised me that,' or 'You told me that, and now this is not so.' And we have not really made any cautious announcement ahead of time, that, 'Well, we know that with not perfect certainty'; 'We know that this is maybe 60% likely to be true, but there's a 40% chance of error.' Maybe there's a 70% chance of error. Unless we are accurate about our level of uncertainty, I think we will run into trouble. And I think we are running into trouble. And, in medicine, we see that all the time. 
You can have just a single paper that got it wrong--like, Lancet publishing a paper that MMR [Measles, Mumps, Rubella] vaccines cause autism. And then you have hundreds of millions of people who don't want to vaccinate their children. And, we're heading back to the Middle Ages. And the problem started from getting it wrong, and not having a message that we could get it wrong, and, you know, 'Some of our papers and our top journals could be wrong,' and that was not just wrong. It was more than that: it was actually fraud--which is not so common. So, how do we give an accurate picture of what science is? Which, to me, is the best thing that has happened to Homo Sapiens--sapiens. But, it's difficult. And it does have errors and biases; and that's what we're struggling with every day.
Russ Roberts: Well, I interviewed Adam Cifu, who is, you know, co-author with Vinayak Prasad of the book Ending Medical Reversal. And what 'medical reversal' is, is there's this idea that when a study comes out saying 'This is good,' or 'This is bad,' and people take it, 'Well, it's peer-reviewed so therefore it must be true.' And then--that's an observational study--when they go and do the randomized control trial, they find out that the result is the opposite: You shouldn't do that technique, or you should do something else. And, I just think it's--I think it's in fact an extraordinary thing, actually, given our power of reasoning, that we have so many false positives and false negatives because of our love of science and statistical sophistication. It seems like a big challenge for us to overcome that.
John Ioannidis: Mmhmm. Mmhmm. Yep.
Russ Roberts: Now, a lot of people suggest that we should change the level of statistical significance. It's funny--there's no law--there is a law; there's no legislation, as we make that distinction here. It's a norm that 0.05 is the right amount. What do you think of that as a way to--'We should be more demanding. We just should make a higher hurdle for people to get statistical significance.' And then we have people like Andrew Gelman who have said we should just stop talking about it completely. What's your thought on that?
John Ioannidis: So, I was one of the authors in the paper that suggested moving the traditional threshold from 0.05 to 0.005--so, adding an extra zero. And, I see that as a temporizing measure. I don't see it as a perfect fix. I think that in many--most--circumstances actually using statistical significance with p-values is not the best way to approach the scientific questions. In a few cases it is--maybe, I would say, in the fields that I am working in, which are mostly biomedical but not necessarily so, about 20% of the time, null hypothesis significance testing would be the way to go, indeed. The other 80%, not at all. Or, a very distant second or third choice. Why did I co-author that paper? The reason is that we are living in a situation where we have a flood of significance. So, that extra zero is like placing a dam to avoid death by significance. You know, drowning by significance. It's a temporizing measure. Would it solve all the problems? No. But, probably, from what we have seen across different fields, on average about 30% of those false positives would no longer be false positives, because they would be in that borderland between 0.05 and 0.005.
Russ Roberts: But you are assuming that the authors wouldn't have tried harder.
John Ioannidis: Well, but, then the question becomes: once you have that dam in place, authors would be p-hacking around that new standard. So, instead of trying to pass the 0.05, they will be doing their best to pass the 0.005 threshold. But, this is becoming a bit more difficult for them; and with the current sample sizes that are circulating in most scientific fields, this is not going to be easy. When they do make it, then the bias will be worse. So, the average inflation for exaggerated results would be even more. But there would be fewer such. So, I see it as having some advantages, some disadvantages. Probably on average substantially more advantages at the moment, compared to disadvantages. But it's not the perfect fix. It's not the end of the day. I think that we need to think more broadly about replacing our statistical inference tools with more fit-for-purpose [?] tools, and also moving to the design phase of research. So, designing studies that have a higher chance of getting us close to the truth, with less uncertainty.
Russ Roberts: What do you think about pre-registration, where a scholar would put down in writing somewhere publicly what they are going to be looking at, to reduce the p-hacking work in the kitchen?
John Ioannidis: I'm very much in favor of pre-registration; and I have supported that for many years, over a decade, and in various fields. I think that it can help. I think that it has helped in some domains like clinical trials in medicine. Is it perfect? No. About 50% of trials are registered; and of those, about 50% are properly registered; and of those that are properly registered, about 50% report their outcomes; and of those, maybe 50% are well done in other dimensions of their design. So, eventually it trickles down to smaller and smaller numbers that would be protected from various biases. But, at least, it's a step in the right direction. Can we apply it to any type of research? I don't think that this is easy to do; and I would be very happy for lots of research that is exploratory just to acknowledge that. So, if something has been obtained through a garden of forking paths and zillions of analyses and extremely complex meandering paths of thinking, saying that this is pre-registered is just trying to fool others and fool ourselves. What should be conveyed about this research is that it was entirely exploratory, extreme data-dredging at its best; and that's fine, provided that we know that this is what it was. And then, at a second stage, someone could preregister a study that follows that exact same meandering recipe that emerged from that exploration.
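The "trickles down" arithmetic in the registration figures above can be made explicit. The sketch below simply multiplies the four rough 50% figures quoted in the answer; the stage labels are paraphrases, and treating the stages as a straightforward product is an assumption of the example.

```python
from math import prod

# Approximate shares quoted in the conversation, each conditional on the
# previous stage (labels are paraphrased for this illustration).
stages = {
    "registered": 0.5,
    "properly registered": 0.5,
    "report their outcomes": 0.5,
    "well done in other design dimensions": 0.5,
}

# Share of all trials that clears every stage.
protected_fraction = prod(stages.values())
print(f"{protected_fraction:.2%} of trials clear all four stages")
```

On these rough numbers only about one trial in sixteen ends up protected at every stage.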
Russ Roberts: Different data set, time period.
John Ioannidis: Different data set, different study. Now that you have this very peculiar combination of choices and design and analysis, okay, 'That's what you got; let's try to repeat it and see whether it works.'
Russ Roberts: Anything else you'd like to recommend to editors or young academics for how to make this problem get better? Any policy changes you're in favor of?
John Ioannidis: I think that it's not one solution that would fit all. There are over a dozen families of solutions that are being discussed, and some of those I have reviewed in some of my recent papers. In a way, some of these solutions could be complementary, or they could co-exist. And one may help another. So, creating a replication culture, pre-registration, data sharing, protocol availability, better statistical methods, fit-for-purpose statistical methods, stronger and more stringent thresholds, different types of peer review, more openness in peer review, more transparency--all of these have lots to share. So, sharing data can facilitate peer review. It can facilitate replication. It can facilitate team-science. It may lead to making pre-registration more plausible. There is a very high correlation between these ideas; and eventually these ideas would work if we have multiple stake-holders who believe that they are worthwhile adopting. It's very difficult for a single scientist to just go out there and say, 'I'm going to do it differently than all of you.' It's very difficult for a single journal to do that. It's very difficult for a single institution to change their practices. But, if people recognize that this is a good idea, and you have multiple journals, multiple institutions, multiple funders, multiple scientists who believe that this is the way to go, then we do see change. So, for example, registration for clinical trials had been out there as a possibility for 30 years; but it was not really happening until all the major medical journals said, 'I'm not going to publish your trial unless you have pre-registered it.' And, then funders also joined. And then everybody wanted to do it, because they wanted to have their paper published in the best journals. And the same applies to other fields. Economics has made tremendous progress over the years in terms of some of these transparency practices.
Especially the best journals have adopted several of these practices.
Russ Roberts: So, I want to apologize to you. I think I first heard about your paper, "Why Most Published Research Findings Are False," I think I first heard about it from Nassim Taleb--I'm guessing; I'd have to go back and look at it. And I thought, 'Well, that's ridiculous. That's just silly. What kind of a paper is that?' And it was a theoretical paper. It wasn't like you went around and then you went and re-measured it and you showed they mismeasured it. It's a very interesting paper, actually, obviously, and it's a very provocative paper. So, my apology is there is a lot more to it than I had thought from the title. But my question for you is: Since you say you constantly, you want to be constantly reminded that you know next to nothing: You write a paper like that; and then Brian Nosek and his team in psychology finds that only 40% of the top papers in psychology in the last 10 years replicate, you must feel pretty smart. So, how do you keep your humility?
John Ioannidis: Oh, goodness--
Russ Roberts: It's a trick question. Sorry.
John Ioannidis: So much potential for making mistakes and errors. And, you know, just finding biases. Or not knowing about biases that you have in your own work. So, some humility is indispensable. I think that this is what's really interesting, and nice about science--that there's no end to revealing how many mistakes you can detect and you can fix. And, saying that I have detected the final mistake and now I have been doing perfect research, that's very presumptuous. So, I'm trying to not forget that. And I'm trying to keep reminding myself that maybe all of my work is wrong. Who knows?
Russ Roberts: Well, what are you working on? You took on economics lately. What else are you working on?
John Ioannidis: So, as part of the work that we are doing at the Meta Research Innovation Center at Stanford, the big privilege is that we can work across very different types of domains. And, I'm surprised and excited to see that many of the problems that we have seen in biomedical fields are not just applicable to these biomedical fields. They accure [?accrue?] in very different areas. So, we have a great network of collaborators, and I really enjoy working with people who are not in my core fields, because they can really teach me about what is going on in their field, and what are the issues. So, my collaboration with Tom [T. D. Stanley?] and with Hristos Doucouliagos in that paper was really fascinating for me, because obviously I'm not an economist. And getting to know that literature from an insider view was really fascinating. At the moment I'm working on appraising biases and trying to test out solutions in very different fields. And then, it's--there's really no end to it. I think that there's a lot of exciting work happening in psychology and social sciences. Economics just as well--it has some very exciting leads at the moment. There's a lot of questions on Big Data, on registration of different types of studies, on new designs for randomized trials, on advantages and disadvantages of experimental design versus observational data. On pragmatism. On how do you differentiate between credibility and utility in research? Implementation issues of research practices; reward systems and incentives; trying to network different universities and leadership of universities and funding agencies and re-addressing and re-discussing how do they prioritize rewarding and promoting and funding scientists? So, it's--I feel a little bit like a kid in a candy shop. There's so many things going on. And, all of that is just so exciting.
Russ Roberts: Well, as an economist, although all the--I would call it the nuts and bolts of good science: Transparency, ideas of registration, survey/research design, experimental design--these are all really, really important, and it's important to try to get them right. I would just suggest that it's hard to get them right in a world where we as academics now can make a large sum of money, and we're getting on the front page of the New York Times--it's still a lot of fun, and also the institutions that we work for really like that. So, as long as that's there, your big challenge--and I salute you for taking it on--is: How do you fight against that fundamental incentive? We have this romance about our task that we are just truth-seekers. But we are also human. And those financial incentives have changed so much over the last 50 years, for the mainstream members of economics and other fields.
John Ioannidis: Mmm-hmm. Well, there's clearly some incentives that are misaligned. But, the question is: How can you really realign them? And I don't think there's anything wrong, necessarily, with financial incentives. It's just an issue of: How do you get them to work for you and for better science rather than for more short term gains?