Brian Nosek on the Reproducibility Project

Nov 16 2015

Brian Nosek of the University of Virginia and the Center for Open Science talks with EconTalk host Russ Roberts about the Reproducibility Project--an effort to reproduce the findings of 100 articles in three top psychology journals. Nosek talks about the findings and the implications for academic publishing and the reliability of published results.

LISTEN NOW:

Comment

●

READ TRANSCRIPT

●

DELVE DEEPER

DOWNLOAD

Time	Podcast Episode Highlights
0:33	Intro. [Recording date: November 6, 2015.] Russ: Now, you were a guest about three years ago. I think you had just started the Reproducibility Project, which was an attempt to reproduce results--and particularly starting in Psychology. And the first results from that project have now been published. So I thought it would be a great time to review that whole enterprise and have you back on to see where, what you found, and where you are going in the future. I want to start, though, with some background. What does it mean for a result to be reproducible? Because there are different ways to think about it. So, when you use the term 'reproducible' in these ventures, what do you have in mind? Guest: Yeah. It does mean multiple things. And the typical use of reproducibility is, if you already have some data and a finding, can I, using the same data, obtain the same finding? And that could be as simple as taking the analysis code that you prepared and hitting 'Run' again, in an environment that produces the same result. Or, it could mean just taking the data and, following the logic of what you did, generate my own way of analyzing the data and getting the same result. The way we use 'reproducibility' is the further extension of that. Which is to actually collect new data and see if we can obtain the same result. This form of reproducibility is also called 'replication.' But 'reproducibility' is often used as an overarching term for these various ways to [?]. Russ: Yeah. And, what you really want to know is: Did the--typically, in psychology, for example, you are looking at an experimental result; and you want to know if you run the experiment on a different group of people, but follow the same protocol--that is, ask the same questions or put the subjects through the same experiences--whether the result is the same. So, to me--so, let me make sure I understand this, because I'm a little confused. Replicate--what do you mean by replicate? Guest: So, in this context, replicate is the act of trying [?] to create the same conditions that are necessary to observe that particular finding. And that's usually following the protocol very closely. But it might require some adaptations to that protocol, that are perceived to be irrelevant to the finding. Right? So if we did the original study in the United States, and we want to do a replication in Germany, we might need to translate the materials to inform[?] the participants be able to read it. So, it's not perfectly redundant. And no study is perfectly redundant of the original. There's always lots of changes. But the key is that those changes are deemed to be, understood to be irrelevant for what you are actually invested in. Russ: So, why would you--I'm going to say this in a negative way on purpose--why would you waste your time replicating these results? We've already found them. What are some of the issues? We talked about this 3 years ago but I want to review them. Why would you be suspicious or concerned about some of the findings in the fine[?] peer reviewed literature journals in your field or mine? Guest: Well, the general answer is that reproducibility is central to science. Right? So, a scientific claim doesn't claim credibility because of an authority saying, 'This is true; you have to believe it.' Or, because that person has a good reputation. Scientific claims gain credibility by the ability for them to be independently reproduced. Someone else can follow the same procedure and find the same results. So, the credibility itself is not [?]. So, that is a core principle of how scientific claims become credible. It means that half[?]--I mean, principle research is a core value of what science does, how it operates, how we succeed in developing knowledge in science. So, then, yeah, great; that's a value. Why do we[?] need to then actually redo experiments that have been shown in the literature? And the answer is because we don't know the rate of reproducibility in the publishing literature. We can assume that it's high. But we don't know that it's high. And there's a lot of indicators, prior to this project being started, that suggest that there were challenges to reproducing--reasons to expect that it might be longer[?] than we anticipate. And a lot of those boil down to the incentives that drive individuals' incentives' behavior: My success as a practicing researcher is contingent on the publishing; and the publishing is often as possible and in the most prestigious outlets possible. And what gets published isn't necessarily--there are certain things that are more likely to get published than others. I'm more likely to get published if I get beautiful, clean, positive, innovative results. Because those are the best kind of results. But not everything that we do in our research looks like that. In fact, most of it doesn't. And so my sentence[?] are to try to make the research as beautiful and publishable as possible, not necessarily to make it as accurate as possible.
6:19	Russ: Well, that's what Photoshop is for. Oh, no, that's a different field. But, so there are sort of two problems here. One is fraud--which is that there is an incentive, unfortunately, to literally cheat. But that's not the only problem. Guest: Right. Fraud is certainly a problem, and to go that distance, to actually deliberately deceive, is a big deal. But that is probably a very small part of the challenges for reproducibility. Because, most scientists probably not going to go that far, even though the incentives are strong. But rather, most researchers are in it to learn something. They are trying to get to the truth. They are trying--why do all this work if we weren't motivated by discovering and trying to find new knowledge and trying to apply that knowledge to solve human problems? So, there's a lot of genuine effort that researchers who are trying to discover truth [?]. But we are also trying to survive and thrive as practicing researchers. And because of that, we have a conflict of interest. The findings that I obtain are, have impact, some [?] outcomes. And so I will find reasons, not necessarily intentionally, to drop studies, to drop analyses, to analyze things multiple ways. I have flexibility in what I--and it's [?] papers that I try to get published. And if that isn't a complete representation of how I got to those findings, then what's in the publishing literature could be more beautiful than what reality is. Russ: So, in the paper you wrote a while back that we talked about three years ago, there are two things--I just want to mention again because they are so important. One is the line, 'Published and true are non-synonyms.' Which is hard to accept. I think journalists certainly have an incentive to ignore that truth; so they publish lots of things that are published results in peer-reviewed journals because they are really interesting and people want to read about them. Whether they are true or not is a different question. But just quickly: Retell the research finding you had about shades of gray that was in that original paper. Guest: Yeah. So, [?] were interested in a very fascinating area of research in psychology, which is how physical states and mental states may be linked in robust ways--unexpectedly. And so, we had some data where people had to judge gray swatches and rate how light or dark they were--match them with other gray swatches. And people do this task; and it's just a perception task: How light or dark is this? And when we analyzed the data, we separated the Liberals and Conservatives from the Moderates. And what we observed is that Moderates were better at perceiving shades of gray accurately than people that were on the far Left or far Right. Russ: Physical shades of gray. Guest: Physical shades of gray. Right? So it really plays into all that argument: I'll feel black and white [?] Russ: It's just so beautiful [?] Guest: [?] It is wonderful. [?] Russ: Brian, I tell this story all the time to people; and when I tell them that result, they always go, 'Yeah. That's cool. That makes sense.' So, go ahead. Then what happened? Guest: So, we could have just stopped there and sent it in for publication, because it is an amazing result. It's beautiful; it's innovative. It would have been highly publishable. But we said, 'You know what? Let's do a sanity check here, because this is kind of amazing. Let's go ahead and do it again.' We have an easy mechanism for data collection so we didn't have any barriers to running the study again. And so we did it again, with a very large sample. And it was--dis[?]. It didn't show a second time. And then we said, 'Oh, my God. Why did we do that second study?' This was the biggest mistake. Like, we had the finding; now it's gone. And so, we didn't end up publishing the paper. We ended up publishing a story about this: that that result was exciting; but the first time we found it was more exploratory. We found a data set that was relevant; we looked at it a few different ways, and we found this great result. The follow-up was a confirmatory test. Right? We put some constraints on ourselves because we had a finding: we had a method that we used to find it; we had a way in which we analyzed the data to observe it. And now that we had those things, now we had constraints. And then we did the study again in the exact same way, and it went away. Russ: Did you do it a third time? Guest: We have not done it a third time, although Matt is talking about it. [?] So few. There's potential here. Russ: Well, maybe the second study was the unrepresentative one. Or is it the case that once it doesn't get confirmed, it's not a reliable finding--period. Guest: No, your first response is right, which is: No one study is definitive. Right? That first study was a good, interesting initial effect. The second study provides some skepticism. Both of them contribute to an understanding of--okay, maybe we're not sure here. And a third study would be useful. We didn't yet follow up on this, but I do think it is still a possible result. But it certainly isn't a publishable result yet. Given the current incentive structures in science, we need to have clear evidence. And so, what would we do if we sent both of these studies to a journal? The reviewers will look at it and say, 'Well, we don't know what to believe here. And so how can we possibly publish this?' Whereas if we had just sent the first one, it would have been much more likely to be accepted--if they didn't themselves demand a replication, which is very rare, historically. Three years ago it was extremely rare. Now, it's more common to ask for a replication of some sort.
12:30	Russ: So, just to add a little twist to your study--which I suspect you did not do--or maybe you did do it. You might have at the same time that you asked people to identify shades of gray and checked their political views, you could have also checked their eyesight. You could have given them an eye test, a physical eye test, and graded their eyes' ability to see. And perhaps it might have turned out that some of the people with the less high quality eyesight maybe ruined your data in that second test. See--let's exclude this group whose eyesight wasn't so good. And then it confirms the finding. And I think the challenge in economics certainly and I suspect this is also true in psychology, is the temptation to remove outliers. To censor the data--meaning to throw out high or low variables--because obviously they are not representative; or certain types of people in the sample: I think I've mentioned this before on EconTalk, but the study that I once read that got front page coverage was on the relationship between drinking alcohol and various forms of cancer for women. And it just turned out that they had excluded women who didn't drink from the sample. Now, it turned out that women who didn't drink actually had higher rates of cancer than the women who drank a little bit. That kind of ruined their story. But they justified removing the women who didn't drink because, 'Well, maybe they just started not drinking.' But, of course, that would be true of all of the sample. So that's a very unattractive reason, to me, for excluding them. But the point is that usually, in many, many situations, we have lots of choices that take place in the kitchen; and if you're not in the kitchen with me, you don't know what I've done. And that's why your work, this project you are doing, is so incredibly important. Guest: Yeah. That's a great example. And it really is the case that we have enormous opportunity. There is substantial flexibility in how we analyze the data. Right? So, eyesight would have been a great one. We could have--well, hang on a second. We had a lot of young people in this. So maybe their political views aren't really yet well formed. Russ: Not fully formed. Guest: The [?]. The more people [?]--let's look at their actual political knowledge--so it's really those that really understand these issues, that this would happen. So we could have analyzed the second study to death and maybe found some moderating influences: It shows up here and not here. And then our finding is preserved. But, of course, the problem is that we've now looked at the data, and the data itself has shades, how we analyze that data. So we are both simultaneously generating and testing the hypothesis, with the same data. And that's a no-no. We can't do that. We cannot generate and test hypotheses with the same data confidently. Russ: And yet-- Guest: The key is: we should do that; we should dig into that data in the second study, because we have our idea. We have our hypothesis. So we do do all that digging. If we find something like that, then we can't stop there. We have to test that with new data. And that's the cycle of science, right? No study is definitive; we are going to explore and learn from our data. And then we need to follow it up with real hypothesis tests where we put the constraints on ourselves. Russ: Ed Leamer has pointed out in his famous paper, "Let's Take the Con Out of Econometrics," and talked about on this program, that when you dig in the data like that--when you have a hypothesis and you test it, there's a standard set of statistical tests to test for significance. If, however, you start going back and forth between the data and hypotheses, trying different formulations, excluding certainly sample points, adding variables, taking some out, you've really left--not 'really.' You have left the field of classical statistical testing, and the tests don't apply any more. But we pretend that they do. Guest: Right. Right. Yes. Exploratory analysis is very important; but it is also very different in what you could conclude from confirmatory tests.
17:00	Russ: Okay, so let's go on to your project, the Reproducibility Project in psychology. What was the plan? What did you--what was the idea and how did it come to be? Guest: So--I guess it's about 4 years ago now, we thought: There is lots of discussion about reproducibility. We've thought about this for years. But we really have no direct empirical evidence about the rate of reproducibility in any field--let alone psychology, my home discipline. So, how would we get some kind of estimate for whether there is a [?] problem in reproducing published results? And the most plausible reason is you take a random sample--or the ideal way, is to take a random sample of the published literature, run replications, and see how many replicate. And that's saying like--that's going to be hard to do. Russ: Yeah; that's a ludicrous idea. No one wants to do it. Guest: No. It is ridiculous. Russ: There's no money in it. There's no fame in it. There's no glory. Guest: Turns out all of those ideas are wrong. Because here I am, talking to you. Russ: Oh, the glory. Guest: The glory. Exactly. But we thought, Okay, so we can't do the perfect ideal, to run a huge sample; but can we get close? What would it take to get close. No one--we can't devote all of the resources in our own lab to do this. But there are a lot of people that are interested in this issue, and maybe we could get a group of people together, to run, 15, 20, gosh maybe 30 replications. And that would at least give us something. Somewhere to start. Because we just need something to ground this discussion. There's lots of mudslinging. There's lots of hoohahing. There's lots of arms-waving about the reproducibility challenges or not challenges. We need to ground that in some evidence. And so we just started to say: Let's each contribute a little bit. Let's crowd-source this and have a common protocol for how we are going to select studies; how we are going to develop the replications themselves; how we are going to adopt that; how we are going to interact with the original authors. And we'll write a paper with the aggregate results. And so, no one person or team will have to spend too much of their resources on doing a replication. But, collectively we'll get something interesting out of it. And so we started just by creating our own web page; and this was at the same time that we built the open science framework, which is our scholarly commons, a free, open infrastructure for people to manage their research. And so it became the infrastructure for support in the Reproducibility Project. And then we sent out an email and said, 'Anybody want to join us?' And very quickly it was a team of 50 people, who [?] that it actually might work. And then after a few months it was a team of 100. And a few more months it was a team of 150. And so, our aspirations of how much we can actually get done sort of grew in response to the fact that there was a real community of people in the field that were hungry to try to get some evidence, some understanding, of whether this is a problem and to what extent it is and what we might do about it. Russ: Now, were you worried that some of them were too hungry? Kind of fun to tear down a famous study that your mentor's enemy did? Guest: Right. Russ: Ruined your mentor's research? So, how did you deal with that? Guest: Certainly in replication, that has been the perception of why one would ever do a replication: Is as a hostile act. Right? Replication is a threat, not a complement in the pressing culture of science. Because, 'Why would you question my work?' is the immediate reaction. Rather than, what we would idealize as the response to replication, which is: 'Oh, my important finding is important enough that someone else wants to reconfirm. That should be how it is. But it isn't how it is. So, there is that challenge. And there certainly is variation in people's product[?] curves of what they think [?] the problem is. And whether they think particular effects will replicate or not. So, the way that we tried to manage that in this process was to provide constraint on us as replication teams. And there are a variety of constraints. But a few of them are that we define a sampling frame of a particular set of studies that were eligible for replication. And then we slowly made studies available from that set, to minimize selection bias and to maximize just trying to have enough flexibility so that we can [?] studies with the teams that have the relevant expertise experience resources. So, what that--part of what that trying to do, is minimize the likelihood that individual team members would say, 'Oh, there is this particular study I don't like.' Rather, they are just looking at these particular studies, from this particular year, from this particular journal. And they don't have any strong stake in any of them. But they are just looking at them for what trends[?] they can do with that. Russ: So you didn't have--go ahead. Sorry. Guest: So, yeah, that was part of the constraint that we had to ourselves. A second element of the constraint on their replication teams was we required interaction with the original authors, to really do the best job that we can to have a good faith replication. So, trying to obtain original materials from the original authors; getting their critique of the design for [?]reproduction; and documenting--if they continue to have concerns, documenting those and servicing them, so that a reader can evaluate, themselves: Was this a fair replication or not? Do I believe this one or not? And then the final part was just transparency of the entire process. So, everything about the process: All of the materials, all of the data, all of the designs, the original protocol--all of that is available on the Open Science Framework, so that others can dig into it and decide what they think. And that includes commentaries from the original authors about the results. And those are attached with the results of the original authors [?] to write a comment to. Russ: What's fantastic about this is that there was a study, replication attempt, outside of your project. It was a very, very high-profile study that had shown that when hearing words related to the elderly, being a senior, or being old, people that left an experiment were slowly--than when they didn't get stimulated by those words. And when I saw that result I was kind of skeptical about it. I guess somebody else was. And they tried to replicate that finding and they couldn't replicate it. And I remember the hostility that the original author gave back to the replicators. Because the authors said, 'You didn't do it right. You're tone of voice must have been wrong. You didn't pronounce 'senior' correctly.' There of course are always things you can point out that might have made it plausible that the replication would fail even though the result was true. And what's nice about what you've tried to do, is that you've tried to make that process less of a debate and more of a catalog of what actually went on. So, I think that's phenomenal.
24:51	Russ: So, let's talk about the universe of studies. What was the universe you ended up choosing for replication? Guest: So, we used 2008 as the year of publication. And then we selected three journals from that year: Psychological Science, JPSP, Journal of Personality and Social Psychology, and JEPLMC, Journal of Experimental Psychology, Learning, Memory, and Cognition. And then we started, with the first article that was published in the first issue of the 2008 year, and we just moved forward from that initial article. And then, there were a lot of studies in the sample, as it accumulated, that we couldn't feasibly do. This was a volunteerism project; it was built on minimal resources, at first. We did get a grant later. But we did this just with people who were willing to put in some time. So there were studies that were in this set that were--longitudinal studies of 3000 Dutch and they measured them over 5 years. We couldn't put the resources in to do a replication of that. And so there is a selection bias in that the studies that we were able to do from this frame are a subset of the total studies that we could have done. And there are reasons--most of them being feasibility constraints--for ones that were included, versus ones that were not. Russ: And you ended up with a hundred studies, correct? Guest: A hundred completed studies, in a report. Yes. More were claimed, but some of them didn't get finished. Russ: So, I have just a technical question. If I have a study that had, you know, 5 findings: is that one replication, or is it five? How did you decide what to replicate within a study, within published article? Guest: Yeah. So that's one level now down deeper: is that, once we identify the article, what's to be replicated? Some of these articles had 6 independent studies, and within each of those 6 studies might have had 10 or 12 findings that they report, talk about. So, the selection process we had for that, just to have, again, some constraint for us that we could focus on a particular thing with the implications for bounding our inference, is that we by default selected the last study. So that we wouldn't have selection bias of looking at all the studies and trying to pick which one we thought would be most or least, whatever, likely to replicate. So, start with the last one. And then if it's not feasible-- Russ: What do you mean, the last one? Guest: So, a paper will have studies presented in some order. And whatever the order is that the researchers decided to report those. And so, just reading through the paper, whether this is Number 1, that's Number 2, that's Number 3; okay, we start with Number 3. And, identify the key finding from that study. And so it's trying to narrow it down to a single test that occurred in that study. If it turned out that it was infeasible to replicate the third study, but the second study was feasible to replicate, then the team could move and do the second study. Sometimes even the original authors recommended moving[?]. Russ: But, by 'study' you don't mean paper? Guest: That's right. Within a paper, might be many studies. Yeah. Russ: Got it. Guest: So, in psychology this [?] where it's a very simple paradigm, and so where they run 5 different experiments, all in sort of a similar kind of question. And then write one paper up. Russ: Okay.
28:39	Russ: So, let's--so you ended up with--in 2008 there were an enormous number of articles written in all three of these journals. But you ended up with 100. Is that just because it's a nice round number? Why did you get to 100? Guest: So, we got to 100 because we really, really wanted to get to 100. That's the actual answer. Russ: It's like Thomas Jefferson. Thomas Jefferson died on July 4, 1826; he wanted to get there. Guest: Right. Yeah. So the original aspiration was 30. And then we thought, once there was a lot of people involved, we thought maybe we can get to 50. And then, once we got a grant from the Laura and John Arnold Foundation to help support the project, we aimed for 100. And, more is better, with data. We have more of these occasions then we can have more confidence in our inferences, more precision in our estimates. And so, 100 has an appeal just in its roundness; but it also was the target that we committed to, to our funders and to the team, to say how much can we actually give them. And it was an enormous effort to get there. Mallory Kidwell and Johanna Cohoon, the two project coordinators were just pushing and pushing and pushing to get us to that number. Russ: What were their names again? Guest: Johanna Cohoon and Mallory Kidwell were the coordinators. And so we had charts on the wall in our office about [?] on this; we had markers on which studies we were going to get done, which teams were making progress, where are they. And we were just really pushing to get to that. Russ: So, to get to 100, did you go past 2008? You did, right? Guest: No-- [?] Yeah. So, there were 488 total possible articles we could have selected from. I think it was 165 or so of those [?] eligible: so they were in the frame that could be selected from. And then, of those 113 were actually, replications got started. And 100 of those got finished. Russ: And you realized, of course, that there was a financial crisis in 2008, so that kind of really throws off the whole--and it's also the year after the Red Sox won the World Series for the second time, in this millennium. There's a lot of problems. But 2008, we'll stick with it. You decided to pick a year that was recent but not too recent. Guest: Yeah. We wanted something--so all of these were sort of decision factors, right? Because the ideal is a completely random sample. But a completely random sample of what? Where does psychology end and other disciplines begin? So that would have been too hard to define. And then, there's also a lot of journals that are very low impact, and so if we do a completely random sample then we get a lot of studies where people would say, 'Oh, but we don't take those studies very seriously. It's really the studies in the important [?].' So we constrained to journals. And we constrained to 2008 because--I think we started in 2011; what we really wanted to have recent enough where we could get the original materials. Right? Failing to reproduce because we can't get the materials is one reason that is less interesting; there are potential other reasons. But at the time, we wanted it far enough back where we could get estimates of how much impact those were having. So we didn't pick the year that we started as the year, because we thought, 'Oh, we'll finish this in 6-9 months, so we need some time.' Of course, it took 3 and a half years, not 6-9 months.
32:23	Russ: So, I'm going to read the abstract from the paper that you and the rest of the team published in Science recently; and then we're going to talk about the findings. So, I'll read the summary, the abstract, first: Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams. So, I want to talk about three of these findings. The first is that--the one that got the most press--"39% of effects were subjectively rated to have replicated the original result." And that number, 39%, got bandied around a lot in the news accounts of your study. Talk about what it means by--first of all, how did you measure success? How did you measure it replicated the original result? And why is the word 'subjectively' in there? Guest: Yeah. So, that abstract summarizes 5 different ways of characterizing reproducibility. And this is important, because there is not--what does it mean to have not replicated the original result? It seems like an easy question, but when you start to unpack it, it's a very hard question to answer. And so we use the multi-methods. Right? We use 5 different ways to sort of look at what the relationship between the replication and the original results were. And one of those generated the number 39%. And so the subjective ratings were the result of the replication team finishing the study, doing analysis, looking at the results, and then deciding: when I look at this and I look at the original result, is the replication a success or not. And they gave a [?] no answer, until it was their subjective judgment of that. That one was very highly correlated with another one of the criteria, which is: Was it a significant result? Did they obtain a p-value of less than .05 in the replication? Russ: Stop there and explain what a p-value of .05--why that matters. Guest: So, the p-values are the standard in most areas of science that use inference testing for deciding about the credibility of a particular piece of evidence. And what a p-value means is, assuming the [null--Econlib Ed.] hypothesis is true, how likely is it that we would observe this pattern of data--the null hypothesis, I should say--that there is no relationship to observe--how likely is it that we would observe these data? And when these data are extreme--they are quite different than what you would assume if there was no relationship--then you get a low p-value. And so, lower p-values are better in the sense they are saying: These data are unlikely to have occurred with no relationship. And so then, people want to say: So, then that means that the [?] alternative hypothesis, that I think there is a relationship, is true. And so, p < .05 is the de facto criterion in most scientific research for whether the effect is considered significant and you can claim that there is evidence for your effect. Russ: And it's important to point out that the word 'significant' here, in statistics, is not the same as when it's used in everyday language, which in everyday language 'significant' means important, has a big effect, matters. It matters. But in statistics, it just means that it's unlikely to be true by chance. Which, it still could be a significant finding statistically but insignificant. So, a 20% increase in some variable could have a significant impact on another variable, but the change could be so small that it's not worth worrying about, or it's not something that you'd normally care about. But it could be statistically significant. So, just important for our listeners to keep that in mind, who aren't used to these kind of words. Guest: Yeah. And this came up in a big news story very recently about: Does bacon cause cancer? And, all of us bacon lovers were very concerned about this particular result being reported in the news about bacon being linked to cancer. And also, there was some discussion about whether it was as strong as or stronger than smoking: 'Oh my gosh. How can that be?' Well, it turns out it was a statistically significant result, unlikely to have occurred by chance; but the effect size was not very large. And so that is this comparison, as you are talking about: The importance, how much it predicts, can be very small even if it's measured highly [?].
38:15	Russ: So, we're putting that to the side, because we're going to stick with statistical significance right now. So, let's go back to the 39% number. I interrupted you to clarify p-values. So, people made an assessment of what? Guest: Of whether they thought the replication was consistent--the finding and replication was consistent with the original. So, was it successful in reproducing that original result? And they just gave a yes/no answer. Russ: But, how did they make that call? Guest: So, just by looking at the evidence. And the correlational data suggest that mostly they used the p-value to make that assessment. And might have been influenced to some degree by what they observed as the effects of how strong a result was the original; how strong a result was the replication. Russ: So, the average finding, if I have this--let me get back to the summary. Right. So the average finding was half the size of the original. Guest: Yeah. Russ: So, that means that some of them were less than half, I assume? Guest: Yeah. Quite a few. Russ: Quite a few. Quite a bit less than half. They could still be statistically significant, though. Guest: They could. Russ: So, who made the call about whether that was a replication or not. Let's say you find some effect, it's claimed, you try to reproduce it; it's only a third as large but it's still statistically significant. Is that a confirmation or is that a failure to reproduce? Guest: Well, because we have 5 criteria, it's both. Because in some of those criteria, like the 36% of statistically significant results you would count as a success. In the effect-size comparison, it would look like much less of a success, although that's sort of continuous: it's not a discrete yes or no; it's a comparison of the effect size. And in the subjective assessment, it's whatever the replication team looking at it says, 'Well, yeah, it's smaller but they are showing the effect; we're showing the effect that they found. It's smaller but it's still the same effect, so we say it's a success.' Russ: So, let me restate it again, two lines from the abstract: 97% of the original studies had significant results, which is impressive for those other three that still got published, 3%; but 97% had significant results; 36% of the replications had significant results. So, that's about 39%. I assume. Doing the math in my head. Guest: Yeah. Russ: So, the failure to find a significant result is overwhelmingly going to be the measure of whether it was a successful reproduced result. Guest: Yeah. Those two criteria were the most strongly related to each other of the 5 different criteria we used to evaluate replications' success.
41:08	Russ: So, I want to step--is 39% a high number or a small number? It's interesting. Some would say that's appalling; some of us would say, 'Wow, 39%. Good for you. Good for psychology.' It is something of a wake-up call. I'm curious what kind of reaction you are getting--not from the authors; we'll talk about them in a minute--but from the field in general. When you've published this recent overview that summarizes the findings by saying, here's the way I would say it: About 60% of the studies failed to replicate; and about 40% held up, at least under this one test. What's the reaction of most people in psychology? Do they go, 'Yeah, I always knew my field was a fraud?' Or did they say, 'Wow. There's a lot more science here than I thought'? Is the glass half empty or half full? Guest: Yeah. My experience with the reaction to this paper so far has been that it operates as a Rorschach test--so you get a lot of information about people's prior beliefs, by how they evaluate the results of this paper. So, people that have thought that there are challenges and problems, look at this and say, 'See? Now we need to do all of these changes.' And 'Now we need to fix this; now we need to do that.' People who have not thought there was a problem, thought that this whole discussion has been overblown, saying, 'See? Replicators are incompetent.' So, there is somewhere in between. We don't know yet. But it is a paper that is serving the purpose of having done the research in the first place. Which is, to ground these debates, these priors, these different beliefs about the challenges, into something where we can actually dig in and start to unpack it. So, what's really great that's already happening is that all of these data are open; all of the methodology is open. And there are lots of people with very different prior beliefs about whether there is a problem or not that are digging into it--critiquing it, finding new things, finding things that suggest the problem is worse, finding things that suggest the problem isn't so bad. Based on these data. But it's a data-driven discussion. Now, they're hypothesis generating, right? This is exploratory at this point, looking at these data. But nonetheless, it's using data rather than just whatever people think based on their Internet, Google experience. Russ: [43:39] So, I got interested in this decline issue from an article that Jonah Lehrer published in The New Yorker. Since then Jonah's fallen on hard times. But there was an article about Jonathan Schooler, the psychologist who had found this provocative effect, counterintuitive effect; and then went back--somewhat akin to your shades of gray study--and by the way, that's what you'd have called your paper if it had been published--'50 Shades of Grey'. It's a shame. Who knows? Maybe you'll resurrect it. But anyway, Jonathan Schooner-- Guest: Schooler. Russ: Schooler, excuse me--Jonathan Schooler found that when you went back to re-test a result of his that it got smaller. I think by about half. Then he went and re-tested it again a few years later; it got smaller again. And this came to known as the 'decline effect.' Do you have any thoughts on that? Especially given that you've found that now in this paper? Why should we observe a decline? It's hard to imagine that somehow over the last x years, people aren't as good at y or z, whatever is being tested, that the effect has somehow diminished, by sunspots or whatever. It's alarming. Guest: Yeah. So, we don't know what the answer is. But the most available explanation, meaning the one that drops right out of this whole debate about reproducibility, is that the decline effect is a function of the publication bias: This sieve of what gets through from actual research being done to research being published requires statistical significance--getting that p-value below .05. But it's also done in a context where research is pervasively underpowered. We don't collect enough data for testing the questions and the effect magnitudes that we can expect in the kinds of research that we do. This is a pervasive problem; it's been discussed since the 1960s. But the consequence of those two things happening simultaneously--low power of research and requirement for statistical significance--means that the only way you can get statistical significance is to take advantage of chance and happen to observe larger-than-reality, larger-than-what-are real effects. That, just because I run 5 studies, I am investigating a true effect: one of those will happen to be larger than it really is and obtain that statistical significance. And that's the one that gets published. So, if that is occurring at a pervasive scale, then the results of the reproducibility project are exactly what you would expect as a consequence. Which is: most research is actually when you just do it and report it regardless of whether it's significant or not, is going to estimate smaller effects than those few that get through to the publication. I think. Now, I'll say that that's not necessarily the only explanation, because there's other ways that we could understand the decline effect. And there's lots of reasonable hypotheses. And many of them can contribute. For example, we could have observed smaller effects in the replications because the replications were not tuned to the particulars of the sample and setting in which they were conducted, like the original research was. Right? So, it's possible that those original authors designed their study in such a way that it would obtain the largest possible effect they could in that original setting. And then, when the replication did it in the new setting, they didn't change it enough or tune it enough so that they would get that strong an effect. Because there's other factors that bring things in this whole [?] Russ: [?] tunes in ways that aren't observable, that aren't listed in the protocol. Guest: Yeah, or that you wouldn't know to change. That are subtleties. Right? What kind of language you use to communicate to those kinds of participation[?] effects. Russ: Facial expressions. My favorite example, this is the--I may have mentioned this before on the program but it's so good I should use it at least once a year--which is the studies that showed that people have ESP--they have Extra Sensory Perception. And there was a test done with cards. So, I put a card down on the table, and I say, 'Guess what suit this is.' And people would guess. And then, some people were shown to have a much higher than 25% ability to identify the suit. But it turned out that what had happened was, in doing the experiment, the people who had done the experiment would do a practice. So, I'd say, 'Guess what suit--I want to get you ready for the experiment. Guess what suit this is.' 'A heart.' 'Oh, you got that one right. How about this next one?' 'A club.' 'Oh, you're two for two. That's great. Well, let's keep going.' Whereas, if they didn't get the first few right, they'd say, 'Okay. Now we'll start.' Once you do that, you kind of--and if you didn't write that down, and if you didn't have a video camera, you couldn't reproduce the finding that people have Extra-Sensory Perception. You'd find it didn't.
48:48	Russ: Anyway, I want to ask a question that's not in the science paper, in the summary of the results. Obviously, what you did is incredibly--I think it's incredibly important. And really, it's glorious even if you don't get any glory. And I think you do get some glory, so I'm very happy for you. Guest: Oh, I have been, but the 269 others have not gotten their share of glory. So, I am benefiting from all of their work in actually getting this done. Russ: Well, they are listed. They are listed. Guest: They are listed. You dig down deep, you can find their names somewhere. But they did this because they believed in the importance of the question and volunteered their time. They provided a service. Russ: And bless all of them. Guest: Yeah. Russ: But the question I wanted to ask is this: So you found out something very general, that's interesting about the reliability of psychology of [?] "on average." If you did it again, with a different hundred articles, it would be interesting. You might get 36% reliability, or 48%, or 18%. Obviously if you did not--if you had an infinite budget and you could try to replicate some of the larger studies, you'd get a different result. But you also may have learned--I assume you learned something else that I didn't see in the Science article, which is: There are certain stylized facts--and I'll choose behavioral economics because it's the interface between our two disciplines. So, there are certain stylized facts that have emerged from lots and lots of experiments that people assume to be reliable. Just to take one example that people feel differently about losses versus gains. So, that reaction to that is asymmetric. So, there's a bunch of those findings. And I listed one earlier, in psychology--that when you hear the words 'old' or 'senior' or 'AARP (American Association of Retired Persons)' you start to move more slowly. So, those are less confirmed, obviously, because they might be based on one study, one paper. Is there any pattern in the un-reproduced studies that shouts out at you from the hundred replication attempts? Did it turn out that, 'Well, the 60% that failed, roughly most of those were in such-and-such an area? I guess those were questionable.' Did you learn anything qualitative that you can share? Guest: Yeah. We did exploratory analysis of different characteristics of findings. Because it would be highly beneficial to be able to predict challenges from reproducibility. And that doesn't explain, but it can at least predict when they occur. So, one that has generated the most discussion, at least within the field, is the fact that we successfully reproduced findings in cognitive psychology, by looking at the basic operations of the mind perception, memory, attention. We reproduced those effects at twice the rate as we reproduced social psychology effects--[?], understanding ourselves, interactions with others, stereotyping beliefs about people. So, why? The obvious question is: Why? Why is it that we observe twice the rate in cognitive than social? Is it that social is doing worse practices, and so they are not--they are not-- Russ: It's obvious. Brian, those the [?] honest people. Sorry, it's your field; you've got face it. Guest: So, there we go. So, you've solved the problem. So, it is possible that there is something about the research practices. It's possible there's something about the people. It's possible it's something about the kind of things that are being studied. Russ: Yeah. Probably that. Guest: One obvious thing that at the latter end is that social psychology is investing in things about the social experience. About the context in which behavior occurs. And doing replications necessarily changes the context. So, it's quite possible that the replications have much meager evidence than the originals because they didn't, weren't sufficiently attentive or weren't able to address those [?] in context. So it really is very [?] a cross[?] effect. So that's a possible explanation. A lot of people, especially those that are involved in social psychology research, will gravitate to that kind of possibility. Of course, we don't know if that's the explanation, because other possibilities are also plausible. Another difference between social and cognitive psychology is that most or many cognitive psychology research applications use what are called 'within subject designs': I am my own control because I am in multiple conditions, multiple treatments of the experiment. You flash me words on the left side of the screen; and then you flash me words on the right side of the screen. And you compare my responses between the two. That is the highly-powered design because it reduces a lot of error. You don't have to compare me in one treatment issue to you in a different treatment issue. Russ: For sure. Guest: But social psychology tends to use more between-subjects design. You can't, in the same experiment easily increase my self-esteem and lower my self-esteem. And have me feel like that makes any sense at all. And so we need to use more between-subject designs. And those tend to be lower powered. And so we may be seeing a consequence of some of the methodological differences between fields, produce that kind of thing[?].
54:27	Russ: So there's a--there's a group at Berkeley, it's the Berkeley Initiative for Transparency in the Social Sciences. And I was happy to see that they recently launched the Leamer/Rosenthal Prizes--prizes, plural, for open social science awards that honor research that is particularly transparent and open about reproducibility. Being an economist and a fan of Ed Leamer-I don't know Rosenthal but I was happy to see this. And Leamer's essay on the launching of the prize is phenomenal. I recommend to people to put a link up to it. I notice that you are on the Executive Committee of the Berkeley Initiative. What's going on? Why is this issue getting so much attention these days? Five or ten years ago when you and I would grouse about this, complain, be worried about it, people would say, 'Yeah, yeah, yeah.' And nobody would--they'd just keep publishing. All of a sudden, social science--and I think science is next--social science is very self-aware. Social scientists are very self-aware. This issue: why do you think it's now? Guest: It is a curious phenomenon, because methodologists have been talking about this for 40 or 50 years, writing papers, now and again. The same issues have been coming up over and over. Bob Rosenthal, the other part of the Prize, has coined the term 'File Drawer Effect'. And it was talking about how lots of the research disappears in the file drawer, and that's creating a bias--this publication bias. And talks about all these issues, in the 1970s. So, why is it now? Well, there's easy ways to identify reasons within particular fields. But the weird thing is it's happening across all fields. And it is, has extended to the life sciences and physical sciences, dramatically so. And so I think what has changed--and I can't really say this is the cause, but what is different this time, and how it has become such a discussion is that issues of reproducibility have come up in many different places, sort of all at the same time. And that has produced a collective discussion that everyone is involved in. It's not just methodologists. It's not just practicing researchers. It's also funders. It's also federal governments are paying attention to this as an issue. And it's also technology groups and organizations that are trying to do things to address it. And so it is as if there was some tipping point that just turned this into--we've got to deal with this. No more ignoring it. All of this is sort of collectively moving us to action. And, you know, we talk about particular events in different fields. But I think a large part, what is common across those fields, is it's become a lot easier to examine the current published literature. The Internet has provided some value for scientists. And that is doing large-scale search and discovery of the credibility evidence in the published literature itself. And so a lot of the interesting studies that were done over the last 5 or 10 years really look at publication practices--publication evidence--and can make more general conclusions because of that, about the potential challenges. Russ: How big a problem is reproducibility in the physical sciences? I used to think this was my problem--meaning economics. And then I thought, 'Hey, whoa, psychology has got some issues.' And sure enough, of course they do. I was talking to an astronomer friend of mine, and he was bemoaning how dishonest and, not dishonest but biased, the results in his field are. And you think, well, Astronomy? Isn't it just black and white there? And the answer is: Of course it's not. There's the way that the data is filtered through. It's not just like--it's not accounting. It's not just saying how many apples are there in this bushel? Inevitably there were too many decisions to be made about what gets included in the data, how the data are measured and what gets into the data set and then how it is manipulated. In every field. So do you see this as going to be, as being a wake-up call for science generally? Guest: Yep. Yeah. It is pervasive across the sciences. And I think for two particular reasons. One is that the incentive challenges are the same across the scientists. Practicing scientists are competing very hard to get a very limited number of jobs. And then advance, through the career[?] ladder, in those jobs. Doesn't matter if you are a physicist, a chemist, an economist, a psychologist, or anybody else. Those pressures are there. And they all have a shared publication practice pipeline. There's some variations across disciplines in how that operates in particulars, but there is the same kind of [?] that people have to get. The second factor is that everybody is working on problems that they don't really understand the answers to yet. That's why they are working on them. So, it isn't that physicists keep going back to their office and keep saying, 'Oh, I'm okay I'm going to drop this feather and this bowling ball one more time,' and seeing which one lasts first. Do we use a vacuum or not? Oh my gosh, do we need this number, friction? Whatever it is. They are not doing that. They are in on problems where there is a lot of choice, there is a lot of uncertainty, things that are not understood about the phenomena. And so of all of that subjectivity, flexibility, decisions about what's important not to uphold--that happens everywhere. There's only that very limited set we can point to where this is a strong confirmation test that's happening, where there's a very strong model. And very strong expectations of what ought to occur or what not to occur, that can test against that model. Like high-energy particle physics. Spent billions of dollars on building tools in order to test some very important principles of how, what particles [?]--boson experiments, were critical experiments. They built replication into it. And very, very high standards of evidence. And it was testing against a very strong prediction. It was model science. And it cost billions and billions of dollars. That's not how physics works. So, these challenges aren't happening just in the domains that we are exist in, which don't yet have strong models. But even in places where models are mature [?]
1:01:10	Russ: So, what's next? What's going to be the next big project at the Center for Open Science? Guest: So, relating to the reproducibility project, and our discussion we just had, is that, we need to have more information about reproducibility across domains. And even deeper within domains. So, this project, even though it was a very big project, was just one study. It isn't definitive. And there are many things that might be changed, be improved, be deepened, and trying to understand, reproduce firmly. So, we have an active project that is modeled on the, with the recent project in psychology, in the cancer biology. So, we are doing 50 replications of permanent results in that field. And that's ongoing now. And then we've been talking to groups in other disciplines. We're doing similar kinds of projects in their disciplines. And we ware not experts in all of these different disciplines. We have more of the: How do you run a project like this? How do you do the administrative expertise? And then the just general expertise, reproducibility, regardless of the content of it? So, we are trying to help groups get these kinds of projects going. There is a team doing one of behavioral economics. We're not directly involved in that. We're excited to see what they find Russ: Yeah. The related area in economics is, the need to be looked at, some of the experimental results is development--the deworming work is under some seriour issues of reproducibility. And also the Decline Effect is there. And we also see the original study on deworming had these wonderful effects. Now it doesn't appear--they may not be as large, or they may not exist at all. So, it seems to be very, very important. Before we close, I just want to ask you, I wanted to hear you talk about the reaction of the author whose studies have been unconfirmed or unreproduced. I suppose it's possible that some of them had more than one study in 2008. In one of those three Journals. Did any of them have multiple failures or successes? And in general, have they screamed? Guest: On the first question I think there was only one or two people that had more than one study in the sample. And actually I don't know what the outcomes were for multiple ones, for [?]? But in terms of the reactions of the original authors, they've by and large been very positive. That doesn't mean that they haven't been skeptical. But by and large they've been positive and engaging with the project. It isn't just that people doing the reproducibility project that care about reproducibility. Most researchers care about it. And so there was a lot of good collegial interaction between original authors and replication teams. And doing the [?] tests. And of course lots of reasons to be skeptical and concerned, because they do have skin in the game, and this, in the current culture of science, people are, do feel some degree of ownership, and investment in the results that have come out of their research. And so the fact that they were by and large possibly engaged was just very encouraging to me about the state of science and the potential for addressing these challenges. Russ: I guess there would be less collegial if the journals withdrew [?] the articles ex post. You know: 'A Special Issue--We have to now--the following articles are not, no longer going to be in the online archive because they have been found to be unreprodocible.' Guest: Right. Yeah. That would not--and of course that would not be way be one replication--just one attempt. And it could have messed up. We don't know. So; and I should note: that it hasn't been uniformly been positive. There were challenging interactions in a number of cases. There were people who said, 'I don't think was done to the standards that I would have wanted it to be on in terms of replication of their study. And we are trying to surface all of that and make original offers and opportunity to talk about what their experience and expectations, and how they revise or not their beliefs about their research[?} that they had done, based on their replications, because that really shouldn't be part of the conversation. Replication is essential. And original authors have a perspective and often a [?] expertise in the particular thing they were studying. And so you have to have that be part of the conversation--you know, when a replication occurs. And term that conversation into one of puzzling over how can we get it right? It isn't a contest to see who is right. It's how can we work collectively to get it right in order to have a strong, credible base so we can all be confident and use to solve the social problems.

Time

Podcast Episode Highlights