Andrew Gelman on Social Science, Small Samples, and the Garden of the Forking Paths
Mar 20 2017

Statistician, blogger, and author Andrew Gelman of Columbia University talks with EconTalk host Russ Roberts about the challenges facing psychologists and economists when using small samples. On the surface, finding statistically significant results in a small sample would seem to be extremely impressive and would make one even more confident that a larger sample would find even stronger evidence. Yet, larger samples often fail to lead to replication. Gelman discusses how this phenomenon is rooted in the incentives built into human nature and the publication process. The conversation closes with a general discussion of the nature of empirical work in the social sciences.


READER COMMENTS

pyroseed13
Mar 20 2017 at 10:43am

This was definitely one of the most interesting EconTalk episodes I have heard in a while.

Dave
Mar 20 2017 at 12:24pm

I love when my worlds collide. I started following Andrew Gelman’s blog shortly after I started listening to EconTalk about nine or ten years ago. His blog has influenced me so much, and it’s a big reason why I changed careers and am now completing a Master’s degree in Statistics in my free time. Being a Gelman disciple has been a double-edged sword during school. On the one hand, I am able to critically evaluate the methods that are being taught to me. On the other hand, it makes me feel like much of what I’m being taught is poor practice, so why am I learning it? Luckily, much of the applied math I’ve learned is applicable to better methods as well.

Anyway, it was great to hear two of my favorites converse on such an interesting and important topic.

Nonlin_org
Mar 20 2017 at 1:45pm

An easy fix would be to describe the testing plan in detail prior to doing the test and then stick to that plan.

Alternatively, the p-value threshold should change if more than one useful outcome is analyzed – say from 5% to 5%^2 when either one of the two outcomes would be worth publishing.
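A minimal simulation sketch of why testing more than one outcome matters (illustrative sizes only, not tied to any study discussed here): if two independent outcomes are each tested at the 5% level and either one alone would count as publishable, the chance of a spurious "significant" finding roughly doubles.

```python
# Sketch: two outcomes, no true effect on either, each tested at ~5%.
# Reporting a "win" if either clears the bar inflates the false-positive rate.
import numpy as np

rng = np.random.default_rng(0)
n_sims, n, z_crit = 20_000, 50, 1.96   # 1.96 is the usual two-sided 5% cutoff

hits = 0
for _ in range(n_sims):
    significant = []
    for _outcome in range(2):                      # two independent outcomes
        treat, control = rng.normal(size=n), rng.normal(size=n)
        se = np.sqrt(treat.var(ddof=1) / n + control.var(ddof=1) / n)
        z = (treat.mean() - control.mean()) / se
        significant.append(abs(z) > z_crit)
    if any(significant):                           # either outcome counts as a "win"
        hits += 1

print(hits / n_sims)   # roughly 1 - 0.95**2, i.e. about 0.10 rather than 0.05
```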

Jonny Braathen
Mar 20 2017 at 2:49pm

When Alfred Nobel established his prizes, he wanted to give prizes to peace, literature, and science. He knew science and that is the reason there is no Nobel prize in economics.

Economics, like political science, is a social science, not a science.

In social science, the “scientists” have the conclusion first and then get their data. No good controlled double-blind testing.

Good example: free market economics – the conclusion is already there.

This is why social sciences like economics and political science are closer to religion than to chemistry and physics.

This is also why many economists are struggling to understand what a scientific theory is.
Economists will use the term theory like a thought, not the way a scientific theory is defined.
Just look at how many economists and especially free-market economists are looking at the theory of evolution or climate research.

You can easily tell that they are not trained scientists, but are coming from social science.

Alfred Nobel was and still is right.

Phil
Mar 20 2017 at 5:59pm

One issue with Russ Roberts’ “skeptical” stance is that it appears to be based on an inaccurate definition of “scientific”. This is a common problem with people who take issue with economics or other social sciences being presented as science; see Jonny Braathen’s contribution above.

I will not argue that the predictive power of, say, psychology, epidemiology or even biochemistry is in any way comparable to that of particle physics, but the vision of “science” that social sciences are held up against is one derived from high school physics myths about great scientists sitting down, writing down equations, and then testing them under perfectly controlled conditions.

A very telling example of that is when Dr. Roberts condemns the “10s of thousands of scientists” who analyze data, producing “inaccurate and misleading results”. Never mind the obvious question of whether they would really be producing better results if they used “pure thought” instead, producing “inaccurate and misleading results” is pretty much the history of science! With the exception of mathematicians, most scientists, including luminaries like Newton or Pauling, spent most of their time in blind alleys, producing nothing of value, or even misleading results, when looked at with enough hindsight. Their main works are littered with false statements and remarks that appear puzzling or incongruous to us, now that the one or two great ideas that they produced over their career have been fully digested and understood. Needless to say, for the average scientist the situation is even worse, when viewed from far enough. However, from the collective confusion, very, very slowly but surely, small pieces of insight emerge over years.

Being able to point to differences of opinion or to past or current controversies doesn’t diminish the scientific value of anything. There are enduring controversies in down-to-earth fields like classical mechanics! Indeed, from a strictly rigorous theoretical point of view, a lot of the methods and approximations commonly used in engineering are not completely “sound”, or at least not fully understood.

More importantly, the controversies among experts surrounding statistical analyses of things like the minimum wage or immigration, although they certainly arouse passions, are usually about effect sizes, and the ranges under discussion are typically quite a bit smaller than what laymen imagine.

“Humility”, a word Dr. Roberts often uses, doesn’t consist in simply refusing to ever stick your neck out and use the available data and theory to make educated guesses, instead relying on your “gut”. Humility means recognizing that when it comes to science (or anything, really), definitive truths will always remain elusive. A better understanding of a certain issue might not even come in one’s lifetime, but the alternative, giving up, is absurd.

The problem is with the idea that science somehow gives immutable, final answers, rather than representing our current best guess, however muddled, about a certain phenomenon.

Dr. Roberts claims that many (most?) scientists who use statistics do not understand this, and believe that once they have gone through the motions of some complicated technique, they have somehow definitively “proved” that their result is true. As a scientist, this has not been my experience at all. Indeed, scientists are keenly aware of the fact that, even in “hard” sciences, you can never “prove” something is true, let alone by statistical trickery. If anything, scientists are overly committed to the Popperian notion of hypothesis and falsification.

Once you have finished your analysis, then by all means look at your results critically if they seem surprising. But the alternative (going “with your gut” without even analyzing the data) is sure to be worse.

Russ Roberts
Mar 20 2017 at 9:41pm

Phil,

When I was talking about tens of thousands of scientists looking at data, I was actually thinking of social scientists trained in the techniques of econometrics and statistical analysis and taught to use statistical packages such as SPSS or SAS or STATA.

I wasn’t thinking so much about them looking at data. I was thinking of them sitting in classrooms as I was, being taught the tools of statistical analysis without a single word of warning or doubt about their reliability or verifiability. We were taught that this is the way empirical work gets done. I think the current evangelical zeal of the profession for these techniques is even stronger.

Virtually every use of those techniques is done with data generated in complex systems with multiple causes, imperfect measurement, and many omitted variables. The theory of econometrics deals with these issues of course. But in practice, these problems rarely get in the way of the conclusions we come to in the profession.

This is nothing like what goes on in a chemistry lab or biochemistry experiment. Yes, science is full of errors, mistaken theories, imperfect theories. But those theories get improved, narrowed, refuted, rejected, replaced. I understand so-called real science has biases, pet theories, and that data and evidence gets squeezed to confirm those biases and theories. But generally over time, progress gets made.

I am unconvinced that this happens in economics with the ever-widening river of empirical findings that is becoming the staple of the profession these days. My recent piece on these issues tries to lay out these problems more thoroughly.

Eric Schubert
Mar 20 2017 at 10:33pm

Much like Dave and pyroseed13 at the top of the comments here, I truly enjoy episodes focusing on this topic. I felt Russ’ metaphor of the tendency for statistical techniques and results to be used as a bludgeon for ideological opponents was dead on. I would add that today it seems there is an asymmetry in this regard when moving left to right across the political spectrum, with the former more likely to invoke such arguments to make their case. As someone who lives north of the 49th parallel, this observation is enough to make me skeptical of our government’s vocal commitment to “evidence-based policy”, as refreshing as it sounds.

On a related note, I was disappointed to see that neither James Heckman (State of Econometrics) nor Susan Athey (Machine Learning & Causation) made the Top 10 for 2016. Both were excellent interviews on this fascinating topic!

Nicholas
Mar 21 2017 at 7:20am

It is fantastic that you were able to get Andrew onto the show. I found the exposition of many issues such as the ‘garden of forking paths’ to be perhaps even clearer than when I have read about them.

However, as usual, I found your discussion of your own views about empirical work to be very frustrating. You say that you are looking for balance. Perhaps, but the view you communicate seems much closer to epistemic nihilism with respect to social and biological science.

I also find this extremely difficult to square with your claim that your position leads to humility. What it actually reminds me of is the position of the post-modernists I dealt with at university. On the one hand, knowledge is impossible, but nonetheless, the world works in x, y, z ways and we should do a, b, c.

That you are not really representing a ‘moderate’ view seemed even more stark in comparison to Andrew. Despite his incessant criticism of standard scientific practice, he presents a positive view of what should be done, and distinguishes clearly, with reasons, what is worthwhile and what isn’t. You never really talk this way, but operate in the realm of generalities.

Let me end on a positive note. While I find your views on statistics and research (among other things) to be frustrating, you still produce some of the most interesting and useful podcasts available.

Russ Roberts
Mar 21 2017 at 1:22pm

Nicholas,

I believe the minimum wage hurts low wage workers. I have evidence for that claim. Some of it is statistical. Some is more what might be called meta-evidence–an understanding of how employers respond to higher costs in other contexts and settings. How confident should I be in my belief? Yes, there is statistical evidence on the other side. How reliable is it? How robust? Can it be verified? Not really. So where does that leave me? That leaves me mildly confident that I’m right. Very mildly. If you want to call that nihilistic, feel free.

You mentioned biological science. I think there is a lot of science in biology but not so much in the modern state of epidemiology. This summary article should give anyone pause. Should we give up? No. But we should not be confident.

Jason Ward
Mar 21 2017 at 5:12pm

Hi Russ,

I am a regular listener and enter the comments section fray with some trepidation (it is rarely a constructive exercise in my limited experience). However, I wanted to just add a few thoughts that build on some of the back and forth above.

I am a (considerably) older person who left a relatively successful career to enter economics training by going back to community college with about 24 credit hours in the bank, carrying on through finishing an undergraduate program in math and econ, and entering a grad econ program. I am currently a candidate plugging away at research.

Based on my own recent experiences, I think that some of your introspection about your training may be fairly unrepresentative of the way things are taught these days. What I have gained from my formal training and interactions with faculty in seminars and informal discussions is that skepticism is warranted almost anytime. The current paradigm of internal versus external validity is a formal manifestation of this general tendency.

My formal training has had an overarching focus on awareness of potential biases and the absolute limitations of ever ameliorating them completely. In this sense, I have been trained to view any individual study that appears to have been careful and thorough about the potential for bias and the extent of its validity as one piece of suggestive evidence contributing marginal understanding about a broader economic question.

Furthermore, to your (regular) point about no one really changing their minds due to empirical research: I entered training in economics as a pretty left-leaning person politically, and the experience of being subjected to a litany of research has served to move me quite far towards something like “the center” as I have had to reconcile my priors with thoughtful and compelling research that has been hard to totally discount. So I see well done research and my posterior beliefs move a little. Many guests have tried to make this same point with you but it seems like it’s very hard for you to move from your priors. But this is (in my opinion) what the whole process of doing social science is supposed to be about.

I had just read this piece in the JEP on the peer review process, perhaps you have come across it? https://www.aeaweb.org/articles?id=10.1257/jep.31.1.231
I think this piece spells out many problems in the field of academic publishing and is quite thoughtful about how to address them, but the one claim it makes that I think is germane to the inflammatory proposition you made in this interview (“I’m tempted to reject all of these findings in economics, epidemiology, social psychology…” etc.) is that there is no place in the formal process of generating social science research for comments like “I just don’t believe it” or “It doesn’t pass the smell test.” The type of critique Gelman levels at econometrics is the right way to move forward: using scientifically sound critiques to raise issues about the quality of research. Your extreme example of just totally discounting empirical research and relying on your gut suggests there is no role for marginal progress in the social sciences. What is the equilibrium for academic research in a world where referees just get to “not believe it”?

Furthermore, I think contemporary political developments totally discounting the idea of facts and objective reality portend a really poor future for our society on myriad levels, so it is kind of painful to hear you move from your normal heavy skepticism to what I would characterize (as a longtime listener) as just disdain. It’s a slippery slope.

It is, of course, your show in the end, but when I hear interviews where you are a thoughtful, skeptical voice and you manage to walk the tightrope between being an advocate for your own ideology and being a careful interviewer, I really enjoy the show. In this Gelman interview it seemed like, for whatever reason, you let yourself off the leash. The interview was a lot more like listening to a pep rally for a single point of view, and it was really a lot less enjoyable for it. Gelman seemed like the source of restraint, and that doesn’t seem like the proper dynamic for EconTalk.

Just some food for thought I hope. I appreciate the show and have learned a lot from listening.

Respectfully,
Jason Ward

Phil
Mar 21 2017 at 5:23pm

Dr. Roberts: On an abstract level, it is of course always better to be humble about what we can know, but this applies equally to any method of inquiry. What I struggle to understand is what it is about statistical analysis specifically that you are sceptical about.

You mention social scientists sitting in classrooms, unaware of the fundamental limitations of the tools they learn about. First, this criticism could be levelled at any introductory class in any field whatsoever: basic introductions always paint a rosy picture of the field, and are light on details. If you study Newton’s three laws in high school, you are not encouraged to spend much time thinking about what an “inertial reference frame” really is, whether Newton’s first law follows from the second, or whether the concept of a rigid body really makes good sense, even though these are quite fundamental questions. Second, I am quite sure that if the social scientists you mention learn from the works of Athey, Gelman, Heckman, Angrist, and other econometricians you have had on this show and who have written popular texts or reviews, they will get a sense of how fickle statistical analysis is.

Another point you make is the large number of incorrect results that are published or announced. This seems to me quite distinct from the use of statistical analysis. Maybe people should be more careful about what they publish. Maybe this rate of publication is fine, and researchers just need to be more sceptical about new results. Or maybe the current “failure” rates and the scepticism level of researchers are just fine – as I noted above, it takes a lot of failures to get even modest results. Maybe popular publications and newspapers should be more careful about picking up stories based on very recent publications that announce breakthroughs. None of this seems to have anything to do with statistical analysis.

Kevin
Mar 22 2017 at 2:31pm

I am a lapsed epidemiologist. During my training I realized that most studies were terrible, that they found effects that were contradicted rather than replicated, and that progress was not clear.

As a shorthand I always consider the lower bound of the confidence interval the maximum effect and then generally consider studies as only preliminary. However, we do learn things without randomized studies- the effects just have to be huge and consistent. Smoking causes cancer and cardiovascular disease. Obesity leads to negative health consequences. There are a few other obscure ones – usually causes of cancer. But they are real and are progress for human health without human experiments.

Thankfully epidemiology has a strong corrective – randomized trials – that can test theories. Even they don’t always agree given nuances of the design, but broadly answer important questions.

The social sciences are nearly hopeless by comparison, given how hopeless we already are.

I generally agree with Phil – who makes an excellent point about how messy science is – but I also point out that there is currently no reputable scientist who believes the sun revolves around the earth, smoking is healthy, reducing hypertension increases heart attacks, or that radio waves cannot be used to broadcast audio information. However, in the social sciences there is little broad consensus on many questions, and even some topics about which there is normally broad consensus are thrown out for political purposes (the minimum wage, or now, even more horridly, free trade).

Gelman’s reminders are important and highlight the problems with replacing data with assumptions and models. But fundamentally the harder sciences are generally progressive – over time we are accumulating consistent knowledge. Occasionally it all gets rewritten, but in the meantime we can apply it successfully in the real world. The rewriting of Newton did not make trains stop working. Technology advances (boosted by the harder sciences) while the social sciences remain mired in debates.

Phil is right: 95% of all the science in all the fields being done right now is meaningless and makes no long-term contribution. If only we knew which 5% was important…but we don’t until we look back. So we forge on in the dark. But at least I see light at the end of the tunnel for many sciences.

Robert Swan
Mar 22 2017 at 7:04pm

Always good to get a dose of reality about statistical methods. The image of the “garden of the forking paths” was great. I think Prof. Gelman slipped when he defined p-hacking as hacking the data until the desired p value is reached. Most would see directly hacking the data as out-and-out fraud, not mere self-deception.

Ernest Rutherford is reputed to have said that if your experiment requires statistical analysis to see a result you should design a better experiment. I guess that’s easy for a physicist to say, but there are plenty of results today — deemed significant — where there is nothing at all to be seen on a graph. Better experiments needed.

When I did my degree in applied mathematics, statistics was prominent, but taught more or less as a cookbook of methods. I can’t say I liked it. Near the end of the course I enrolled to study Statistical Theory; I hoped it would fill in the foundation that I felt was missing. Perhaps tellingly, the subject was cancelled. Apparently there had been only one applicant.

Since then I have read enough to appreciate that statistics is, at its heart, a very solidly derived branch of pure mathematics. For me, having a clear understanding how the various distributions (Chi-Squared, etc.) were derived made their limitations clear too.

The biggest problem is that far too many people have only had the “cookbook” training. Sure, their courses mentioned things like “uniform variance” and “independence”, but those are just window dressing (like all that nonsense about commutativity in high school). The real meat was the calculation. Add in today’s cheap computing power and you end up with a bunch of people thinking they’re doing statistical analysis when they’re actually just doing numerology.

Perhaps the computer can also be part of the solution. Rather than actually understanding the fairly heavy maths behind statistical distributions, you can achieve the same effect by using a computer to simulate the experiment with random numbers. A few simulation runs will show whether your result might have happened by mere chance. When the WHI breast cancer result hit the headlines I analysed it for my own interest by traditional methods. Took a fair bit of thought. More recently I wrote a simulation in the R stats package. It only took me a few minutes to write, and a few seconds to run. It gave much the same result — that the p value was right around 0.05 and hardly made a compelling case to terminate a decades long study.
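The "simulate it under the null" check described above can be sketched generically; the trial numbers below are made up for illustration and are not the WHI data the commenter analyzed.

```python
# Generic simulation check: given two arms with the same underlying event rate,
# how often does chance alone produce a gap in event counts at least as large
# as the one observed? (All counts here are hypothetical.)
import numpy as np

rng = np.random.default_rng(1)
n_per_arm = 8000                         # hypothetical participants per arm
events_treat, events_ctrl = 162, 128     # hypothetical observed event counts

pooled_rate = (events_treat + events_ctrl) / (2 * n_per_arm)
observed_gap = abs(events_treat - events_ctrl)

n_sims = 50_000
as_extreme = sum(
    abs(rng.binomial(n_per_arm, pooled_rate) - rng.binomial(n_per_arm, pooled_rate))
    >= observed_gap
    for _ in range(n_sims)
)
print(as_extreme / n_sims)   # simulated two-sided p-value; ~0.05 with these made-up counts
```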

A number of commenters are taking Russ to task for “nihilism”, “disdain”, etc. I don’t think this characterises Russ’s stance. Remember, he asked if statistics as it is taught is just providing “a cudgel, a club, a stick with which to beat your intellectually inferior opponents”. Who hasn’t seen statistics used in exactly that way? Is it not a good reason to doubt anything justified by statistics alone — especially in politically polarised areas?

Topper Kain
Mar 22 2017 at 7:48pm

Great discussion of some of the fundamental issues of science and “what we know.” One thing I wish had been discussed more is why .05 is the threshold, and what might happen if that threshold was moved to a different level, say .005. Wouldn’t having a much more demanding threshold limit the number of forking paths that result in the desired outcome? The .05 p value threshold is not consistent between sciences. I know parts of particle physics actually expect p-values in the 6 sigma range, or a p-value of roughly .00000001 (give or take a 0), to consider a discovery confirmed. I’m sure many social scientists would scream “it is impossible to reach those p-values in our field or budget.” I would argue that what would actually happen is that social scientists would become much more judicious in their research, pooling research funds and carefully (and collaboratively) designing and executing experiments.

Alternatively, social scientists could start designating results with different levels of “significance” – a p value of .05 could be “minimally significant”, a p value of .005 could be “significant”, etc. Additionally, encouraging the reporting of replication status could diminish the effect of tentative results. Saying “these results are minimally significant but unreplicated” is much less convincing to a lay audience than “these results are statistically significant.”
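As a rough reference for the sigma language above, here is a small conversion sketch (one-sided normal tail areas; note that the discovery convention usually cited for particle physics is 5 sigma, and the exact p-value depends on whether the test is one- or two-sided).

```python
# One-sided normal tail areas for a few "sigma" thresholds, to make the gap
# between the 0.05 convention and physics-style thresholds concrete.
from scipy.stats import norm

for sigma in (1.96, 3, 5, 6):
    print(f"{sigma} sigma -> one-sided p ~ {norm.sf(sigma):.1e}")
# 1.96 sigma -> ~2.5e-02  (the familiar two-sided 5% cutoff)
# 3 sigma    -> ~1.3e-03
# 5 sigma    -> ~2.9e-07  (the usual particle-physics discovery convention)
# 6 sigma    -> ~1.0e-09
```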

Jonny Braathen
Mar 23 2017 at 12:57am

Russ
Maybe you could help explain why so many economists are critical of science. Just look at the history: lead, smoking, evolution, and climate change.

You yourself have called the scientific theory of evolution just a theory (by the way, it is a scientific theory and a fact).

You wrote in support of the creationist Berlinski:

“People often speak about the theory of evolution as if it were a “fact” or “proven.” Alas, it is only a theory, a useful way of organizing our thinking about the real world. When theories stretch too far to accommodate the facts, a paradigm shift is usually forthcoming. As Mr. Berlinski notes, biologists have trouble imagining an alternative paradigm to evolution. This may explain the vehemence with which they greet criticism. Perhaps they too are uneasy with the emperor’s wardrobe.”

Maybe you should disclose this when you discuss science.


Steve Poppe
Mar 23 2017 at 10:22am

Russ,

I enjoyed Andrew Gelman’s talk, but it was confined to small samples. In this age of “big data” researchers increasingly use huge samples. The point’s been made that with a large enough sample even minuscule differences become “statistically significant” – another reason to take p levels with a grain of salt.

Now combine that idea with the point you have often made that the world is a complicated place. A kind of corollary is that in any two human subpopulations there are differences on almost any measure. So Belgian-Americans are over-represented as beer brewers, and under-represented as rabbis. No harm in that. But wait.

Enter the notion of disparate impact, which seems to be the theory that any statistically significant difference in a metric someone cares about, and that negatively affects some group someone cares about, is a bad thing, and should be remedied by social policy, specifically government action.

Do we not now foresee an interminable flood of disparate-impact claims stemming from big data and p < .05 significance levels?

It would be illuminating to have a guest on EconTalk explore the nexus of big data and disparate impact.

And a short comment on humility: It is in the area of making public policy recommendations that a good dose of humility among experts is most needed. As Nate Silver has said in his book The Signal and the Noise, “Both experts and lay people mistake more confident predictions for more accurate ones. But over-confidence is often the reason for failure.” If the best economists are still arguing about what caused the Great Depression, and if they cannot convincingly explain productivity, who are they to make sweeping recommendations on national economic policy affecting trillions of dollars of wealth and output?

Steve

PS Love EconTalk!

jw
Mar 23 2017 at 12:08pm

I hope that listeners appreciate what a fantastic podcast episode this was. Notes:

– Extreme skepticism should be the default position for any statistical evaluation of human behavior. Physics follows laws, humans – not so much.

– Not mentioned, but crucial to the evaluation of the validity of the hypothesis, is out-of-sample testing. For instance, was the Jamaica study followed up on 5, 10, 20 years later? Were the higher-income subjects STILL higher income? There is tremendous income mobility in the US (and possibly Jamaica) which would allow some subjects to temporarily be in a higher bracket.

– Physics is replicable, so six sigma is a reasonable standard. If the social sciences used this standard, publishing would cease (and grants and awarded PhDs would also cease). The replicability crisis is real in the social sciences. Some of this is due to the expense, but I feel that most is due to the lack of academic prestige associated with “merely” validating other work.

– There is no question that the mathematics behind statistics is sound. However, alpha (the “p” in p < 0.05) and statistical power are just generally agreed upon values; they are not holy writ. There are also no absolute standards for degrees of freedom. Also, the construction of social experiments is very loose compared to physics, as discussed. So when reading a peer-reviewed article, the math is generally safe to skip; that is the easy part to catch in the review process. The assumptions, setup, and conclusions are where the fun is (assuming that you get a kick out of finding confounding variables).

– The discussion about p-hacking was great. If in each case the researchers followed the classic “develop null hypothesis, collect experimental data, test data” and failed to see p < 0.05, they should start over completely with new data, as their mere exposure to the existing data will bias their development of a new hypothesis. But this is time consuming and expensive.

– With modern computing power, researchers can look at bazillions of data points and extract thousands of correlations that are statistically significant. This makes developing successful hypotheses and publishing papers so much easier as you are pretty sure of the results ahead of time (but that might be wrong, so I am sure that no one actually does this). [A small simulation sketch of this point appears after these notes.]

– Also not covered is the “drawer” bias. When researchers get results that conflict or even contradict their previous work or widely held biases, the study disappears into a drawer never to see the light of a journal.

– Pharmacology is not immune. For instance, statins are widely prescribed. They definitely reduce cholesterol. But that is NOT the same as reducing heart attacks or mortality. Cholesterol is a proxy, and new research shows that it is a poor proxy (in most cases, for some rare genetic diseases statins are a life saver). There is some evidence that statins help to prevent recurring heart attacks in men with prior heart attacks, but have almost no value in men without a history of CVD or over 60 or any women.

Here is the kicker – while statins have some effect in reducing heart attacks, they have NO effect on general mortality. For some reason, you don’t actually live longer, you just have fewer heart attacks (you die of other causes). You will not see this highlighted in the TV commercials.

Also not mentioned (these are not criticisms, just ideas for follow-up) is the concept of “numbers needed to treat”. The small increases in percentage effectiveness times the number of actual instances sometimes turn out to be a very large number. For instance, the number of 60 year old women needed to treat with statins over five years to save ONE life is 300,000. That’s 500M doses of statins to save ONE life.

– Out of sample testing is crucial to the financial industry. If your “model” doesn’t work, you can quickly find this out by checking your forecasts vs actual data (granted that there is no set standard for validating models as well). This also applies to the Fed, with legions of math and statistics PhD’s. Their one year ahead GDP forecast is known to be inaccurate (granted, a difficult problem), but it is ALWAYS higher than the resulting actual GDP. One would expect that if a statistical model exhibited these tendencies over four samples a year for 10 years, someone would have corrected the model by now to have a more random error term. Alternatively, one could conclude that the model’s purpose is not really to accurately forecast GDP, although trillions depend on it.

– Thank you for introducing me to Dr. Gelman’s blog, it is very entertaining, even for a dog person (although I must admit that I may not visit it much as my wife says that if I get any geekier we will never be invited to any more dinner parties).

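A sketch of the note above about extracting thousands of statistically significant correlations from big data (the sizes here are arbitrary and nothing real is being mined): screen enough pure-noise variables against a pure-noise outcome and a predictable share clears p < 0.05.

```python
# Screen 1,000 pure-noise predictors against a pure-noise outcome.
# At the 5% threshold, about 50 of them come out "statistically significant"
# even though nothing is going on.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n_obs, n_predictors = 200, 1000

y = rng.normal(size=n_obs)
X = rng.normal(size=(n_obs, n_predictors))

significant = sum(pearsonr(X[:, j], y)[1] < 0.05 for j in range(n_predictors))
print(significant)   # typically somewhere around 50
```
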
Greg Alder
Mar 23 2017 at 8:31pm

Russ, I thoroughly appreciate your skepticism.

Russ Roberts
Mar 24 2017 at 9:53am

Jonny Braathen,

Happy to “disclose” that I think evolution is a theory. Theories are generally confirmed or refuted by evidence. They are never proven.

Theories aren’t facts. They are theories.

Evolution is like much of economics–a way of organizing our thinking about the world. In economics, the theory of demand posits that people buy less of something when it gets more expensive, other factors held constant. If I see purchases increasing in the face of a higher price, I don’t reject the theory. I look for the other things that might have changed. If nothing has changed (almost impossible, but bear with me) then I have to consider that the theory no longer holds or that this is a special case. Sometimes we will find a special case, such as Giffen goods, that does not refute the theory but requires us to allow for this example or another. If special cases start to accumulate or if new facts come along that the theory struggles to explain, then eventually the theory is usually replaced by something more parsimonious that is often more accurate. Ptolemy’s geocentric theory of the solar system would be an example.

The theory of evolution has trouble explaining some phenomena. The Cambrian Explosion would be one example. It certainly doesn’t refute evolution but it is a challenge of sorts to the theory. The possible explanations are similar to the response of economic theory to the challenge of people buying more of something when its price rises–the theory helps you look for other factors that may have been ignored. In this case–dramatic changes in the environment, for example. We would then look for other kinds of evidence that would confirm or reject these changes.

As far as I understand it, we do not have a viable scientific alternative to Darwinian evolution. (Though crazy as it seems, there is some evidence now that Lamarck was not totally wrong.) Creationism is not science. It is a faith-based explanation of the world around us that is not amenable to verification using the scientific method. It’s not refutable. That doesn’t mean it’s false. It’s just not science.

Philosophers such as Nagel and Chalmers (who are both atheists by the way) have suggested that the failure of evolutionary theory to explain what Chalmers calls “the hard problem of consciousness” means that we will need a new theory of the origins of life. Here’s a New Yorker article on one of Nagel’s books.

I don’t know if a paradigm shift is coming for biology. We’ll see. My point in the letter you quoted and my point more generally, is that sometimes scientists (and economists) become so attached to their theories that those theories become dogmatic beliefs–they are no longer refutable. Happens all the time in economics but it happens in real science as well.

jw
Mar 24 2017 at 5:21pm

Interesting points, Russ.

Most people do not realize that gravity remains not only a theory, but our least understood fundamental force. In the same vein, they think that our DNA is constant throughout life, but it isn’t; it is mutating and the telomeres are getting shorter all the time. On top of that, how our genes are expressed within us, epigenetics (what they “switch” on and off), is constantly changing as well, often due to environmental stimuli.

But be careful with dismissing “creationism”, there are different types. I personally believe that God created the universe, but in 13.4B years. I have spent a lot of time reading the science and God is actually a higher probability than the infinite number of infinities that are required by current science to explain how we got here. A person can choose not to believe, but they shouldn’t say it is because they “believe in science”. The universe is a very weird place, and much, much more unlikely than God, yet here we are.

So instead of relying on mathematical constructs and models of human economic behavior, I have a great deal of respect for von Mises’ point of view that we should look at the incentives and logic behind economics and not try to over-mathematize it (à la Samuelson). (I just wish von Mises had picked a better term than “praxeology”…)

Jonny Braathen
Mar 25 2017 at 2:17pm

Russ

Evolution is both a scientific theory and a fact.

Only creationists, or readers/followers of creationist blogs/books/speakers, think the Cambrian Explosion is a problem for evolution.
Of course, none of these people are scientists submitting articles to journals or promoting their views at science conferences.

Interestingly, you see the same behavior among the so-called climate skeptics, a group that free market economists and think tanks love. No active research; mostly blogging, going on Fox News, or any activity other than real science.

I guess Alfred Nobel was right. Science follows the scientific method and, through hard work, creates scientific theories that can make predictions.

Economists and economics, together with the other social sciences, are making assumptions and offering just theories.

jw
Mar 25 2017 at 3:00pm

Jonny Braathen,

One must be careful about definitions. Do organisms evolve in at least some characteristics? Yes, that is a fact. Have organisms evolved over billions of years to create all current life forms from a random process abetted by natural selection? That is a theory. Are only creationists concerned about some aspects of Darwinian evolution? No, that is not a fact; evolutionary scientists are concerned as well.

You are much further off base when it comes to AGW. As in other EconTalk episodes, I have pointed out that there is absolutely no way to make a valid forecast on century-based timelines for geological time periods. Current AGW forecasts must be accurate (with NO revisions) for several centuries before one can even begin to rule out natural variations (the statistical noise in this week’s podcast). And considering the failure of the initial twenty years of long-term forecasts, they don’t inspire confidence.

Another concern is the requirement to “adjust” historical data (not mentioned this week, but always a red flag when looking at statistical work). I have read the original AGW papers on these adjustments and they are not valid; they make some pretty basic errors. Again, the math may be exact, but the assumptions, setup, and conclusions were extremely biased.

Rodney A Miller
Mar 25 2017 at 4:48pm

This episode confirms my bias that econtalk is one of the best podcasts in the podcast universe.

At risk of degrading the superb episode and followup comments (being the comment shark jumper), I offer the following.

I wonder if the problems of study design, statistics, and conclusions could be improved by having a science certification, like an organic certification. I think we would agree that if Gelman or his trained reviewers analyzed studies and placed the reviews alongside the published results, it would encourage better study design and statistics. The certifiers could provide the caveat about concretizing and overinflating the meaning of any one study in all of their reviews. This would increase the cost of science, but maybe a smaller number of better-quality studies, particularly in social science and economics, is called for.

Secondly, as Steve Poppe is suggesting, we will have better data in the future. We should be planning on population-based statistics, because it is just a matter of time before every economic transaction or health incidence is measured. Combining bank and point-of-sale data is possible now. The effect of a change in a public policy or dietary input will be undeniable because the change can be reversed and measured.

My example is from working in local government and arguing about the use of a unique approach to providing a service. Different cities and counties used different approaches to providing the same service. My peers and I had access to every resident’s address, and use of the address was required to provide the service or record service participants. We were able to say, without dispute, which approaches worked better, and local governments copied the more effective approaches and had higher levels of participation while providing the same service over time.

Nate
Mar 27 2017 at 1:23pm

Maybe a good way to think about the original priming study that spawned many “conceptual replications” is as a very powerful blueprint for producing publishable research. Ultimately it could provide a lot of value by illuminating in relief the differences between what’s publishable and what’s likely to be true. (Apologies if this is obvious.)

Daniel Barkalow
Mar 28 2017 at 3:11am

Say you flip a coin 5 times, and get all heads. The probability of this happening by chance with a fair coin is 1 in 32, so p < .05. So now you want to know how big an effect you have. What you get is that it was heads every time, so that’s what you report as the effect size. Your dinner party conversation goes that you’ve got a statistically significant result that your coin never lands on tails.

Obviously, this is completely wrong. (1) If you’d gotten all tails, you would have had p < .05 that the coin isn’t fair in the other direction; if you didn’t pick a hypothesis in advance, you’ve got a 1 in 16 chance of finding the result notable. (2) Even if you hypothesized in advance that the coin was biased toward heads, the most you can say as far as effect size is that, if the coin gave heads 54.9% of the time or less, the outcome you got would happen by chance only 1 time in 20; your original p value is for the proposition that the coin is biased, not that it always gives heads. Because your sample size is small, you can’t say much: your outcome wouldn’t be surprising with a coin that was only a little unfair.

Of course, if you then claim the coin gives 55% heads, and someone wants to replicate this finding, they need to design an experiment which could produce one of two results: (1) the outcome they get would only happen with a >55% heads coin 1 in 20 times, so your finding was wrong; (2) the outcome they get would only happen with a <55% heads coin 1 in 20 times, so your finding was right. (And they ought to account for the two hypotheses.) Just doing what you did again, even if the coin is, in fact, 55% heads, is pretty certain to give no statistically significant result, because the original study was underpowered.
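The arithmetic in this comment checks out; here is a quick way to verify it with exact binomial calculations (nothing beyond what the comment states).

```python
# Quick check of the coin arithmetic above with exact binomial probabilities.
from scipy.stats import binom

print(binom.pmf(5, 5, 0.5))      # 0.03125 = 1/32: five heads from a fair coin
print(2 * binom.pmf(5, 5, 0.5))  # 0.0625 = 1/16 if all-tails would also count

# A coin giving heads 54.9% of the time still produces 5/5 heads ~1 in 20 times,
# so the data only rule out biases below roughly that level.
print(binom.pmf(5, 5, 0.549))    # ~0.0499

# Replicating with the same n = 5 is nearly hopeless even if the coin truly
# gives 55% heads: the only "significant" outcome is another 5/5.
print(binom.pmf(5, 5, 0.55))     # ~0.0503 chance the replication "succeeds"
```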

In any case, any time this topic comes up, I want to cite the wonderful poster “Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction”. It’s half an explanation of how statistical analysis can go wrong, and half a psychology experiment on a dead fish.

Jesse C
Mar 28 2017 at 8:13pm

Great episode, but some of the more critical comments are unfair.

In particular, Jason Ward – you quoted Dr. Roberts (“I’m tempted to reject all these findings…”) as if he stated this as his approach to such research. Well, at least you included the word “tempted.” Why did you fail to quote this in its proper context – that is, as one of two alternative approaches? The rest of that statement that you omitted included:

…or, do I say, ‘Well, I’m going to keep an open mind. Some of them might replicate. Some of them might be true. And if so, how do I decide which ones?’ Help me out, Doctor.

That quote was out of context, so I disregard your points directly predicated upon it. Indeed, it makes me skeptical of the rest of your argument.

John DeMarco
Mar 31 2017 at 4:04pm

This was captivating – particularly for a mediocre math student like me, who regrets never taking the time to fully apprehend statistics. Too much human activity in the “civilized” world reacts to statistical persuasion not to question the theories underlying the assertions.

This was another EconTalk episode that got me in trouble for sitting in the car to finish the last 15 minutes while a late dinner with my wife waited. (But it was worth it!)

Thank you for such a wonderfully engaging series.

Rodney A Miller
Apr 5 2017 at 4:51pm

In the excellent podcast Conversations with Tyler, “Ep. 16: Joseph Henrich on cultural evolution, WEIRD societies, and life among two strange tribes,” Henrich points out how different WEIRD (Western, educated, industrialized, rich, democratic) cultures are from worldwide means. He says:

“We got to talking over lunch and we realized in each of the areas for which we were experts on, that Westerners were unusual compared to all the other populations that had been studied.

We thought this was interesting and we began to compile all the available data we could find where Western populations were compared against some larger global sample. What we found, in not every but a large number of important domains in the psychological and behavioral sciences, that Westerners were at the extreme end of the distribution.

This made us wary and I think it ought to make lots of people wary about the typical textbook conclusions that you would find in psychology textbooks. Much of behavioral economics, at least at the time, was based on running experiments on undergrads. It’s actually mostly American undergrads that are studied.”

[URL for the quote is https://medium.com/conversations-with-tyler/joe-henrich-culture-evolution-weird-psychology-social-norms-9756a97850ce –Econlib Ed.]

Dr. Duru
Apr 14 2017 at 12:31pm

I *think* I understand the problem, but I would have preferred more discussion of the proper methods of data analysis. Can you bring Gelman back specifically to focus on the GOOD practices?





AUDIO TRANSCRIPT

 

0:33

Intro. [Recording date: March 6, 2017.]

Russ Roberts: My guest is author and blogger, Andrew Gelman, professor of statistics and political science at Columbia University. Andrew is a very dangerous man, for me. As listeners know, I've been increasingly skeptical over the years about the reliability of various types of statistical analyses in psychology, economics, epidemiology. And coming across your work, Andrew, which I've done lately in reading your blog, you've confirmed a lot of my biases. Which is always a little bit dangerous. But in such an interesting way. So, I'm hoping to learn a lot, I'm hoping, in this conversation, along with our listeners. And at the end we'll talk about whether I've gone too far and become too comfortable. So, I want to start with something--the other point I want to make before we start is it's sometimes hard to talk about statistics and data over the phone. And in a podcast. We're going to do the best we can, without a whiteboard. But I'm hoping that both beginners and sophisticated users of statistics will find things of interest in our conversation. So, we start, though, with something very basic. Which is: statistical significance. And we are going to wonder about statistical significance in small samples. But let's just start with a definition. When economists, or psychologists, say, 'This result is statistically significant,' What do they usually have in mind?

Andrew Gelman: Statistically significant means you are observing a pattern in data which, if there was actually nothing going on and the data were just noise, the probability of seeing a pattern at least that extreme is less than 1 in 20.

Russ Roberts: And that 1 in 20, so, it's almost surely 95% of the time not due to the randomness of the noise. That's an arbitrary cut-point that has somehow emerged as a norm in published academic research. Correct?

Andrew Gelman: Correct. But you state it a little wrong. One of the challenges of statistical significance is it kind of answers the wrong question. And one way to see that is that when people define it, they tend to get the definition garbled. In fact, I have a published article where, I think it's either the first or second sentence completely garbles the definition of statistical significance. That was our article about the Garden of Forking Paths.

Russ Roberts: Okay, we're in good company here.

Andrew Gelman: Yeah. I blame the editor of the magazine, because they edited the thing. And of course it's certainly not my responsibility what comes out under my name.

Russ Roberts: No.

Andrew Gelman: I don't think I should ever take responsibility for that--

Russ Roberts: No, absolutely not.

Andrew Gelman: So, statistical significance does not say that your result is almost certainly not due to noise. It says: If there was nothing going on but noise, then the chance is only 1 in 20 that you'd see something as extreme or greater. And it's typically illustrated in textbooks with examples of coin-flipping. Um, like, if you flip a coin a hundred times, it's very unlikely that you'd see more than 60 heads. Or more than 60 tails. So, if you saw that, you'd say, 'Well, this is kind of odd. It doesn't seem consistent with my model, in which nothing's going on.'

Russ Roberts: Well, in that case, you'd presume--your model is that it is a fair coin, with a 50% chance of heads and a 50% chance of tails. So, if you consistently got 70 or 80 or 90 you'd start to wonder whether your model was, your assumption was correct. But--correct me if I'm wrong--1 in 20 times you will get more than 60 heads. So that, if that might just happen to be one of the times that that happens. That will happen, actually, 5% of the time, is what you are saying. Correct?

Andrew Gelman: Yeah. It's not literally 1 in 20 that you'll get more than 60, because those numbers were just approximate.

Russ Roberts: Sorry.

Andrew Gelman: No, no, I was just pulling--I mean, there's some cut point. Exactly. But. Right. So, if you are living in a world where sometimes you have random number generators and you are trying to say, 'Are the results consistent with a certain random number generator?' then the statistical significance test, it's doing that for you.
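For readers who want the exact numbers behind the coin-flip illustration (as Gelman notes, the round figures above are only approximate), here is a quick check.

```python
# Exact tail probabilities for the illustration: 100 flips of a fair coin.
from scipy.stats import binom

print(binom.sf(60, 100, 0.5))      # P(more than 60 heads) ~ 0.018
print(binom.sf(59, 100, 0.5))      # P(60 or more heads)   ~ 0.028
print(2 * binom.sf(60, 100, 0.5))  # two-sided (>60 heads or >60 tails) ~ 0.035
```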

4:54

Russ Roberts: But that's not how we use it in empirical work. So, your more accurate definition than mine was, I think, had at least two negatives. So, I was somewhat confused by it. You are someone who has written papers with statistically significant results, I think. So, let's go back to that more accurate definition. I run some analysis. I run an experiment. I have a hypothesis about what I'm going to find. I may come to that hypothesis after the fact--we'll come to that later. But I have some--in my published paper, I have a result. And it says that this claim, about this, say, difference or impact of some variable on something we care about--a public policy change, might be the minimum wage, it might be immigration, it might be male/female differences--this difference or this effect is statistically significant. And when people write that in published work, what do they have in mind?

Andrew Gelman: What they are saying--should I give an example?

Russ Roberts: Yeah; sure.

Andrew Gelman: So, a few years ago some economist did a study of early childhood intervention in Jamaica. And about 25 years ago, they went, they gathered, about I think it was 130 4-year-olds in Jamaica, and they divided them into a treatment group and a control group. And the treatment group--they--the control group--they did some helpful things for the kids' families. The control group, the treatment group, sorry--they had--a fairly intense 1-year intervention with the parent. Then they followed up the kids, 20 years later they looked at the kids' earnings, which is--earnings is this quaint term that economists use to describe how much money you make.

Russ Roberts: Yes.

Andrew Gelman: And they like to call it how much you earn.

Russ Roberts: Yes.

Andrew Gelman: And it turned out that the kids, in the treatment group, had 42% higher earnings than the kids in the control group.

Russ Roberts: And the intervention was when they were 4 years old--

Andrew Gelman: When they were 4 years old. And the idea--I mean, there is some vague theory behind it, that this is a time of life where, if the kids can be prepared it can make a big difference. It's controversial. There's people who don't believe it. So they did this study. And they did the study--we'll come back to the study. It's a great example. So, they found an estimate of 42%. And it was statistically significant. So, the statement goes as follows: Suppose that the treatment had no effect. So, suppose that they were giving, not even placebo. Like, just nothing. Like there was no difference between getting leaflets or whatever they were getting and getting the full treatment. Zero effect. There's still going to be randomness in the data because some kids are going to earn more than other kids, as they grow older. So, if you have no--if it's completely random and the treatment has zero effect whatsoever, not any effect at all, not a placebo, not nothing--then, you'd expect some level of variation. And it turns out that if you see an effect with those data, if you see an effect as large as 42%, there's a less than 5% chance that you see an effect that large just by chance.

Russ Roberts: Which encourages you to think that you found something that works.

Andrew Gelman: Right. It says that, well, here's two stories of the world. One is the treatment, really works; it's helping these kids. Another story of the world is, it's just random; it's just fluctuations in data. Statistical significance tests, the p-value rules out the hypothesis, seems to rule out the hypothesis that it's just capitalizing on noise. Now, it doesn't really, for reasons we can get into--

Russ Roberts: [?] want to--

Andrew Gelman: Right. That's an example. And I can give you another example, if you want.

8:41

Russ Roberts: I want to stop with that one. Because I just want to--before we go any further, I just want to talk about my favorite, one of my favorite things that I dislike. And I want to let you react to it. So, let's say I'm at a cocktail party; someone says, 'I think we should spend more on preschool education.' And I say, 'Well, it might be good. I don't know. It probably has some benefit. It depends on what it costs, and how you decide what's in that education.' And then the person, the other person says, 'Well, studies show that preschool education has a 42% impact on wages, even 20 years later.' And that 42% is really--I'm sure the actual number is even more precise than 42--it's 42.something. And now the burden of proof is on me. This is a scientific result. There was a peer-reviewed paper. And in fact as you point out, and I did my homework before this interview, one of the authors of that paper is Jim Heckman, who has got a Nobel Prize in economics. He's been a guest on EconTalk, to boot. And you're going to tell me that that result isn't reliable? So there is a certain magicness to statistical significance that people do invoke in policy discussions and debate.

Andrew Gelman: Yes. There is a magic. I think that your hypothetical person who talks to you at a cocktail party, I would not agree that the evidence is so strong. I wouldn't want to personalize it with respect to Jim Heckman.

Russ Roberts: Yeah, neither would I. He's a good man--

Andrew Gelman: What he's doing, he's doing standard practice. And so, I wouldn't--I would hope that he could do better. But this is kind of what everybody's doing.

Russ Roberts: So, what's wrong with that conclusion? Why might one challenge that 42% result--which, as you point out, had only 132 observations? And this is where one of my biggest 'Aha!' moments came from reading your stuff. Because I would have thought, 'Wow. Only 130 observations'--that means you have to divide it in half: 65 and 65, roughly. I assume. Maybe people dropped out; maybe you lost track of some people. But you have a small sample--that's very noisy, usually; very imprecise--and you still found statistical significance. That means, 'Wow, if you'd had a big sample you'd have found an even more reliable effect.'

Andrew Gelman: Um, yes. You're using what we call the 'That which does not kill my statistical significance makes it stronger' fallacy. We can talk about that, too.

Russ Roberts: Yeah. Explain.

Andrew Gelman: Well, to explain that we probably need to do more background. So, let's get to that. But let me say here that there are two problems with the claim. The first problem is that it's not true that if nothing were going on there's less than a 5% chance that you'd see something so extreme. That's actually not correct. Actually, if nothing's going on, your chance of finding something "statistically significant" is actually much more than 5%, because of what some psychology researchers refer to as 'researcher degrees of freedom,' or 'p-hacking.' Or, what I call the Garden of Forking Paths. Basically, there are many different analyses that you could do of any data set. And you kind of only need one that's statistically significant to get it to be publishable. It's a little bit like a lottery where there's a 1 in 20 chance of a winning ticket, but you get to keep buying lottery tickets until you win. So that's half of it. The other half is that the estimate of 42% is an over-estimate--in statistics jargon, it's a biased estimate. When you report things that are statistically significant, by definition, or by construction, a statistically significant estimate has to be large. Under typical calculations, it has to be at least 2 standard errors away from zero. And the standard error is something calculated based on the design. This particular study maybe had a standard error of about 20%--so, 42% is about two standard errors away from zero. Because of selection bias, the only things that get--I wouldn't say the only things--

Russ Roberts: yeah, almost--

Andrew Gelman: typically--well, people do publish studies where they say, 'Hey, we shot something down.' Okay?

Russ Roberts: Yep.

Andrew Gelman: But when studies are reported as a success, they are almost always reported as statistically significant. So, if you design a study with a small sample that's highly variable--as of course studies of kids and adults are: people are highly variable creatures--you'll have a large standard error. Which means any result that's possibly statistically significant has to be large. So, you are using a statistical procedure which, a) has more than a 5% chance of giving you a positive finding even if nothing is going on, and b) whether or not something is going on, the estimate is going to be an over-estimate. So I don't believe that 42% number, because of the procedure used to create it.
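A minimal simulation sketch of the two problems Gelman names, with illustrative numbers that are assumptions rather than figures from the Jamaica study (a standard error of 20 percentage points, about 10 plausible analyses, and a modest true effect of 5 points):

```python
# Sketch of (a) the forking-paths multiplicity problem and (b) the "significant, therefore
# exaggerated" bias. All numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
se = 20.0           # assumed standard error, in percentage points
true_effect = 5.0   # assumed modest true effect
n_sims = 100_000

# (a) Even with zero true effect, trying many analyses inflates the false-positive rate.
k_analyses = 10
null_estimates = rng.normal(0.0, se, size=(n_sims, k_analyses))
any_significant = np.mean(np.any(np.abs(null_estimates) > 1.96 * se, axis=1))
print("Chance of at least one 'significant' analysis under a zero effect:", any_significant)  # ~0.40

# (b) Among estimates that clear the significance bar, the magnitude is badly overstated.
estimates = rng.normal(true_effect, se, size=n_sims)
significant = estimates[np.abs(estimates) > 1.96 * se]
print("Share of studies reaching significance:", significant.size / n_sims)    # a small share
print("Average size of 'significant' estimates:", np.abs(significant).mean())  # ~45-50, vs. a true 5
```

The point of the sketch is the shape of the result, not the particular numbers: with a noisy design, the estimates that happen to reach significance are the ones that happened to come out large.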

14:00

Russ Roberts: Now, the people who did the study were aware of some of these issues: in fact, aware of all of them, really. One of the ways they tried to keep down the standard error--the imprecision of the estimate that's inevitable with a small, finite sample--is that the parents of these children were chosen to be somewhat similar. Right? They were low-income parents. Now, low-income parents yield high-income children sometimes. And sometimes not. As you point out, there's a lot of variation. It just has nothing to do with the treatment effect. And everyone understands that. So, when you say that the standard error is 20%, it's a way of saying in statistical jargon that: Of course, there's going to be some variation--even if we had all the children come from parents with the same income--literally the same, the same within a few dollars--they'd still have variation as they grew up, because of random life events, skills, things you can't observe, things you can't control for. And so, as you point out, you are trying to say, well, but if it's 42%, that must be really big. And your point is that, well, it kind of had to be or it wouldn't be published. So, what we're observing is a non-random sample of the results measuring this impact. Is that a good way to say it?

Andrew Gelman: Yes. And let me say--I don't know that the standard error was exactly 20%, in case anyone wants to look that up--

Russ Roberts: No, no--

Andrew Gelman: I just know that 42% was statistically significant--so, I'll tell you a few things about the study. Some of the kids went to other countries, and I think they on average had higher incomes. And the percentage who went to other countries was different for the treatment and the control group. Now, that's not necessarily a bad thing--maybe part of the treatment is to encourage people to move. Or maybe going to another country is kind of a random thing that you might want to control for. So, there's sort of a degree of freedom in the analysis right there. Another degree of freedom in the analysis is that they actually had 4 groups, not 2 groups, because, if I'm remembering it correctly, they crossed the intervention with some dietary intervention, I think, giving minerals. I can't remember the details. But I think they concluded that the dietary intervention didn't really seem to have an effect. So they averaged over that. But of course, had they found something, that would have been reportable. The pre-print actually had 42%; and then in the published paper, I think the estimate was more like 25%. So, there's a lot. And I guess the standard error was lower, too. So a lot depends on how you actually do the regression--which I'm sure you're aware of, from your experience. Well, let me just sort of think: it's not that any of their choices were wrong. I don't think that what they did with the people who moved to other countries was necessarily a bad choice. You have to choose how to analyze it. I don't think it was necessarily wrong to aggregate over the intervention that didn't seem to make a difference. Certainly I don't think it's wrong to run a regression analysis--I've written a whole book about regression. It's not that any of the analyses are wrong. It's that, when you do that, given that you kind of know the goal is to get a win, you are more likely to find statistical significance than you would be otherwise. And I actually have written about this--there's a term, 'p-hacking,' which is that people hack the data until they get p less than .05, and they can publish. And I have no reason to think the authors of this paper were "p-hacking." It's not a matter of people cheating. It's not a matter of people craftily trying to manipulate the system. I mean, the guys who wrote this paper have very secure careers. If they didn't think the effect was large, they would have no motivation to exaggerate it. But they are using statistical procedures which happen to be biased, because of selection. And it's hard to avoid that. Just like if you are a doctor and you don't blind yourself: you can have all the good will in the world, but you are using a biased procedure, and it's hard to correct for bias.

18:14

Russ Roberts: It's a deep, really deep point. And I really like your distinction between p-hacking and the garden of the forking paths. And we'll come back to that phrase--garden of the forking paths--to make a little clearer why that's the phrase that you use. But, p-hacking has a negative connotation. It sounds corrupt. It sounds like you've done something--as you've said, either dishonest or fraudulent. And, tragically, it's not. It's just common research practice that if you run an analysis and you don't get an interesting result, your natural inclination is to tinker. 'What if we try a different specification? What if we throw out the people who moved? What if I treated the people who moved to the Western hemisphere differently from the people who moved to the Eastern hemisphere? What if I--?' There are so many choices. And that's the garden of the forking paths: you have so many decision nodes that you inevitably have to make in these kinds of analyses where there's a lot going on--and the world's a complicated place--that your natural inclination is to try different stuff. And the economics version of this is Ed Leamer's paper, "Let's Take the Con Out of Econometrics," where he basically says: When you are doing this, you no longer have the situation where the classical statistical test of p < .05 is the right one. Because it's not a one-time thing. You've made all these other choices.

Andrew Gelman: Let me pick up on that. Because this relates to this concept of the replication crisis. Before I get into that, let me also interject that Uri Simonsohn and his colleagues, [?] psychologists who coined the terms p-hacking and researcher degrees of freedom--if you read their paper, they never say that p-hacking is cheating. I mean, they are pretty clear. So, I don't like the term 'p-hacking' because I think it implies cheating. But I just want to say that the people who came up with the term were quite scrupulous. They are, I think, actually, nicer than I am. Their paper from 2011 was called "False Positive Psychology"--about how you can get false positives through p-hacking. I wrote something on the blog and I said how they used that to mock--there's a sub-field of psychology called Positive Psychology, which is about how[?] psychology can help you. Which happens to be plagued with studies that are flawed. And I wrote that the title "False Positive Psychology" was a play on words. And Uri Simonsohn emailed me and said, 'No! They had no meaning to--

Russ Roberts: It was an accident--

Andrew Gelman: he just--he is a nice guy; he wasn't doing that. Now, let me come back to--so, there's something called a Replication Crisis in Psychology that--

Russ Roberts: We've interviewed Brian Nosek a couple of times on the program.

Andrew Gelman: Okay.

Russ Roberts: So, we're into it. But describe what it is. And it's very important.

Andrew Gelman: People have done studies which are published and appear completely successful--they have good p-values. Later on, people try to replicate them, and the replications fail. And the question is: Why does the replication fail? When a replication fails, it's natural to say, 'How does the new study differ from the old study?' Actually, though, typically the main way the new and the old studies differ is that the new study is controlled. Meaning, you actually know ahead of time what you are going to look for. Whereas the old study is uncontrolled. So, you can do p-hacking on the old study but not on the new study. And Nosek himself--he must have told you--he did a study, the so-called 50 stages--

Russ Roberts: "50 Shades of Gray"--

Andrew Gelman: So he, with his own study that he was ready to publish with his collaborators--they replicated it; and it didn't replicate. And they realized they had p-hacked without realizing it. So, let me get back to the early childhood intervention. One way you could handle this result is you could say, 'It's an interesting study. Great. Okay. There are forking paths. We don't know if we believe this result, so let's replicate.' The trouble is that to replicate this would mean waiting another 25 years.

Russ Roberts: Yeah.

Andrew Gelman: And what are you going to do in the meantime? So, replication is a lot easier in psychology than it is in economics or political science. We can't just say, 'I want to learn more about international relations, so let's start a few more wars. And rip up some treaties. And see what happens.' Or, for the economy, 'Let's create a couple more recessions and create some discontinuities.' It might happen, but people aren't doing it on purpose.

Russ Roberts: It's hard to get a large number of cases where the other things are constant. There's always the potential to say, 'This time was different,' because this depression, or this recession, was started by the housing sector, or whatever--so it actually can't even be generalized from all those other ones.

Andrew Gelman: Yeah. You just can't replicate it. So, it puts us--what I'd like to get back to in these examples--it puts us in a difficult position. Because on the one hand, I think these claims are way overstated. On the other hand, you have to do something different. I'd like to share two more examples, if there's a chance. But you tell me when is the best time.

23:17

Russ Roberts: Well, I want to talk about, in the psychology literature particularly, this issue of priming, that was recently talked about. But we can talk about lots of things. So, what do you want to talk about?

Andrew Gelman: I want to give three examples. So the first example was something that really matters, where there's a lot of belief that early childhood intervention should work--although the number of 42% sounds a little high--and also where people actually care about how much it helps. It's not enough to say--even if you could somehow prove beyond a shadow of a doubt that it had a positive effect, you'd need to know how much of an effect it is. Because it's always being compared to other potential uses of tax dollars. Or individual dollars. So, I want to bring up two other examples. The second example is from a few years ago. There was a psychologist at Cornell University who did an experiment on Cornell students testing ESP (Extra Sensory Perception). And he wrote a paper finding that these students could foretell the future. It was one of these lab experiments--I don't remember the details, but they could click on something, and it was one of these things where you could only know the right answer after you clicked on it. And he claimed that they were predicting the future. And if you look carefully--

Russ Roberts: Andrew, I've got to say before you go on--when I saw the study, the articles on that, I thought it was from the Onion. But, it's evidently a real paper. Any one of these, by the way, strikes me as an Onion article--that people named Dennis are more likely to be dentists.

Andrew Gelman: I was going to get to that one.

Russ Roberts: Yeah, well, go ahead. Go with the ESP first.

Andrew Gelman: So, the early childhood intervention is certainly no Onion article. The ESP article--it was published in the Journal of Personality and Social Psychology, which is one of the top journals in the field. Now, when it came out, the take on it was that it was an impeccably done study; and, sure, most people didn't believe it. I don't even think the editor of the journal believed it. They published it nonetheless. Why did they publish it? Part of it is, like, we're scientists and we don't want to be suppressing stuff just because we don't believe it. But part of it was the take on it--which I disagree with, by the way--that this was an impeccably done study, high-quality research; it had to be published, because if you are publishing these other things you have to publish this, too. And there's something wrong. Like, once it came out, there's obviously something wrong there. Like, what did they do wrong? It was like a big mystery. Oh, and by the way: the paper was featured, among other places, completely uncritically on the Freakonomics blog.

Russ Roberts: I'm sure it made the front page of newspapers and the nightly news--

Andrew Gelman: It was on the front page of the New York Times. Yeah. So in the newspaper--they were more careful in the newspaper than in Freakonomics, and they wrote something like, 'People don't really believe it, but this is a conundrum.' If you look at the paper carefully, it had so many forking paths: there's so much p-hacking--almost every paragraph in the results section--they try one thing, it doesn't work. They try something else. It's the opposite of a controlled study. The experiment was controlled: they randomly assigned treatments. But then the analysis was completely uncontrolled. It's super-clear that they had many more than 20 things they could have done for every section, for every experiment. It's not at all a surprise that they could have got statistical significance. And what's funny is when it came out, a lot of people--like, the journal editor--were like, 'Oh, this is solid work.' Well, like, that's what people do in psychology. This is a standard thing. But when you look at it carefully it's completely--it was terrible.

Russ Roberts: So, in that example--what's interesting about that for me is that you say, in the results, it was clear to you. But of course in retrospect, in many published studies--the phrase I like is: we don't get to be in the kitchen with the statistician, the economist, the psychologist. We don't know what was accepted and rejected. So, one of my favorites is that baseball players whose names start with 'K' are more likely to strike out. Well, did you look at basketball players and see whether the ones whose names start with 'A' are more likely to have assists? Did you look at--how many things did you look at? And if you don't tell me that--'K' is the scoring letter for strikeout, for those listening at home who are not from America, or who don't follow baseball, or who don't score--keep track of the game via scorecard; 'K' is the shorthand abbreviation for strikeout--which, of course, is funny because I'm sure some athletes don't know that either. But the claim was that they are more likely to strike out. I don't know the full range of things that the author has tested for unless they give me what I've started to call the GoPro--you wear the head camera--and I get to see all your regressions, and all your different specifications, and all the assumptions you made about the sample, and who you excluded, and what outliers. Now, sometimes you get some of that detail. Sometimes authors will tell you.

Andrew Gelman: This is like cops--like research [? audio garbled--Econlib Ed.]

Russ Roberts: Exactly.

Andrew Gelman: So, it's actually worse than that. It's not just all the analyses you did. It's all the analyses you could have done. And so, some people wrote a paper, and they had a statistically significant result, and I didn't believe it; and I gave all these reasons, and I said how it's the garden of forking paths: if you had seen other data, you could have done your analysis differently. And they were very indignant. And they said, 'How can you dismiss what we did based on your assumption'--that's me--how can I dismiss what they did based on my assumption about what they would have done, had the data been different? That seems super-unfair.

Russ Roberts: It does.

Andrew Gelman: Like, how is it that I come in from the outside? And the answer is that, if you report a p-value in your paper--a probability that a result would have been more extreme, had the data been generated at random--your p-value is literally a statement about what you would have done had the data been different. So the burden is on you. So, to get back to the person who bugs you at the cocktail party: if someone says, 'This is statistically significant, the p-value is less than .05; therefore, had the data been noise, it's very unlikely we would have seen this,' they are making a statement saying, 'Had the data looked different, we would have done the exact same analysis.' They are making a statement about what they would have done. So, the GoPro wouldn't even quite be enough. Because my take on it is people navigate their data. So, you see an interesting pattern in some data, and then you go test it. It's not--like, the thing with the assists, the letter 'A', whatever--maybe they never did that. However, had the data been different, maybe they would have looked at something different. They would have been able--

Russ Roberts: And someone--I didn't read this carefully--but someone did write a response to that article saying that it turned out that people with the letter 'O' struck out even more often. What do you do with that? Which is a different variation on that: all the possible things you could have looked at.

Andrew Gelman: Well, they also found that--my favorite was the lawyers--they looked at the number of lawyers named 'Laura' and the number of dentists named 'Dennis'. And there are about twice as many lawyers named 'Laura' and dentists named 'Dennis' as you would expect if the names were just at random. And I believe this. So, when I--

Russ Roberts: Twice as much! How could you--it's obviously not random!

Andrew Gelman: Well, no. Well, twice as much--well, yeah. Twice as much is first--

Russ Roberts: Twice as much as what?

Andrew Gelman: It's not as ridiculous as you might think. So, it goes like this: very few people are dentists. So, if, like, 1% of the people named 'Dennis' decide to become dentists, that will be enough to double the number of dentists named 'Dennis.' Because it's a rare career choice. So, in some way it's not the most implausible story in the world. It actually takes only a small number of people choosing their career based on their name to completely do this to the statistics. And I bought it. I was writing about it. But then someone pointed out that the names 'Laura' and 'Dennis' were actually quite popular many years ago--like, I guess, when we were kids or even before then. And when the study was done, the lawyers and dentists in the study were mostly middle-aged people. So, in fact, they hadn't corrected for the age distribution. So there was something that they hadn't thought of. It was an uncontrolled study. So, I bring up the ESP only because that's a case where it's pretty plausible that it was just noise. And then when you look carefully at what they did, it's pretty clear that they did just zillions of different analyses.
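A back-of-the-envelope version of the rare-career arithmetic, with made-up base rates chosen only to show the shape of the argument:

```python
# Illustrative (made-up) rates: dentistry is a rare career, so a tiny name effect can double it.
base_rate = 0.01    # assume 1% of comparable people become dentists
name_effect = 0.01  # assume an extra 1% of people named Dennis pick dentistry because of the name

rate_for_dennis = base_rate + name_effect
print(rate_for_dennis / base_rate)  # 2.0 -- the dentist rate among Dennises doubles
```

The same arithmetic is why the age-distribution objection matters: if 'Dennis' and 'Laura' were simply more common in the cohorts that middle-aged dentists and lawyers come from, that alone would produce the excess without anyone choosing a career for the name.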

32:20

Russ Roberts: So, I want to bring up the Priming example, because I want to make sure we get to it. And then I want to let you do some psychological analysis of my psyche. But the Priming example is, there was a very respected and incredibly highly cited paper that took a bunch of undergraduates, put them in a room, asked them to form sentences with 5 different words. And that really wasn't the experiment. The real experiment was watching what they did when they left the room. And the 5--one group got 5 words that were associated with the elderly, like 'Florida,' and 'bald' and 'wrinkly' and 'old'--not 'old' but 'subtle'. 'Subtle,' 'gray'.

Andrew Gelman: 'Tenured professor,' right, was there?

Russ Roberts: Right. Gray. I don't remember. And then the control group--the other group--got sort of regular words. And it turned out that the people who got the 'old' words, like 'Florida,' 'wrinkly,' 'old,' 'gray,' and 'bald'--they left the room more slowly, because they'd been primed to think about being old. Now, none of them, of course, asked for a walker. This is my bad joke about this kind of study. It's like, these are going to have to be somewhat subtle effects. And yet they found a statistically significant result that, when the replication attempt was tried, was not found to be successful. They could not replicate this result. And of course there was a big argument back and forth with the original authors about whether they did it right. But my view was always that this seemed silly to me. And your point about small samples--and these are very small samples, I think it's 30 or 50 undergraduates, where the speed of leaving the room is going to be highly noisy, meaning high standard error. So to find a statistically significant effect, you're going to have to find a big effect, which to me is implausible. But that's what they found. And then it didn't replicate.

Andrew Gelman: Oh, but it's worse than that. Because between the original study and the non-replication, there were maybe 300 papers that cited the original paper--what were called 'conceptual replications,' which appeared to replicate it. So, someone would do a new study and they would test something slightly different--a different set of words, different conditions--and find a different pattern. Like, maybe there was another study where you'd use a certain kind of word and people would end up walking faster, not slower.

Russ Roberts: That seemed to confirm the original result. Overwhelmingly. Because, as you say, hundreds of studies found the existence of priming once they knew to look for it. Why isn't that true? Why isn't priming real?

Andrew Gelman: Right. That's the problem. It's that--well, of course, priming is real. Everything is real, at that level. It's that it varies. So, if you give somebody a couple of words, for a lot of people it won't prime them at all; but depending on who you are, it might really tick you off; it might remind you that you have to go to the bathroom and you have to walk faster. There are all sorts of things it can do. The effect is highly variable. In fact, the concept of 'the effect' is kind of meaningless. Because it's the nature of these kinds of indirect stimuli to do different things to different people in different scenarios. So, I think part of the problem is the theoretical framework. They have a sort of button-pushing, take-a-pill model of the world. So, this idea that, oh, you push the button and it makes people walk a lot slower--that's very naive. Just treating it as a psychological theory, it's naive. But it's a self-contained system. You do a study, and it's possible to get statistical significance through forking paths. Do another study: if it shows the same thing as the first study, that's great. If it doesn't, you come up with a story for why that is; and you find a pattern in the data, itself statistically significant. That's a second study. This can go on forever. There's really no way of stopping it. The only way of stopping it, perhaps, is either through theoretical analyses of the sort that I do, to explain why statistical significance is not so meaningful, or by just brute-force running a replication. And running a replication is great. You can't very often do that in political science and econ, but when you can do it, it sort of shuts people up. For sure.

36:28

Russ Roberts: So, in this case, the dozens or hundreds of statistically significant results of priming didn't seem to be confirmed by the attempts to replicate them--replications where, as you point out, you have one choice: we're going to look at these kinds of words and see if people walk slower. As opposed to, 'Well, they walked faster. I guess that's because--' and it's statistically significant; or, 'I tried a different set of words,' or 'I tried a different group.' And so, somebody blogged on this recently. And Daniel Kahneman, Nobel Laureate, commented on the blog--and apparently it actually was him; there's always some uncertainty about whether he actually commented. Because he had a chapter in his book, Thinking, Fast and Slow, on priming. And he conceded--he had actually written about it a few years ago--but he conceded that these results are probably not reliable. And this was the shocking part for me--and we'll link to this, because I'm going to write about it; it's just stunning. He said, 'Well, I just assumed that, since they had survived the peer-review process, I have to accept them as scientific.' Besides the fact that he conceded that his chapter was probably not reliable, the fact that he also conceded that he had used their survival of the peer-review process as sufficient proof of their scientific merit was also stunning to me. Now, he's conceding--I think correctly--that peer review is not an infallible scientific barometer. Which is good--that's a good thing.

Andrew Gelman: It is a good thing. But it took us a while to realize that.

38:01

Russ Roberts: But that brings us to my problem. Which I want your help with. So, now what? I'm a skeptic. So, I tend to reject--I don't like psychology so much. I don't like these cutesy results that fill all these pop books by authors whose names we are not going to mention, books that use these really clever, bizarre results; but they are peer-reviewed and they are statistically significant. So, I tend to make fun of all of them, even before I've looked at the study. And that's not a good habit. And similarly, in economics, the idea that we can control for these factors and measure, say, 'The Multiplier'--to come back to 'The' Priming Effect--strikes me as foolish, silly, and unscientific. But don't I have a problem of going too far? Can I now--I'm tempted to reject all of these findings in economics, epidemiology, social psychology. Because I say, 'Oh, they are all p-hacking. It's the garden of the forking paths. It's the file-drawer bias.' Etc., etc. Almost none of them replicate. Or, do I say, 'Well, I'm going to keep an open mind. Some of them might replicate. Some of them might be true. And if so, how do I decide which ones?' Help me out, Doctor.

Andrew Gelman: Um, I think we have to move away from the idea of which are true and which are false. So, setting aside things like the ESP--I'm not an expert in that topic, but there are certainly a lot of people who take the reasonable view that there's nothing going on at all there--generally, I think there are effects. The early childhood intervention has effects on individual kids. And I think that priming people has effects. It's just that the consistent effect of something like priming is just the average of all the local effects. The same with early childhood intervention. The reported effect--the average treatment effect--of early childhood intervention is the average of all the effects on individual kids. It will be positive for some and negative for others. It's going to vary in size. So, I don't think you should put yourself in a position of having to decide, 'Does it work or not?' I think everything works; everything has an effect. Sometimes, with some of these things, maybe it doesn't matter. Like, the priming--what are you supposed to do with the priming? Maybe the priming will make a difference. So, if you are, for example, advising a company, and they are advertising: would a certain prime help them sell more of a product? Or, if you don't like the advertising example, you are working for a government agency, and you are trying to prime people to have better behavior, or trying to prime soldiers to be less afraid, or whatever it is. That's going to be a specific context. And I think you want to study it in that specific context. I don't think we're going to learn much from some literature in the psychology labs, seeing words on the screen.
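A minimal sketch of the 'average of local effects' point, with made-up numbers: individual effects can be large, variable, and of both signs, while the average is small and shifts from one context to another.

```python
# Sketch: an average treatment effect as the average of highly variable individual effects.
# All numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

# Suppose each person's response to a prime (or an intervention) is large but idiosyncratic.
context_a = rng.normal(loc=2.0, scale=30.0, size=10_000)    # individual effects in one setting
context_b = rng.normal(loc=-1.0, scale=30.0, size=10_000)   # individual effects in another setting

print("Average effect, context A:", context_a.mean())                            # small and positive
print("Average effect, context B:", context_b.mean())                            # small and negative
print("Share in A whose effect has the opposite sign:", np.mean(context_a < 0))  # close to half
```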

40:58

Russ Roberts: Well, let me take a slightly more important example. I'm not going to argue that what you just said is unimportant. But it's relatively unimportant. It would be scary if the government or corporations were secretly influencing us. I mean, an example of this would be the claim that theaters allegedly flashed 'Buy Coke' on the screen during movies--too quickly to consciously see it--and people supposedly rushed out at intermission and bought a lot of Coke without realizing that they'd seen these subliminal suggestions. And it turned out--I've seen the real, actual study of that--it really didn't work. But somehow that became this scary thing. And if it were true, it would be scary. But let's take the minimum wage. Does an increase in the minimum wage affect employment? Job opportunities for [?] there's a lot of smart people on both sides of this issue who disagree. And who have empirical work showing that they're right and you're wrong; and each side feels smug: that its studies are the good studies. And I reject your claim that I have to accept that it's true or not true. I mean, I'm not sure--which way do I go there? I don't know what to do.

Andrew Gelman: Well, I think--

Russ Roberts: Well, excuse me: I do know what to do. Which is, I'm going to rely on something other than the latest statistical analysis. Because I know it's noisy and full of problems, and has probably been p-hacked. I'm going to rely on basic economic logic--the incentives that I've seen work over and over and over again. And my level of empirical evidence that the minimum wage isn't good for low-income people is the fact that firms ship jobs overseas to save money; they put in automation to save money. And I assume that when you put in the minimum wage they are going to find ways to save money there, too. So, it's not a made-up religious view. I have evidence for it. But it's not statistical. So, what do I do there?

Andrew Gelman: Okay. I'd rather not talk too much about the minimum wage, because it requires a lot of technical knowledge that I'm not an expert on. The last time I took economics was in 11th grade. I did get an A in the class, but still, I wouldn't say I'm an expert on the minimum wage. But let's talk a little bit about that. The first thing is that I do think that having a minimum wage policy would have effects on a lot of people. And it will help some people and hurt other people. So, I think the hypothesis that the minimum wage has zero effect is kind of silly. Of course it's going to have an effect. And obviously there are going to be people who are going to get paid more, and other people who aren't going to get hired. So, part of it is just quantitative: How big is the effect going to be? Who is it going to be helping and hurting? The other thing--I agree with you completely about the role of theory, that your theory has to help you understand it. I think it's possible to fill in the gaps a little bit. So, you have a theory about what firms will do in response to the minimum wage, and you have evidence based on how firms have responded to the minimum wage in the past. But you can argue quite reasonably that the number of minimum wage changes is fairly small and idiosyncratic. So you [?] how firms have responded to many other stressors in the past. And so, you have a theory. I think one could build a statistical model incorporating those data into your theory. So, you'd have a model that says stressors have different effects; you'd characterize the stressors--you'd have some that are somehow more similar to the minimum wage, like regulatory stressors, versus other things, which are economic rather than political: when prices of raw materials change, and so forth. It should be possible to fill in the steps: to connect the theory to the empirics and ultimately [?] make a decision. Let me talk about early childhood intervention, though, instead. Not that I know anything about that, either; but that's an area where our theory is weaker.

Russ Roberts: Maybe. I don't know.

Andrew Gelman: Okay. So, if we talk about theory of early childhood intervention, there's two theories out there. One theory is that this should help because there are certain deficits that kids have and you are directly targeting them. There is another theory that says most things won't help much because people are already doing their best. Right? So those are, sort of, in some sense those are your two theories to get things started.

Russ Roberts: There's another one: Nature is stronger than nurture, so it doesn't really matter. Nurture is over-rated. You know, that's another theory.

Andrew Gelman: Sure. Indeed. So there's another theory that these things won't have such large effects; that the deficits that people have are symptoms, not causes; and so reducing these deficits might not solve the problem. I mean, for that matter, it's not even nature versus nurture: it's individual versus group. So, if Jamaica is a poor country, it could be their environment. And so changing some aspect--so, sure. So, basically we have a bunch of theories going on. And, if you want to understand them, you probably have to get a little closer to the theory in terms of what's measured. Now, in the meantime, you have decisions to make. So, this study that was done was in a more traditional--like, take a pill, push a button. Like, the experts came up with an intervention. One way to frame this--I think it's very easy to be dismissive of experts. But one way to frame this in a positive way, I believe, is: Imagine that this wasn't an experiment. Imagine that there was just a fixed budget for early childhood intervention--like the government was going to spend x on it, and that was what the people wanted. If you have a fixed budget, you might as well do the best job. And of course you should talk to education experts, and economics experts, and so forth. I wouldn't want just some non-experts to make it up. Like, experts have flaws, but presumably people who know about curricula and child development could do a better job. I would assume.

Russ Roberts: I'm going to let you assume that, but I'm going to also argue that the evidence for that is very weak. But, go ahead.

Andrew Gelman: Okay. Well, let me just say that--let me say that I doubt that they are going to be worse. Assuming that they don't have, if they don't have motivation in[?]--

Russ Roberts: There's fads. There's group-think.

Andrew Gelman: Sure. I do[?] take that. I'll accept that. Let me put it another way. I'll accept your point; and let me just step back and forget about who is doing it. Suppose you are doing it. Or suppose some group is doing it. Suppose there is a mandate to do some level of early childhood intervention, just as there's a mandate to have public education in this country, and so forth. Somehow, you want to do a better job rather than a worse job. So, however that's done, some approach is chosen. And so, you are going to do that. Now, then there's a question of how effective this thing is going to be. And here, this just gets back to the statistics. So, with 130 kids, it's going to be hard to detect much, because the error is so high. The variation is so high. So, it's going to be hard to use a study like this to make a decision. And one of the problems is that we're conditioned to think that the point of social science is to get these definitive studies, these definitive experiments. And we're going to prove that the drug works, or that the treatment works. And that's kind of a mistake. Partly because of the small sample. But not even just that. It's also because conditions change. What worked in Jamaica 25 years ago might not work in Jamaica now, let alone the United States right now. So, there is no substitute for theory. And I think there is no substitute for observational data. So, economists use a lot of observational data. There are millions of kids who go to school, and millions of kids who do preschool. So, in some sense, you have to do that: you have to do the observational analysis. You have to have theory. Ultimately, decisions have to be made. I agree with your skepticism about your own skepticism. That is, saying, 'I don't trust this 42%,' doesn't mean you can say, 'I believe it's 0.' [?] You don't know. And so you have to kind of triangulate. And one thing I like to say is that research should be more real-world-like, and the real world should be more research-like. And economists are moving into this. So, more and more people are doing field experiments rather than lab experiments; they are doing big studies. So they are trying to make research more realistic. But the flip side is that these studies are small. And they have this noise problem. And it's taken people a while to realize this. So, a lot of people felt that if you have a field experiment, you have identification, because it's an experiment; and you have external validity, because it's in the field. Therefore you have a win. But actually, that's not the case. If it's too noisy a study, and too small a study, the identification and the generalizability aren't enough. So, the flip side is, if people are doing policy, they need to have good records. The organization should be keeping track of which kids are getting preschool and which kids aren't. And how they are doing. And future statisticians, economists, and sociologists should study these data. And, yes, they are going to have arguments, just like they have arguments about the minimum wage. But I think you have to kind of do your best on that.

50:31

Russ Roberts: So, let me try a different approach. I'm not proud of this; but I'm going to push it and see whether I can sell you at all on it, okay? So, when you talk about field experiments, I'm thinking about the deworming literature, which generated a lot of enthusiasm for deworming. And there was a huge encouragement to give to charities that deworm poor children in Africa, because of a field experiment that found that the children did much better in all kinds of dimensions. And when they did it on a very large scale, it didn't work so well. Now, there's pushback from the people who did the first study; and I don't know where we are on that. As you point out: I think deworming is probably better than not deworming. The magnitude is what's at issue here, and the variation across individuals. And of course we had a guest on EconTalk who said we have too sterile an environment and that's leading to many autoimmune problems. So, even the question of whether it's good or not is maybe a little bit up in the air. But here's the way I see it--and I don't like this way of seeing it, but I find myself increasingly drawn to this perspective. You are arguing, 'Well, you've got to be more honest; you've got to look at a bigger sample; you've got to be more thorough; you've got to keep better records. You've got to be more skeptical; you've got to not oversell. You've got to be aware of the biases.' And I agree with all of that, 100%. But isn't it possible that the study of statistics--the way it's taught in a Ph.D. program in statistics and the way it's taught in economics and econometrics--is just giving you a cudgel, a club, a stick, with which to beat your intellectually inferior opponents? And all it really comes down to is ideology and gut instinct? So, when you tell me about priming and I say, 'That [?] strike me as plausible'--and I'm right: hey, my biases, my gut, turned out to be better than the statistical analysis. And you tell me about the minimum wage, and we go back and forth with all these incredibly complex statistical analyses, and maybe something will come to a consensus--I have no idea. But I'm tempted to just rely on my gut feeling, and to be honest about it. As opposed to pretending, as most--I fear--young economists do now: 'Oh, I just listen to the data. I don't have any preconceptions. I see what the data tell me.' And I find that to be dangerous, to be honest. And I'd almost rather live in a world where people said, 'I'm not going to pretend that my opinion is scientifically based, because there's not much science. There are a lot of pages in the appendix; there are a lot of Greek letters. But the truth is, it's mostly just my gut with a few facts.'

Andrew Gelman: I think it depends on the context. I have certainly worked on a lot of problems where people change their views based on the data. And there's always new data coming. So, we estimated the effects of redistricting--you know, gerrymandering. We did a paper in 1994 which was based on data from the 1960s and 1970s, I think, or maybe the 1970s and 1980s--I'm not remembering; I think it was the 1960s and 1970s. Anyway, we looked at a bunch of redistrictings, and we found that the effect of redistricting was largely to make elections more competitive, not less competitive. Now, we found that in the data, and that changed how people viewed things. Now, is that still the case? Maybe not. Redistricting has become much more advanced than it used to be. You can gerrymander like you couldn't gerrymander before. So, our conclusions were time-bound. But, in doing that, we learned something new. We did an analysis of decision-making for radon gas--radon in your house, which can give you cancer--and using a sort of technocratic approach, a statistical approach, we found that targeted measurement and intervention, if applied nationally, could save billions of dollars without losing any lives. I do work in toxicology and pharmacology, where we have fairly specific models. I don't use statistical significance to make these decisions. So, when I do these analyses, we do use prior information. And we're very explicit about it. But it's information. So I liked what you said about the minimum wage: you said you have a theory as to why it's counterproductive, and you feel you have data. And with care, they could be put together, and put into a larger model. And, sure, there are going to be political debates. I'm not denying that. But I think there's a lot of room between, on the one hand, things that are so politicized--

Russ Roberts: and complex.

Andrew Gelman: Well, not even that complex. But on the one hand, some things are so politicized it's going to be very hard for some people to judge, and you have to sort of rely on the political process. And on the other extreme--maybe the other extreme would be studies that are purely data-based, looking at statistical significance, like this ESP study, that are just wrong. I think there's a lot of room in between. The fact that we're not going to use science to solve the issue of abortion, or whatever--or that maybe even the minimum wage will be difficult--I don't think that means that science or social science are useless. I think it is part of the political process. I mean, you might as well say that public health is useless because some people aren't going to quit smoking. Well, on the margin it could still make a difference, right? And there are a lot of things that are maybe easier--easier for people to change.

Russ Roberts: The problem with that argument--it's a good point. I take the point. It's a great point. Here's the problem with it: there are all these errors on the other side, where we do something that's actually counterproductive--we are encouraging people to smoke, say, because we've got empirical, so-called statistical, scientific studies that show that x is good or y is bad--

Andrew Gelman: Well, but that's why we should do better statistics. I think that--I don't think--right. Someone wrote a paper where they looked at--well, I mean like, there's lots of papers like cancer cure-of-the-week; everything causes cancer--

Russ Roberts: Right.

Andrew Gelman: everything cures cancer; everything prevents cancer. Right. And someone did a study and they found there had been published papers claiming that a lot of food ingredients both cause cancer and cure cancer. You know, who knows? Maybe they do. Whatever. So, I think we do need to have better statistical analyses. I think that people have to move away from statistical significance. I think that's misleading. People have to have an understanding that when they do a noisy study and they get a large effect, that's not as meaningful as they think. But, within the context of doing that, I do feel that we've learned. At least, I feel like I have learned from statistical analyses--I've learned things that I couldn't have otherwise learned. Look at baseball. Look at Bill James. He wasn't doing tests of significance[?]--

Russ Roberts: I think about him all the time--

Andrew Gelman: Bill James, like, he learned a lot from data,--

Russ Roberts: 100% correct.

Andrew Gelman: from a combination of data and theory--replicating, going back and checking on new data. I think if Bill James had been operating based on the principles of standard statistical methods in psychology, he would have discovered a lot less. So, I'd like to move towards a Bill James world, even if that means that there are still going to be places where people are making bad decisions.

Russ Roberts: So, for people who don't know: Bill James wrote the Baseball Abstract for a number of years and has written many books using data to analyze baseball. And he's considered the founder of the Sabermetrics movement, which is the application of statistics to baseball--as opposed to people following their gut. And, as it turns out, I'm an enormous Bill James fan. And I'm an enormous believer that data in baseball is more reliable than the naked eye watching, say, a player over even 30 or 40 games. In a game like baseball, where the effects are actually quite small, it's important to use statistical analysis.

58:54

Russ Roberts: I think the challenge is that baseball is very different from, say, the economy. Or the human body. Baseball is a controlled environment: you can actually measure pretty accurately, either through simulation or through actual data analysis, say, whether trying to steal a base is a good idea. There is selection bias. There are issues--it's not 100% straightforward. You can still do it badly. But you can actually learn about the probability of a stolen base leading to a run, and actually get pretty good at measuring that. What I think we can't measure is the probability of a billion-dollar, or trillion-dollar, stimulus package helping us recover from a recession. That's what I'm a little more skeptical about. Well, actually, a lot more.

Andrew Gelman: No, those things are inherently much more theory-based. I mean, the baseball analogy would be, Bill James has suggested, like, reorganizing baseball in various ways.

Russ Roberts: Right. Great example.

Andrew Gelman: So, if you were to change [?] well, who's to say? But, again, okay, sure. I'm not going to somehow defend it if someone says, 'Well, I have this statistically significant result, therefore you should do this in the economy.' But I think that there are a lot of intermediate steps. I've done a lot of work in political science, which is not as controlled as baseball. And it is true: the more controlled the environment is, the more you can learn. It's easier to study U.S. Presidential elections than it is to study primary elections. The general election is easier to study than the primary election, because the general election is controlled and the primary election is uncontrolled. Pretty much. So, the principle still applies. And I think you are right. But there is a lot of--

Russ Roberts: Yeah, I don't--don't misunderstand--

Andrew Gelman: back [?] social and biological world that have enough regularity that it seems like we can study them.

1:00:43

Russ Roberts: Oh, I agree. And I agree with you 100%. Don't misunderstand me. I don't want my listeners to misunderstand me. And I think there's a temptation, when you hear people like me, to say, 'I'm a scientist. I think we can do better. And you sound like you are not a scientist. You don't think these methods help at all. And you just want to use your gut.' And that, of course, is a bad idea also. So, obviously, I'm really pushing for nuance, away from the extreme. And the extreme is the one we're pretty much in. Which is: there's an enormous number of people--I don't know, tens of thousands of economists, political scientists, psychologists, epidemiologists, and others--who sit around all day and analyze data. An enormous proportion of that work is both inaccurate and misleading. And the system that we currently use to decide who is doing "good work," "good science," is extremely flawed. So the lesson for me is high levels of skepticism about the "latest paper" that shows that broccoli cures cancer or causes it. And yet we are not so good at that. So, I want people like you to remind--not journalists, but practitioners--that what they are doing is sometimes really not consistent with the proper code of conduct that they want to live by. And I think the Replication Project that Nosek and others are working on is God's work. It's a phenomenally important thing to help improve this underlying process. But the fact is that the underlying incentives of the system are just not so conducive to truth. And I think the more we're aware of that, the more careful and correctly cautious we should be.

Andrew Gelman: I agree. And I would also say that, when you say that what we're doing as social scientists or epidemiologists or whatever isn't always what we'd want to do--you could say there are sort of two directions. One thing that people are very familiar with is the idea of moving towards the ideal. So, 'Oh, if the p-value assumes that you've pre-registered your analysis, then do a pre-registered analysis.' Right? Or, if this assumes that thing, then make sure the assumptions are correct. So, in economics, you have: get your identification. But there's another direction, which is to move the mountain to you: which is to develop statistical methods that are actually closer to how we can actually learn. And so a lot of my work--my applied work--is not about saying, 'Well, I want to get my p-values correct, so let me make sure that I follow all the rules.' It's the opposite. It's: let's use statistical methods that allow me to integrate theory with data more effectively, and more openly.

Russ Roberts: That's a fantastic distinction. And I just want to mention, you recently blogged about this--I think. I was kind of stunned when I read this--you are bringing out, I think, a new edition of your book with Jennifer Hill, which is called Data Analysis Using Regression and Multilevel/Hierarchical Models--and you said, I think, that you are not going to use the phrase 'statistical significance' in that new edition. Is that correct?

Andrew Gelman: Yeah. Jennifer made me take it out. I mean, she was right. Like, we had stuff like, 'Here's the result, and it's statistically significant.' And, I would say our book was quite moderate in how we used it. Like, we always emphasize that statistical significance doesn't mean it's true; it's just a description of the data. But she, Jennifer, convinced me that that's kind of silly. Like, why is it a description of the data? Why are we doing this? We shouldn't be. And so, we are removing all of those. That's right.

Russ Roberts: So, I'm just going to close with that--which stuns me. And let me remind listeners who are not trained in statistics that significance in statistics--as we made clear at the very beginning, though maybe not clearly enough--is a very formal word. It's not what it means in the English language, where it means important or relevant or noteworthy. But I want to close with an observation of Adam Cifu, another former EconTalk guest, who, in a co-authored book called Ending Medical Reversal, made the observation that many, many, many multivariate analyses and studies which show statistical significance of some new technique or new device, when then put into a randomized controlled trial, do not hold up. It's another version of this kind of failure to replicate--a particular version of it in the medical field. Extremely important, because it involves life and death, and an incredibly large amount of money. But mostly, life and death is why it's important. And when you see that so many of these results don't hold up when you actually can control for them--what does that tell you about the value of multivariate regression in these complex systems? Like the body. Like the economy. Etc. Do you want to say that we should be more nuanced? What are you going to say in the book, without statistical significance? What's going to be your guidepost, or the way a person should trust these kinds of results?

Andrew Gelman: Well, I don't do a lot of economics. When I do stuff on the human body, it's often pharmacology. We work with pharmaceutical companies. And there, we are using non-linear models. It's not so much that we run a regression. We usually have a theory-based [?] model of what's happening within the human body. Although we sometimes have regressions as part of those models. Like, you have a parameter that varies by person, and you might have a regression model on how that parameter varies by age.

Russ Roberts: sector[?], race--

Andrew Gelman: Yeah. Etc. So, the story is that the regression model functions as a kind of smoother. So, I have data on a bunch of people, and these parameters are kind of hard to estimate. So, I partially pool them towards this regression model. And the more accurate the regression model is, the more effective that sort of thing will be. So, you are talking about something a little different, which is to say, 'Well, I have a causal question: what's the effect of something?' We want to control for things by regressing. You might want to get my collaborator, Jennifer Hill, on the phone to talk about that. She's much more of an expert on that particular topic.
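A minimal sketch of the partial-pooling idea, using a toy precision-weighted shrinkage rule rather than the nonlinear pharmacology models Gelman actually works with; the age trend, the noise levels, and the between-person spread are all made-up assumptions:

```python
# Sketch: partially pooling noisy per-person parameter estimates toward a regression on age.
# Illustrative only; not the actual model Gelman describes.
import numpy as np

rng = np.random.default_rng(2)

n = 8
age = rng.uniform(20, 70, size=n)
true_param = 1.0 + 0.02 * age + rng.normal(0.0, 0.1, size=n)  # assumed person-level truth
se_person = rng.uniform(0.1, 0.5, size=n)                     # how noisy each person's estimate is
estimate = true_param + rng.normal(0.0, se_person)            # noisy per-person estimates

# Group-level regression: predict the parameter from age.
X = np.column_stack([np.ones(n), age])
beta, *_ = np.linalg.lstsq(X, estimate, rcond=None)
regression_pred = X @ beta

# Precision-weighted shrinkage: noisier estimates get pulled harder toward the regression line.
tau = 0.15  # assumed between-person spread around the regression line
weight_on_data = (1 / se_person**2) / (1 / se_person**2 + 1 / tau**2)
partially_pooled = weight_on_data * estimate + (1 - weight_on_data) * regression_pred

for raw, pooled, s in zip(estimate, partially_pooled, se_person):
    print(f"raw {raw:5.2f}  pooled {pooled:5.2f}  (person-level se {s:.2f})")
```

The "smoother" behavior falls out of the weights: the more accurate the regression prediction (a smaller tau relative to the person-level noise), the more each person's estimate is pulled toward it.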