Philip Tetlock on Superforecasting

Dec 19 2015

Can you predict the future? Or at least gauge the probability of political or economic events in the near future? Philip Tetlock of the University of Pennsylvania and author of Superforecasting talks with EconTalk host Russ Roberts about his work on assessing probabilities with teams of thoughtful amateurs. Tetlock finds that teams of amateurs trained in gathering information and thinking about it systematically outperformed experts in assigning probabilities of various events in a competition organized by IARPA, research agency under the Director of National Intelligence. In this conversation, Tetlock discusses the meaning, reliability, and usefulness of trying to assign probabilities to one-time events.

LISTEN NOW:

Comment

●

READ TRANSCRIPT

●

DELVE DEEPER

DOWNLOAD

Time	Podcast Episode Highlights
0:33	Intro. [Recording date: November 30, 2015.] Russ: You start with a lot of criticisms, throughout the book I'd say, you have a lot of criticisms of pundits. Some of those have Ph.D.'s and some are journalists and some are just so-called experts, who make predictions. But it turns out a lot of those, you can't really hold their feet to the fire when it comes time to judge whether their predictions are accurate or not, are they good forecasters or not. And why is that? What's the challenge with our sort of day-to-day world where people claim that something is going to happen and then print it in the newspaper? Guest: Well, the pundits of whom you say we're critical, you are probably thinking of people like Tom Friedman or Niall Ferguson, people on the left or people on the right, we identify all sorts. They are all pretty uniformly very smart people. They are very articulate; they are very knowledgeable. They offer, make many observations about world politics and economics that seem very insightful. It is extremely difficult, however, to gauge the degree to which their assessments of possible futures, the consequences of going down one policy path or another, are correct or incorrect because they rely almost exclusively on what we call vague verbiage forecasting--they don't say that there's a 20% likelihood of something happening or an 80% likelihood of something happening. They say things like, 'Well, there's a distinct possibility that there will be global deflation in 2016.' Now, when you ask people what ' distinct possibility' could mean, it could mean anything from about 20% to 80% probability, depending on the mood they are in when they are listening. Russ: I didn't mean to suggest you are critical of them, although you sometimes are. But, you are critical of our culture that takes these vague pronouncements and then there's a gotcha game that gets played by people on the other side. But of course there's always a way to weasel out of it because there's usually some hedging in that verbiage. Correct? Guest: Well, that's right. If you exist in a blame-game culture in which people are going to pounce on you whenever you make an explicit fallibility[?] judgment that appears to be on the wrong side of 'maybe,' it's pretty rational to retreat into vague verbiage. So, we talk in the book about a brilliant journalist, a New York Times journalist, David Leonhardt, who created the Upshot, a quantitative column in the New York Times; and he wrote a piece back in, I guess it was 2011 or 2012, when the Supreme Court narrowly upheld Obamacare by a 5-4 margin. And prediction markets had been putting a 75% probability on the law being overturned. And David Leonhardt, who doesn't have any grudge against prediction markets as far as I know, concluded that the prediction markets got it wrong. Now, that's a harsh judgment on the prediction markets because they make hundreds of predictions on hundreds of different issues over years, and they are not bad. When they say there's a 75% likelihood of something happening, it's pretty close to a 75% likelihood. Which means that 25% of the time it doesn't happen. So if you are going to throw out a very well-calibrated forecasting system every time it's on the wrong side of 'maybe,' you are not going to have any well-calibrated forecasting systems at your disposal. Russ: I would say that's a second problem, really, which is that even when you do quantify your prediction, by definition you are allowing the possibility that it doesn't happen. And then the question is: How do you assess the accuracy or judgment of the person who makes a statement like that? Guest: Yes. Exactly. And that requires some understanding of probability. And some patience and some willingness to look at track records over time.
4:57	Russ: So, let's begin with your particular track record. You've done a lot of research in this area, on this question of whether prediction is possible, how accurate is it, are experts good at forecasting? Talk about your background. We're going to get to the tournament that's at the heart of your book, but I want to start with your research history and what you found in the past and how people reacted to it. Guest: Well, I guess that's another way of asking just exactly how old must I be. Because I've been doing longitudinal forecasting tournaments for a long time. So, let's just put on the table: I'm 61 years old and I got started at this right after I got tenure at the U. of California at Berkeley. And I was a little more than 30 years old--it was 1984. And the Soviet Union still existed; Gorbachev had yet to become General Secretary of the Communist Party of the Soviet Union. And we did our initial pile of studies back in the mid-1980s when people, hawks and doves, were arguing about the best ways of dealing with the Soviet Union. And now we're doing forecasting tournaments as hawks and doves are arguing about the best ways of dealing with the Iranian nuclear program. Or for that matter, for dealing with Russia and the Ukraine. So, we've been running forecasting tournaments off and on for 30-plus years. The first big set of forecasting tournaments were done in the late 1980s and the early 1990s and were reported in a book, Expert Political Judgment, that came out in 2005. And the second wave of forecasting tournaments were much larger, involving many thousands of forecasters, a million plus forecasts, and were sponsored by the U.S. intelligence community. And they ran from 2011 to 2015 and in fact they are still running. So if your readers are interested in signing up for an ongoing forecasting tournament they should[?] visiting the website at gjopen.com. Russ: Going back to the earlier work that you did and before the fall of the Soviet Union, what were some of the main empirical takeaways from that work? Guest: Well, one big takeaway was that--liberals and conservatives had very different policy prescriptions, and they had very different conditional forecasts about what would happen if you went down one policy path or another. And that nobody really came close to predicting the Gorbachev phenomenon. Nobody, for that matter, came really close to predicting the disintegration of the Soviet Union later on. But everyone after the fact seemed to have an explanation that either appropriated credit or deflected blame. Russ: And it was consistent with their worldview, I'm sure. Guest: And meshed perfectly with their prior worldview. So, it was as though we were in an outcome-irrelevant learning situation. It didn't really matter what happened--people would be in an excellent position to interpret what happened as consistent with their prior views. And the idea of forecasting tournaments was to make it easier for people to remember their past states of ignorance. Russ: This is an aside of sorts, but it's just a wonderful insight into human nature, and it's a theme here at EconTalk. Which is: When you went back and asked people to give what they remember as their probability of, say, the Soviet Union falling, what did they say? Guest: Well, they certainly thought they assigned a higher probability to the dissolution of the Soviet Union than they did. And there were a few people who assigned really low probabilities who remember assigning higher than a 50% probability. So, people really pumped up those probabilities retrospectively. So, the psychologists call that 'hindsight bias' or the 'I knew it all along' effect. And we saw that in spades in the Soviet Forecasting Tournament. Russ: Yeah. I think that's an incredibly important thing that we all tend to do. We tend to think we had much more vision than we actually had. And we usually don't write those things down. You happen to have written some of them down. So that was awkward, that they actually had their original forecasts. But most of us, the I-knew-it-all-along problem is a bigger problem for most of us because we don't write it down. Guest: Well, we truly remember it differently. Even if you think the person on the other side of the table knows what the correct answer is, you still tend to misremember it.
9:30	Russ: So, this more recent tournament was rather remarkable. Give us the background of who competed and your role in it and how it was set up. And what some of the questions, for example, were that people were competing on. Guest: Sure. This was work I did jointly with my wife, research collaborator, Barb Mellers, and we were faculty then at the U. of California, Berkeley. And we didn't leave for the U. of Pennsylvania till about 2010. But we were visited by three people from the Office of the Director of National Intelligence when we were at Berkeley, I guess late in 2009. And at least two of them were quite enthusiastic about the idea of the U.S. intelligence community using some of the techniques that were employed in my earlier work, Expert Political Judgment, for keeping score on the accuracy of intelligence analysts' judgments. And that was the core idea behind what became known as the IARPA (Intelligence Advanced Research Projects Activity) Forecasting Tournaments. IARPA is the research and development branch of the Office of the Director of National Intelligence. Which is the umbrella organization over all intelligence agencies like CIA (Central Intelligence Agency) and DIA (Defense Intelligence Agency) and [?] and so forth. And all 16 of them. And the idea would be they would have a competition; and major universities and consulting operations would apply for large contracts to assemble teams whose purpose would be to assign the most realistic probability estimates to possible futures that the U.S. intelligence community deemed to be of national security relevance. So, those turned out to be questions on everything from Sino-Japanese clashes in the East China Sea to, recently, the Eurozone and Spanish bond yield spreads to Russian relations with the [?] Estonia, Ukraine, Georgia. Of course, conflicts in the Middle East; Ebola; H1N1 (flu strain) issues. Just an enormous range of issues. 500 questions over about 4 years. And the goal would be on each of the research operations would be to come up with the best possible ways of assigning probability estimates. Now, they screened everybody for their academic bona fides: they wanted to make sure that everybody was legit; they weren't using Ouija boards or anything like that. But the other thing-- Russ: That would be cheating. Guest: Now the U.S. intelligence community was simply interested in who could generate the most accurate probability estimates for these extremely diverse questions. And they didn't really care whether we took a more psychological approach or more statistical approach or a composite approach. What they cared about was accuracy. And that was it. Accuracy, accuracy, accuracy. So, we--our group, my wife and I put together this group called the Good Judgment Group, which is an interdisciplinary consortium of wonderful scholars. And we went out about--we tried to recruit good forecasters to--and we tried to give them the best possible training in principles of probabilistic reasoning; and we assembled some of them into teams, and we gave them guidance on how teams can work effectively together. And we put some of them into prediction markets and we wanted to see how well prediction markets would work. We experimented with a lot of different approaches. And we also had really good statisticians who experimented with different ways of distilling wisdom from crowds. So, our approach was very experimental. I think some of the other approaches were experimental as well. But our experiments worked out better than their experiments, so we won the tournament by pretty resounding margins in the first two years. Sufficiently resounding that the U.S. intelligence community decided to funnel the remaining money into one big group, which would be the Good Judgment Project, which could hire the best researchers from other teams. Russ: Who were you competing against? Guest: Well, we originally were competing--different competitive benchmarks here. Originally we were competing against the other institutions that received contracts from the government, like, oh, gosh--MIT (Massachusetts Institute of Technology) and the U. of Michigan and George Mason U., places like that. Then later, we were competing against a prediction market that we ourselves were running, a firm known as Inkling; and also against internal benchmarks--U.S. intelligence analysts themselves generating probability estimates and competing against them, although that was classified because of course the U.S. intelligence analysts were classified. But David Ignatius at the Washington Post leaked some of that information, again, I think the second year or third year. Russ: But after two years, your team trounced everybody. And then what happened going forward after that? Guest: Well, we were able to absorb resources from the other teams, because the government was obviously saving a lot of money by suspending the funding of the other teams; and we were able to consolidate some resources. And we were able to compete all the more aggressively against the other remaining benchmarks. The key benchmarks for us to beat were an external benchmark--the prediction market run by Inkling and the more confidential one inside the U.S. government.
14:59	Russ: Now, you mentioned--this is just--well, actually, I'm going to read a quote from the book, which I loved, which is relevant, which is from Galen, the early physician. And what time period did Galen live? Roughly? Guest: I guess he is a second century after Christ--it was roughly 2000 years ago. Russ: Okay. I thought he was later than that. So, he wrote a long time ago, and you write the following: that he wasn't into experiments, and you wrote the following. Here's the quote: Galen was untroubled by doubt. Each outcome confirmed he was right no matter how equivocal the evidence might look to someone less wise than the master. [And here's Galen's quote] "All who drink of this treatment recover in a short time, except those whom it does not help, who all die," he wrote. "It is obvious, therefore, that it fails only in incurable cases." So, what could be better than that? I mean, that's phenomenal; and I think that's where you apply the quote--even the pundit who puts a numerical value on a certain event happening as a 63.7% chance that this will happen, whether it happens or not, if it does happen, he says, 'See, I told you it was a 63.7%' and if it doesn't happen he can say, 'Well, I said there was a 36.3% chance that it wouldn't happen. So when it didn't happen, I'm still right.' So, the question then becomes, when you say you trounced the other teams, there has to be a way to evaluate probabilities; and in the book you present the Brier scores. So, try to give us the flavor of how you measure success in prediction. Guest: Oh, that's an excellent point. It really isn't possible to measure the accuracy of probability judgment of an individual event, unless the person, forecaster, is rash enough to assign a probability of zero and it happens with a probability of 1.0 and it doesn't happen. Otherwise the forecaster can always argue that something improbable happened. So, assessing the accuracy of individual events is impossible, except in those limiting cases. But it is possible to assess the accuracy of many events and across many time periods. So, good judgment in world politics means you are better than other people at assigning higher probabilities to things that happen than things that don't happen across many events, many time periods. Russ: So, the example would be--let's take a particular example. We are going to try to forecast the probability of Greece leaving the Eurozone. So, I say it's .51 and you say it's-- Guest: I say it's-- Russ: Excuse me. Let's do it the other way. Guest: I say it's .15. Russ: No, let's go the other way. Guest: Okay. Russ: I'm going to go .49, because I think it's not likely. Because it's below .5. I say .49 and you say .1, and it doesn't happen. So my argument is that you did a better job than I did. Guest: You don't know that for sure with respect to Grexit [Greek exit]. That's correct. You do know it probabilistically across the full range of questions posed in the [?] tournament. Now, insofar as you've been predicting .49 consistently over several years, and I've been predicting .1, and it doesn't happen, you might be tempted to draw the conclusion even with respect to Grexit that I've been closer to the truth. Russ: You might be. One of the things I've found troubling about the setup and the way of assessing good judgment--and one of the things your book makes one ponder--is just how hard it is to assess whether someone has good judgment. Guest: That's absolutely true. I couldn't agree more. It's a very difficult concept operation-wise. Russ: Yeah. So, this particular way--even though--so let's take this case. Let's say there's 10 things where I tended to predict .45 and you predicted .1, and none of them happen; so that we're both "right" in that we both thought it was below a half--it was less likely. But you were more right than I was, because--because what? And here's what I want you to respond to. It seems to me you could argue you just have more confidence than I did. You were more strategic in how you picked your number. You didn't have any more accurate knowledge of the actual probability. Guest: Well, how many times did you have to flip that coin before you decided that the person who claims the coin is biased is closer to correct than the person who claims the coin is very close to equilibrium? Russ: Well, that's a challenging question. While I was reading the book, I thought of Bill Miller, of Legg Mason. So, Bill Miller beat the S&P500 I think for at least 15 years in a row, maybe more. And a lot of people concluded him to be a genius, because, well, he beat the S&P500. One year, not so impressive. But 15 years--that's so unlikely. But of course we know that doesn't prove he's a genius; it doesn't even prove he's smart. It might merely mean he was lucky. Out of the thousands and tens of thousands of managers of mutual funds, he was the one who happened to beat the S&P500 15 years in a row; and we know that over enough time and enough managers, that's going to happen. And so we know nothing about his ability going forward. And in fact, he didn't do particularly well after his streak was broken. Did he get less smart? Did he get overconfident? We have no way of knowing. So, I find myself--even though I found many things in the book that are useful, and thinking thoughtfully about looking into the future, the fundamental measurement technique strikes me as a challenge. What do you say to that? Guest: I think that is a great question. I say, really, really deep, question. People in Finance argue, of course, about whether there is such a thing as good judgment--if you are a really strong believer in the Efficient Markets Hypothesis, you are going to be very skeptical. If you toss enough coins enough times, a few of them are bound to wind up heads 60, 70, 80 times in a row. If you just keep doing that. And there are skeptics who argue that Bill Miller--or for that matter, Warren Buffett or George Soros--were just one of those lucky sequences of coin flips. And then we anoint them geniuses. We are very sensitive to the possibility that superforecasters could be super-lucky. And we are always open to the possibility that any given super-forecaster has been super-lucky. We are always looking for patterns of regression toward the mean. The more chance there is in a task, the greater the regression-toward-the-mean effect. And that's just something we're continually looking for. Our best estimates are that the geopolitical forecasting tournament sponsored by IARPA had about a 70:30 skill:luck ratio, based on the regression toward the mean effects that we were observing. Which means there's a big element of skill; and there's a significant element of luck. And based on other factors--like we introduce experimental manipulations that reliably improve accuracy. If it were pure noise it wouldn't be possible to do that. It wouldn't be possible to develop training modules or teaming[?] mechanisms that improve accuracy if we were dealing with a radically noisy dependent variable. It is possible to do that. So, various converging lines of evidence with individual difference evidence among the forecasters and experimental evidence suggest that we're not dealing with a radically indeterminate phenomenon here. The is such a thing as good judgment. But there is certainly a significant element of luck as well. Russ: One of the challenges when you read Warren Buffett's or Charlie Munger, his partner's, analysis of the market--they are really smart. They are full of interesting insights. So, it reinforces your view that maybe it's not luck. The challenge of course is that you don't know whether those particular insights really matter. Guest: That's true. Russ: In the universe of things that matter. Guest: Absolutely. We are in complete agreement on this subject.
23:35	Russ: Let's take an example from the book which I found really illuminating, which is an example of how--there is a role for skill, at least in some forecasting problems, some estimation problems. Which is, you give the example of you are told that there's a family; their last name is Renzetti; they have an only child. What are the odds that they have a pet? And talk about how you might think about that more thoughtfully rather than just saying, 'Well, I don't know,' or worse, 'Well, if they have an only child, that's important.' The inside/outside distinction, I found very illuminating. Guest: Well, it's part of a more general discussion in the book about what distinguishes superforecasters from regular forecasters, and that is the tendency of regular forecasters to start with the outside view and gradually work in. So, you would start your initial estimate, whether it's trying to estimate the number of piano tuners in Chicago or whether a particular family has a pet or whether a particular African dictator is likely to survive in power another year--all those kinds of examples, you would start by saying, 'What's the base rate of survival?' or 'What's the base rate of pets?' So, another example is this African dictator problem: We might ask a question about whether Dictator X in Country Y is likely to survive in power for another year. And you might shrug and say, 'I've barely heard of the country, less still the dictator.' But you do know a couple of things. You know more than you think you know. And one of them is that if a dictator has been in power more than a year or two, the likelihood that the dictator being in power another year is very high. It's 80, 90%-plus. So you could know nothing about the dictator or the country. You could say, 'Well, I know once someone has established a power base within a country, it's difficult to dislodge them.' Now, so you would start your estimation process with a high probability because of that fact. Just the simple, demonstrable, statistical fact. Then you would say, 'I'd better do a little bit of research and find out a little about this guy and his country.' If you discover that this person is 91 years old and has advanced prostate cancer, you might want to modify your probability. If you discover there's fighting in the capital, you might want to modify your probability. So it captures part of the distinctive working style of the superforecasters--is that they try to get as much initial statistical leverage on the problem as they can before they delve into the messy historical details. Russ: And I think all of us like the idea of evidence-based medicine, evidence-based forecasting, and your book is certainly a tribute to the potential for data and statistics to help improve our ability to anticipate events that are important. I guess the challenge is: Which evidence and how we incorporate the other factors. One of the--you tell a lot of interesting stories of the way the different forecasters, many of whom are just "amateurs," which is beautiful--they are not burdened by the Ph.D. that I have and that others have who tend to try to predict things. So you talk a lot about how they weigh evidence. The part I find intellectually challenging in accepting these results--there's two issues. One, and I want to make it clear: these amateur teams that you've put together and the experts and the aggregation of folks into teams with advice on how to work together and how to avoid group-think--which is a large part of the book, very, very interesting and very useful I think to anybody--in all of these examples, they dominate. It's not like they do 3 percentage points better than the others. I just want to make that clear, right? They really did a lot better than just some of the more educated folk and the so-called experts, correct? Guest: Well, when you throw everything together, the cumulative advantage does get to be quite staggering over the ordinary folks in the tournament. That's true. But you were talking about different components here. It certainly helps to have talent, and to get the right people on the bus. So, individual differences among--superforecasters are not just regular people. They are different. In measurable ways. They score higher on measures of fluid intelligence; they are more politically knowledgeable; they are more open minded. But most important--I think they have all those advantages over regular folks, and those matter. But I don't think they have those advantages over professional intelligence analysts. I don't think they have greater fluid intelligence; they definitely don't have greater knowledge. And I don't think they are even more open-minded, although they are pretty open minded. I think what really distinguishes the superforecasters from the seasoned professionals in the intelligence community whom they were able to outperform--and that was really I thought the most difficult of all the benchmarks--I think what really distinguishes them is that they believe that subjective probability estimation is a skill that can be cultivated and is worth cultivating. I think many of the sophisticated analysts, like many of the sophisticated pundits, when they see a question like, 'How likely is Greece to leave the Eurozone?' or 'How likely is Putin to try to annex more Ukrainian territory?' they'll shrug and they'll say, 'This is a unique historical event; there's no way we can assign a probability to this. You should have learned this in Statistics 101. You can learn to make probability judgments in poker and things like that; you can learn to distinguish 60:40 bets from 40:60 bets in poker. Because in poker you have repeated play and a well-defined sampling universe. And indeed, the frequent, the statistics everyone learns in Stat 101 apply. Those statistics just don't apply here, so you're engaging in an exercise in pseudo-precision.' You've got people with really high IQs-- Russ: That strikes me as [?] thought-- Guest: saying really smart things like this. And it blocks them from exploring the potential of learning to do it better. Which I think is what the IARPA tournament proved is possible.
30:02	Russ: Now, I want to try to take that criticism: I'll phrase it a little differently. If you asked me--let's predict--there's a football game. We are recording this on a Monday. There's a football game tonight; it's the Cleveland Browns against the Baltimore Ravens, if I remember correctly. It's not a very interesting game. And we want to figure out the probability that Baltimore is going to win. I think they are probably favored, okay? So they are supposed to win, but we know that they might not. So we'd like to know, though, what the probability is. Now, there are many ways to go about this question. The way that people in your book go about it is, they take a base rate--or this is one of the ways--a base rate like we talked about a minute ago: how many dictators have been in office x years, an additional year. Or in the case of the pet example, you didn't mention it but in the book you talk about what's the proportion of households that have pets? That would be a great starting place. And then you dig deeper and you try to find out more stuff. First of all, it's really hard to know what the base rate is, because there's the base rate of underdogs on a Monday night, there's the base rate of teams that have lost two games in a row. So, what then people start to do--and they can do it in football; it's harder, a lot harder to do with Greece exiting the Euro--is they try to accumulate statistical evidence in a systematic way. They run multivariate regressions. And they're pretty good at that--because of the nature of football. We're pretty good at narrowing down. We can look at past performance; we can take account of injuries that can mess things up. We'll never know, as Hayek pointed out in a different context, if the quarterback had an unsettling argument with his wife the night before or a bad meal at lunch that's affecting his play. But in football, we are pretty good at predicting probabilities. But we don't have those tools when we look at, say, Greece exiting. And worse than that: When we do have those tools we often can't do it very well. So, when we try to, say, estimate in epidemiology the effect of drinking a lot of coffee, whether you are more likely to get cancer: We can't measure that. So how are these people somehow absorbing all this information--you talk about how they read a lot and they talk and they share ideas and they bounce ideas off each other when they were doing this--how are they able somehow to home in with an accuracy without using a formal statistical model? Even when we use statistical models we can't do very well. Guest: Well, they are opportunistic, and sometimes they do find statistical models in unlikely places. One of the first places they would go for your football game is they would look at Las Vegas and what the odds are. They would do what you might superficially consider to be cheating: they would say, 'There are some very efficient information aggregators out there. Like there's Nate Silver at FiveThirtyEight and there's this and there's that. I'm going to take a look at each of those, and I'm going to average those and I'm going to take that as my initial estimate. And then if I know something about the quarterback's relationship with his spouse, I might factor that in, too, but I'm not going to give very much weight to it.' And that turns out to be a pretty good strategy. You are raising a deep philosophical question about the limits of precision. And I think it's just wonderful--this is one of the best interviews I've had, I think. It is a deep question: why don't we just say I don't know the answer to what the limits of precision are. You don't know what the answer is. Why don't we run studies like the IARPA tournament and find out where they are? And that's [?] what IARPA did--adopted a very pragmatic attitude and said we can get you to hunker down in philosophical positions and I could say I'm a Bayesian, you're a frequentist, and I think we can [?] probability estimations here; and you think, no, there's just too much noise and there's not enough [?] opportunities. We could argue about that until the cows come home. But really the right thing to do here is to run forecasting tournaments and explore what the limits of precision are. And there are real limits of precision in the IARPA tournament. I mean, the very best forecasters on average are not doing much better than assigning a 75% probability for things that happen and then 25% probability for things that don't. So there's still a lot of residual uncertainty here. There are big pockets of uncertainty, and in the tournament there's lots of room for error. And they make lots of errors. What we simply showed is that it's possible to do something that very smart people previously supposed was pretty impossible.
34:44	Russ: Let me ask a different question. I guess there are two issues related to that empirical finding. One would be: could you do it again? Right? It would be a question of replication. Could you replicate the success with the same team; do you think they would continue to outperform the benchmarks? That would be the first question. The second question is what do you do with it? Well, I'll let you answer the first one first. Is there any plan to try to replicate these results? Guest: Well, we're doing that. IARPA's doing exactly the right thing: they are setting up a mechanism for exploring how applicable these results are. So, we're going to be running more forecasting tournaments. One of the reasons I invited your readers to participate in GJOpen is they can explore their skills. And who knows--there might be some superforecasters listening right now. Russ: Hear, hear. And then the next question would be: Is this really valuable? So, you have to have contingencies; often, you want to have contingency plans for the contingencies that could happen. So you want to know whether it's really like that Greece will leave the Eurozone, or you want to know whether China's going to do such-and-such militarily; you want to know whether there will be a coup in this or that country. But does it affect our actions to know that it's really 73% rather than 58%? So, what's the consequences--do you believe that improving those probabilities are going to lead to better policy? Guest: You know, again--at the risk of flattering the interviewer, that's just a superb question. It depends on the domain. If we were talking about pricing futures options on oil, I think Wall Street professionals would say 'Yeah. I would really want to know the difference between a 60:40 probability and a 40:60 probability.' Aaron Brown, the Chief Risk Officer of AQR (Applied Quantitative Research) said as much when we interviewed him for the book. And I know that's a common attitude for people in the hedge fund world. So in that world, options pricing and finance, I don't think there's much question about it. Poker, I don't think there's much question about it. Now, we had an opportunity to talk to a senior official in the intelligence community about the project a year or so ago, and we asked, 'Well, if you had known that the probability of Russian incursion into the Ukraine was not 1% but it was 20% during the Sochi Olympics, would you have done something different?' and we got an interesting reaction. It was: 'I've never even heard a question like that before.' Russ: That's fascinating. Guest: And it is a fascinating problem. It raises very deep questions. I think the short answer is: Everybody would agree you are not worse off with probability estimates than worse probability estimates. In the long run. I don't think there's any--The question would be: Are the increments, improvements in accuracy we're able to achieve, do they translate into enough better decisions in a given domain to justify the cost of achieving those improvements? I think that would be your question, in the intelligence context. And that is I think something that the intelligence community is quite sensibly exploring right now. Russ: I think the other question is: You might lead yourself down a path of thinking you've got more certainty than you actually do. So there's a downside risk, also, of using a more organized method. Even though we all want to, we all think that's got to be better, it doesn't have to be. Unfortunately. Guest: Right. But the opposite error is also possible. Psychologists, you are right, tend to emphasize the dangers of overconfidence. But there is also the danger of underconfidence. So we talk about the situation in which President Obama was making a decision about whether to go after Osama bin Laden, and he was probably--he drew an underconfident conclusion, we think, from the probability judgments that were offered to him in that room. When you have people with different expertise and different points of view all offering probabilities--almost all of them offering probabilities about 50%, what's the right way to process those probabilities? Do you simply take the median? Or should you take something more extreme than the median? And that was one of the issues our statisticians wrestled with. Treat it as a thought experiment. In the thought experiment, you are the president of the United States. Around you have a table of elite advisers, each of you offers you a probability estimate that Osama bin Laden is residing in a mystery compound in Abadabad, Pakistan. And each one says, 'Mr. President, I think there's a .7% probability.' And the next one, 0.7. 0.7. All around the table, uniform .7. What conclusion should the President of the United States draw about whether Osama is there and whether to consider going to the next step of launching a Navy Seal attack? Well, the short answer is, it depends on whether the advisers are clones with each other or not. If they are clones of each other, the answer is 70%. They are all drawing on the same information, processing it in the same ways. 70%. There is no incremental information provided by each 70%. But if they are drawing on different types of evidence and processing it in different ways--there's one guy with satellite information, another is a code-breaker, another is human intelligence, and so forth--if they are drawing on different sets of information, processing it in different ways, each of them still arriving at 70% but not knowing the information the other people had when they reached their 70%--now, what's the correct probability? And the answer to the question I've just posed is mathematically indeterminate but it's statistically estimatable. And we did statistically estimate it, over and over again, during the course of the IARPA tournament. It was one of the big drivers of our forecasting success. And typically in the IARPA tournament you would extremize. You'd move from 70% to 85 or 90%. There you know more than you think you did. Russ: Explain. When you said you'd 'move,' I didn't understand. You are saying that's what you would discover? Explain that. Guest: Well, it's a question of who your advisers are. If your advisers are all drawing on the same information and reaching 70%, when you average their judgment it's going to be 70%. Russ: You only really have one estimate. And you are fooling yourself if you think you have 10. Guest: That's right. But if the advisers are all saying 70% but they are drawing on diverse sources of information, it's a little counterintuitive but the answer is going to be quite a bit more extreme than 70%. Now, how much more extreme is going to be a function of how diverse the types of information there are and how much expertise is in the room. And these are difficult-to-quantify things, for sure. What our statisticians did they used an extremizing algorithm that simply--when the weighted average of the best forecasters was tilted on one side of maybe or another, they extremized it: they moved from 30% down to 15, or from 70% up to 85. Russ: You are saying when they had that average estimate of 70% they actually--they were--they pretended it was higher. They gave it more confidence than just the 10 because they drew on different information. Guest: Well, our statisticians--we submitted forecasts at 9 a.m. Eastern time every day during the forecasting tournament. So there's no wiggle room here. I mean, this is very carefully monitored research, right? This doesn't have the problems of some research where you know people can have wiggle room. There's no wiggle room here. This is being run like a bank, with transaction, every day. And we were betting on those aggregation algorithms. And that particular extremizing algorithm I'm describing in verbal terms right now, was essentially the forecasting tournament winner. It was more accurate than 99% of the individual superforecasters--from whom the algorithm itself was largely derived. Russ: Yeah. I just want to emphasize again that going back to our earlier discussion, is that it didn't really mean that the number was 0.8 or 0.9. I'm not sure that's a meaningful number. Just that you were more confident that it was Osama bin Laden, say, than the 0.7 number suggested. That would give some comfort to the President in launching a Seal attack. Another way to say it would be that even though they all thought it was more likely than not, .7, if it came from different sources of information you could be more confident it was more likely than not. It was even closer to certain. Guest: Exactly right.
43:45	Russ: I want to come back to the wisdom of crowds in a minute and the aggregation issue. But since we are talking about a president making a decision, you have some interesting thoughts on how a leader balances humility with confidence. And I find--we talk about this on our program a lot: I'm skeptical, but sometimes I'm too skeptical. I need to be more skeptical about my skepticism, because I have trouble accepting things that might be true that go against my skeptical beliefs. So, you deal with that in the book: that a leader--most leaders are not very skeptical. They seem to be bold. Winston Churchill would be a quintessential example you mention. They don't say, 'Well, it could be 73 or 80. They say, 'Welp, we've got to move forward.' Talk about this issue of balancing humility and overconfidence. And confidence, I should say. Guest: Yeah, that's--this is a topic, if I were to write a sequel book it's one that I would very much want to feature prominently in the book. Well, let's use a sports analogy. I'm not a big sports fan, but my co-author, Dan Gardner, is a big hockey fan. And he's a Canadian. And I was actually born in Canada myself, but I'm U.S. naturalized. But the Ottawa Senators were apparently in a Stanley Cup final one year, and they were down 3-to-1 in the series. Russ: Best of 7 series. Guest: Best of 7 series, right. And some reporter thrust a microphone in front of the coach's mouth and said, 'Hey, Coach, you think you've got a chance?' And the coach, instead of doing what coaches are supposed to do and say of course, we're going kick butt; we're going to do it-- Russ: We just won three in a row; we've done it before; we'll do it again-- Guest: he went into superforecaster mode. And he said, 'Well, what's the base rate of success for teams that are down 3-to-1? Doesn't look very good, does it?' Russ: It's a long shot. Guest: It's a long shot. This is not what coaches or leaders are supposed to do. And it raises the question of what are the conditions under which leaders are supposed to be liars. Russ: Yeah. And what's the answer? Guest: Well, that's why I said the next book, Russ. We do talk about it in the Superforecasting book at some length. And we have some interesting, I think military examples and some other examples as well, of where, you know, the need for leadership and the need for confidence are in some degree of tension: the need for circumspection and the need for confidence are in tension with each other. Russ: Yeah. I'm--it's always fascinated me how hard it is for a leader to say, ex post, 'I made a mistake.' Or a pundit to say, 'I made a mistake.' Most of them don't. They hedge. They say, 'Well, I had that in mind. And here's the words that suggest that I knew that.' Or, 'I didn't know this piece of information. If I'd known that, of course I wouldn't have--dah, dah, dah.' Or my favorite: 'It wasn't a mistake. Everyone else thinks it's a mistake. They're wrong! It was a great thing.' So, you get the whole range. I think there's a--you are a psychologist. I think the psychological challenge of admitting a mistake and being--you know, it's one thing to say, 'Well, I know it's a long shot but I won't say it because that would be bad for the team.' But I suspect a lot of great leaders don't even think that it's a long shot. They just say, 'We're going to win.' And they actually believe it. Guest: Right. And there's a question about whether you'd prefer to have a leader who believed--who was capable of self-deception--or a leader who was capable of being two-faced and has one set of private numbers and a set of public numbers. Russ: Yeah. For sure.
47:27	Russ: Let's talk about the wisdom of crowds, which you referred to a few times in the book. And we implicitly talked about it a minute ago. Talk about how you aggregated folks and how you avoided the cloning problem or the group-think problem in your estimates. Guest: There was a big argument in our research group early on about whether it would be good--better for our forecasters to work as individuals or to work as teams. And the anti-team faction correctly pointed to the dangers of group-think and all the other dysfunctions of--anyone who has ever worked in a team knows how bad teams can be. Russ: Bullying-- Guest: All of the above. And then there was another group that said, 'Look, there are conditions under which teams can be more than the sum of their parts; and if we give them the right guidance on how to work as a team, they can deliver some great stuff.' And we resolved it by running an experiment. And it turned at the end of the first year that teams were better. Significantly better. How much better? Maybe 10%. What we are talking about is many small factors and few big ones that cumulatively produce a really big, big advantage for the superforecaster teams. So, superforecasters do better because they have certain natural or/and acquired talent advantages. They do better partly because they work in a cognitively enriched environment with other superforecasters. They do better partly because we've given them a lot of training and guidance about how to do probability estimation, and they've taught each. And for that matter, the superforecasters have taught us things. So our training has become better by virtue of the feedback from the superforecasters. And then, finally, they do better because of the algorithms. Russ: Yeah. One thing I want to make clear, which we didn't stress enough: These folks who are doing this--and as you said, there are a lot of forecasts and they do it on an ongoing basis--these are not people doing this as a full-time job. And these aren't professors of political science forecasting what's going to happen in the South China Sea. These are just really smart everyday people who are doing this on the side. Correct? Guest: Well, I wish we had more professors of political science. We have a few. But we don't have is what I'd hope-- Russ: Some of my best friends. Some of my best friends are professors in political science, I should add that. But go ahead. Guest: Some of mine, too. Right. Well, who are the superforecasters? So, the media like to focus on the superforecasters who are the most counterintuitive. So, there's Angela Kenney who is a housewife in [?], Alaska. And there's a social case worker in Pittsburgh. And there is a person who works as a pharmacist in Maryland. Other superforecasters work as analysts on Wall Street. Or were previously analysts in the intelligence community. Or work in Silicon Valley. Or are really adept software programmers who develop interesting tools for helping people decide which tools to focus on and how to winnow[?] media sources and so forth. So, superforecasters are really quite varied. Some of them, more like your stereotype of what you'd expect a superforecaster to look like. Some Silicon Valley, Wall Street, IQ of 180-type. And others look a lot more like intelligent thoughtful citizens who you run across in everyday life. And it's interesting how they get along. It's actually a wonderful dynamic to behold, in the super-teams. They are a diverse group. And they're very clever at working out what their sources of comparative advantage are in dealing with problems and allocating labor. They create an effect--many organizations. In effect what they created were mini-intelligence agencies. That were generating probability estimates more accurate than were coming out of many intelligence analysts. Russ: You said you gave them advice. You didn't just throw them into teams and say, 'Good luck! Work it out!' You did some very thoughtful things that the book describes to get them to perform effectively as teams rather than as clones or a group thing. Guest: We did. We did. Because there's always a tension in groups. To get to the truth in groups you often have to ask questions that might offend people a little bit. So, mastering the art of disagreeing without being disagreeable, and mastering what the art of what some consultants in California call precision questioning--we found that to be a very useful tool to transfer to all of our teams, our regular forecasting teams and superforecasting teams. Because we have thousands of forecasters, many experimental conditions here. Russ: Give us a 2-sentence of precision questioning. What is that? Guest: Well, when someone makes a claim like, [?] is declining in popularity as the world's most popular pastime, you would want to figure out what exactly they mean by key terms--by whether the things that are included in pastimes--what do they mean by 'decline'? You want to get them to be more specific than people normally are. And when you start probing people, they are often unable to become more specific. And they often--then they feel, with a big probe, they feel irritated: 'Quit bugging me' about this. So, superforecasting teams have learned to push the limits of precision but maintain reasonable etiquette inside the group. And I think that's crucial for getting this, you know, more underlying question you've been raising throughout the conversation, which is: What are the limits of precision? When does precision become pseudo-precision? Russ: When you start using-- Guest: And you don't need--know until you test it. Russ: Well, the simple answer is when you start using decimal points.Guest: 73.2--well, you know, I think that's right. I think when you look at how many degrees of uncertainty--let's say for the sake of argument that we treat the probability scale as having 100 points along is rather than having infinitely divisible--let's just say it's a 100 point probability scale, which is the one we actually use in the tournament: How many degrees of uncertainty were superforecasters collectively usefully distinguishing when they make their forecast? You can estimate that statistically by rounding off their forecast a tenth or so forth. Russ: Right. Guest: And I think our best estimate is they can distinguish somewhere between about 15 and 20 degrees of uncertainty, along a probability scale. Most people distinguish about 5, 4. Or somewhere between 4 and 5.
54:05	Russ: We're almost out of time. I want to get to an economics issue that you raise in the book, which is often on my mind. Which is--I'm going to couch it the way listeners here would expect--which is: We passed this enormous, seemingly enormous--you can debate whether it was enormous or not because there's always a debate after the fact of whether the base conditions held. We passed a seemingly enormous stimulus package to fight the recession. And there were some predictions made, sort of, about what that would achieve. And there were people on one side of the fence who said it was going to end unemployment over a certain period of time. Other people said it was going to make things worse. A lot of people just said, 'Oh, I really like it,' or 'I really don't,' without making any kind of even beginnings of a quantitative prediction. But then the dust settled; and afterward, everybody said, 'I was right.' On either side of the debate. Guest: That's familiar. Russ: And I find it deeply troubling that, in economics in particular, but it's elsewhere, that there's no accountability. And if there is no accountability, why do we even begin to pay attention? If there is no authoritative way, even mildly authoritative way to assess whether a prediction is accurate, whether a model is accurate, whether a policy prescription is fulfilled--how can we make any progress? And I don't see it in my profession. You suggest the possibility of some ways we might hold people's feet to the fire and at least have some accountability. How might that work? Guest: I think you might be referring to the proposal of adversarial collaboration tournaments-- Russ: Yes. Guest: And we use the example of, Niall Ferguson and Paul Krugman and the debate over Quantitative Easing. Russ: Yup. Guest: Right. Well, I think it's a great model. It's--it has some utility in science; I think it has some utility in public policy debates. We're running an Iranian nuclear accord tournament now on GJOpen. Here's one of the key things you would do. Each side would have a chance to nominate 5 or 10 questions that it thinks it has a comparative advantage in answering. The questions have to be relevant to the underlying issue and they have to pass the clairvoyance test--which means that they have to be rigorously scorable[?] for accuracy after the fact. And victory has a pretty clear meaning in this kind of context. It means you not only can answer my question--you not only can answer your question better than I can--you can answer my questions better than I can. And that leaves me in an awkward situation. Because I can't simply say, 'Well, you posed some stupid questions.' I would have to say, 'Well, my questions were stupid, too.' It's much more awkward. Now, of course, pundits are going to be very reluctant to engage in a game like this. I mean, why would you want to engage in a game where it's all--yes, the best possible outcome is not nearly enough to justify the risk? The only way we are ever going to induce high-status pundits to agree to participate in level playing-field forecasting tournaments in which they pit their predictions about the future against their competitors is if the public demands it. And if there is a groundswell demand for that, if pundits feel that their credibility is beginning to suffer because they are refusing to offer more precise and testable predictions vis-a-vis their competitors, then I think that would be the only force on earth capable of inducing them to do it. Russ: Yeah. I think there's a shame factor. I think an external source, maybe this program, could shame some people into participating. But it's an interesting question. One of the challenges, I think, in economics--and it's somewhat of a question I think when you think about it carefully in other fields as well--is: What are you measuring? So, if what we really care about, say, is not the whole picture, but if we are worried about whether the minimum wage causes unemployment, one reason people would say, 'Well, I can't participate in that, because there are so many other factors besides an increase in minimum wage. I can't guarantee that they are not going to be in place.' Guest: Well, it's one[?] of probability. Russ: What? Guest: We just want a probability. Russ: Yeah. But I just want to say, if you can't, then you should shut your mouth. Because you are just talking. And I do it, too. I shouldn't say it's just them. But I don't pretend mine is scientific in the way that they sometimes do with empirical data. I just--I'm trying to rely on my principles, which I think are pretty reliable. But I'm probably at risk of fooling myself there, too. Guest: I think that minimum wage would be a wonderful example where adversarial collaboration could work, because there are so many states and municipalities taking independent action on that front. Russ: Yeah. It may be something I can enable, if I play my cards right.

Time

Podcast Episode Highlights

0:33

Intro. [Recording date: November 30, 2015.] Russ: You start with a lot of criticisms, throughout the book I'd say, you have a lot of criticisms of pundits. Some of those have Ph.D.'s and some are journalists and some are just so-called experts, who make predictions. But it turns out a lot of those, you can't really hold their feet to the fire when it comes time to judge whether their predictions are accurate or not, are they good forecasters or not. And why is that? What's the challenge with our sort of day-to-day world where people claim that something is going to happen and then print it in the newspaper? Guest: Well, the pundits of whom you say we're critical, you are probably thinking of people like Tom Friedman or Niall Ferguson, people on the left or people on the right, we identify all sorts. They are all pretty uniformly very smart people. They are very articulate; they are very knowledgeable. They offer, make many observations about world politics and economics that seem very insightful. It is extremely difficult, however, to gauge the degree to which their assessments of possible futures, the consequences of going down one policy path or another, are correct or incorrect because they rely almost exclusively on what we call vague verbiage forecasting--they don't say that there's a 20% likelihood of something happening or an 80% likelihood of something happening. They say things like, 'Well, there's a distinct possibility that there will be global deflation in 2016.' Now, when you ask people what ' distinct possibility' could mean, it could mean anything from about 20% to 80% probability, depending on the mood they are in when they are listening. Russ: I didn't mean to suggest you are critical of them, although you sometimes are. But, you are critical of our culture that takes these vague pronouncements and then there's a gotcha game that gets played by people on the other side. But of course there's always a way to weasel out of it because there's usually some hedging in that verbiage. Correct? Guest: Well, that's right. If you exist in a blame-game culture in which people are going to pounce on you whenever you make an explicit fallibility[?] judgment that appears to be on the wrong side of 'maybe,' it's pretty rational to retreat into vague verbiage. So, we talk in the book about a brilliant journalist, a New York Times journalist, David Leonhardt, who created the Upshot, a quantitative column in the New York Times; and he wrote a piece back in, I guess it was 2011 or 2012, when the Supreme Court narrowly upheld Obamacare by a 5-4 margin. And prediction markets had been putting a 75% probability on the law being overturned. And David Leonhardt, who doesn't have any grudge against prediction markets as far as I know, concluded that the prediction markets got it wrong. Now, that's a harsh judgment on the prediction markets because they make hundreds of predictions on hundreds of different issues over years, and they are not bad. When they say there's a 75% likelihood of something happening, it's pretty close to a 75% likelihood. Which means that 25% of the time it doesn't happen. So if you are going to throw out a very well-calibrated forecasting system every time it's on the wrong side of 'maybe,' you are not going to have any well-calibrated forecasting systems at your disposal. Russ: I would say that's a second problem, really, which is that even when you do quantify your prediction, by definition you are allowing the possibility that it doesn't happen. And then the question is: How do you assess the accuracy or judgment of the person who makes a statement like that? Guest: Yes. Exactly. And that requires some understanding of probability. And some patience and some willingness to look at track records over time.

4:57

Russ: So, let's begin with your particular track record. You've done a lot of research in this area, on this question of whether prediction is possible, how accurate is it, are experts good at forecasting? Talk about your background. We're going to get to the tournament that's at the heart of your book, but I want to start with your research history and what you found in the past and how people reacted to it. Guest: Well, I guess that's another way of asking just exactly how old must I be. Because I've been doing longitudinal forecasting tournaments for a long time. So, let's just put on the table: I'm 61 years old and I got started at this right after I got tenure at the U. of California at Berkeley. And I was a little more than 30 years old--it was 1984. And the Soviet Union still existed; Gorbachev had yet to become General Secretary of the Communist Party of the Soviet Union. And we did our initial pile of studies back in the mid-1980s when people, hawks and doves, were arguing about the best ways of dealing with the Soviet Union. And now we're doing forecasting tournaments as hawks and doves are arguing about the best ways of dealing with the Iranian nuclear program. Or for that matter, for dealing with Russia and the Ukraine. So, we've been running forecasting tournaments off and on for 30-plus years. The first big set of forecasting tournaments were done in the late 1980s and the early 1990s and were reported in a book, Expert Political Judgment, that came out in 2005. And the second wave of forecasting tournaments were much larger, involving many thousands of forecasters, a million plus forecasts, and were sponsored by the U.S. intelligence community. And they ran from 2011 to 2015 and in fact they are still running. So if your readers are interested in signing up for an ongoing forecasting tournament they should[?] visiting the website at gjopen.com. Russ: Going back to the earlier work that you did and before the fall of the Soviet Union, what were some of the main empirical takeaways from that work? Guest: Well, one big takeaway was that--liberals and conservatives had very different policy prescriptions, and they had very different conditional forecasts about what would happen if you went down one policy path or another. And that nobody really came close to predicting the Gorbachev phenomenon. Nobody, for that matter, came really close to predicting the disintegration of the Soviet Union later on. But everyone after the fact seemed to have an explanation that either appropriated credit or deflected blame. Russ: And it was consistent with their worldview, I'm sure. Guest: And meshed perfectly with their prior worldview. So, it was as though we were in an outcome-irrelevant learning situation. It didn't really matter what happened--people would be in an excellent position to interpret what happened as consistent with their prior views. And the idea of forecasting tournaments was to make it easier for people to remember their past states of ignorance. Russ: This is an aside of sorts, but it's just a wonderful insight into human nature, and it's a theme here at EconTalk. Which is: When you went back and asked people to give what they remember as their probability of, say, the Soviet Union falling, what did they say? Guest: Well, they certainly thought they assigned a higher probability to the dissolution of the Soviet Union than they did. And there were a few people who assigned really low probabilities who remember assigning higher than a 50% probability. So, people really pumped up those probabilities retrospectively. So, the psychologists call that 'hindsight bias' or the 'I knew it all along' effect. And we saw that in spades in the Soviet Forecasting Tournament. Russ: Yeah. I think that's an incredibly important thing that we all tend to do. We tend to think we had much more vision than we actually had. And we usually don't write those things down. You happen to have written some of them down. So that was awkward, that they actually had their original forecasts. But most of us, the I-knew-it-all-along problem is a bigger problem for most of us because we don't write it down. Guest: Well, we truly remember it differently. Even if you think the person on the other side of the table knows what the correct answer is, you still tend to misremember it.

9:30

Russ: So, this more recent tournament was rather remarkable. Give us the background of who competed and your role in it and how it was set up. And what some of the questions, for example, were that people were competing on. Guest: Sure. This was work I did jointly with my wife, research collaborator, Barb Mellers, and we were faculty then at the U. of California, Berkeley. And we didn't leave for the U. of Pennsylvania till about 2010. But we were visited by three people from the Office of the Director of National Intelligence when we were at Berkeley, I guess late in 2009. And at least two of them were quite enthusiastic about the idea of the U.S. intelligence community using some of the techniques that were employed in my earlier work, Expert Political Judgment, for keeping score on the accuracy of intelligence analysts' judgments. And that was the core idea behind what became known as the IARPA (Intelligence Advanced Research Projects Activity) Forecasting Tournaments. IARPA is the research and development branch of the Office of the Director of National Intelligence. Which is the umbrella organization over all intelligence agencies like CIA (Central Intelligence Agency) and DIA (Defense Intelligence Agency) and [?] and so forth. And all 16 of them. And the idea would be they would have a competition; and major universities and consulting operations would apply for large contracts to assemble teams whose purpose would be to assign the most realistic probability estimates to possible futures that the U.S. intelligence community deemed to be of national security relevance. So, those turned out to be questions on everything from Sino-Japanese clashes in the East China Sea to, recently, the Eurozone and Spanish bond yield spreads to Russian relations with the [?] Estonia, Ukraine, Georgia. Of course, conflicts in the Middle East; Ebola; H1N1 (flu strain) issues. Just an enormous range of issues. 500 questions over about 4 years. And the goal would be on each of the research operations would be to come up with the best possible ways of assigning probability estimates. Now, they screened everybody for their academic bona fides: they wanted to make sure that everybody was legit; they weren't using Ouija boards or anything like that. But the other thing-- Russ: That would be cheating. Guest: Now the U.S. intelligence community was simply interested in who could generate the most accurate probability estimates for these extremely diverse questions. And they didn't really care whether we took a more psychological approach or more statistical approach or a composite approach. What they cared about was accuracy. And that was it. Accuracy, accuracy, accuracy. So, we--our group, my wife and I put together this group called the Good Judgment Group, which is an interdisciplinary consortium of wonderful scholars. And we went out about--we tried to recruit good forecasters to--and we tried to give them the best possible training in principles of probabilistic reasoning; and we assembled some of them into teams, and we gave them guidance on how teams can work effectively together. And we put some of them into prediction markets and we wanted to see how well prediction markets would work. We experimented with a lot of different approaches. And we also had really good statisticians who experimented with different ways of distilling wisdom from crowds. So, our approach was very experimental. I think some of the other approaches were experimental as well. But our experiments worked out better than their experiments, so we won the tournament by pretty resounding margins in the first two years. Sufficiently resounding that the U.S. intelligence community decided to funnel the remaining money into one big group, which would be the Good Judgment Project, which could hire the best researchers from other teams. Russ: Who were you competing against? Guest: Well, we originally were competing--different competitive benchmarks here. Originally we were competing against the other institutions that received contracts from the government, like, oh, gosh--MIT (Massachusetts Institute of Technology) and the U. of Michigan and George Mason U., places like that. Then later, we were competing against a prediction market that we ourselves were running, a firm known as Inkling; and also against internal benchmarks--U.S. intelligence analysts themselves generating probability estimates and competing against them, although that was classified because of course the U.S. intelligence analysts were classified. But David Ignatius at the Washington Post leaked some of that information, again, I think the second year or third year. Russ: But after two years, your team trounced everybody. And then what happened going forward after that? Guest: Well, we were able to absorb resources from the other teams, because the government was obviously saving a lot of money by suspending the funding of the other teams; and we were able to consolidate some resources. And we were able to compete all the more aggressively against the other remaining benchmarks. The key benchmarks for us to beat were an external benchmark--the prediction market run by Inkling and the more confidential one inside the U.S. government.

14:59

Russ: Now, you mentioned--this is just--well, actually, I'm going to read a quote from the book, which I loved, which is relevant, which is from Galen, the early physician. And what time period did Galen live? Roughly? Guest: I guess he is a second century after Christ--it was roughly 2000 years ago. Russ: Okay. I thought he was later than that. So, he wrote a long time ago, and you write the following: that he wasn't into experiments, and you wrote the following. Here's the quote:

Galen was untroubled by doubt. Each outcome confirmed he was right no matter how equivocal the evidence might look to someone less wise than the master. [And here's Galen's quote] "All who drink of this treatment recover in a short time, except those whom it does not help, who all die," he wrote. "It is obvious, therefore, that it fails only in incurable cases."

So, what could be better than that? I mean, that's phenomenal; and I think that's where you apply the quote--even the pundit who puts a numerical value on a certain event happening as a 63.7% chance that this will happen, whether it happens or not, if it does happen, he says, 'See, I told you it was a 63.7%' and if it doesn't happen he can say, 'Well, I said there was a 36.3% chance that it wouldn't happen. So when it didn't happen, I'm still right.' So, the question then becomes, when you say you trounced the other teams, there has to be a way to evaluate probabilities; and in the book you present the Brier scores. So, try to give us the flavor of how you measure success in prediction. Guest: Oh, that's an excellent point. It really isn't possible to measure the accuracy of probability judgment of an individual event, unless the person, forecaster, is rash enough to assign a probability of zero and it happens with a probability of 1.0 and it doesn't happen. Otherwise the forecaster can always argue that something improbable happened. So, assessing the accuracy of individual events is impossible, except in those limiting cases. But it is possible to assess the accuracy of many events and across many time periods. So, good judgment in world politics means you are better than other people at assigning higher probabilities to things that happen than things that don't happen across many events, many time periods. Russ: So, the example would be--let's take a particular example. We are going to try to forecast the probability of Greece leaving the Eurozone. So, I say it's .51 and you say it's-- Guest: I say it's-- Russ: Excuse me. Let's do it the other way. Guest: I say it's .15. Russ: No, let's go the other way. Guest: Okay. Russ: I'm going to go .49, because I think it's not likely. Because it's below .5. I say .49 and you say .1, and it doesn't happen. So my argument is that you did a better job than I did. Guest: You don't know that for sure with respect to Grexit [Greek exit]. That's correct. You do know it probabilistically across the full range of questions posed in the [?] tournament. Now, insofar as you've been predicting .49 consistently over several years, and I've been predicting .1, and it doesn't happen, you might be tempted to draw the conclusion even with respect to Grexit that I've been closer to the truth. Russ: You might be. One of the things I've found troubling about the setup and the way of assessing good judgment--and one of the things your book makes one ponder--is just how hard it is to assess whether someone has good judgment. Guest: That's absolutely true. I couldn't agree more. It's a very difficult concept operation-wise. Russ: Yeah. So, this particular way--even though--so let's take this case. Let's say there's 10 things where I tended to predict .45 and you predicted .1, and none of them happen; so that we're both "right" in that we both thought it was below a half--it was less likely. But you were more right than I was, because--because what? And here's what I want you to respond to. It seems to me you could argue you just have more confidence than I did. You were more strategic in how you picked your number. You didn't have any more accurate knowledge of the actual probability. Guest: Well, how many times did you have to flip that coin before you decided that the person who claims the coin is biased is closer to correct than the person who claims the coin is very close to equilibrium? Russ: Well, that's a challenging question. While I was reading the book, I thought of Bill Miller, of Legg Mason. So, Bill Miller beat the S&P500 I think for at least 15 years in a row, maybe more. And a lot of people concluded him to be a genius, because, well, he beat the S&P500. One year, not so impressive. But 15 years--that's so unlikely. But of course we know that doesn't prove he's a genius; it doesn't even prove he's smart. It might merely mean he was lucky. Out of the thousands and tens of thousands of managers of mutual funds, he was the one who happened to beat the S&P500 15 years in a row; and we know that over enough time and enough managers, that's going to happen. And so we know nothing about his ability going forward. And in fact, he didn't do particularly well after his streak was broken. Did he get less smart? Did he get overconfident? We have no way of knowing. So, I find myself--even though I found many things in the book that are useful, and thinking thoughtfully about looking into the future, the fundamental measurement technique strikes me as a challenge. What do you say to that? Guest: I think that is a great question. I say, really, really deep, question. People in Finance argue, of course, about whether there is such a thing as good judgment--if you are a really strong believer in the Efficient Markets Hypothesis, you are going to be very skeptical. If you toss enough coins enough times, a few of them are bound to wind up heads 60, 70, 80 times in a row. If you just keep doing that. And there are skeptics who argue that Bill Miller--or for that matter, Warren Buffett or George Soros--were just one of those lucky sequences of coin flips. And then we anoint them geniuses. We are very sensitive to the possibility that superforecasters could be super-lucky. And we are always open to the possibility that any given super-forecaster has been super-lucky. We are always looking for patterns of regression toward the mean. The more chance there is in a task, the greater the regression-toward-the-mean effect. And that's just something we're continually looking for. Our best estimates are that the geopolitical forecasting tournament sponsored by IARPA had about a 70:30 skill:luck ratio, based on the regression toward the mean effects that we were observing. Which means there's a big element of skill; and there's a significant element of luck. And based on other factors--like we introduce experimental manipulations that reliably improve accuracy. If it were pure noise it wouldn't be possible to do that. It wouldn't be possible to develop training modules or teaming[?] mechanisms that improve accuracy if we were dealing with a radically noisy dependent variable. It is possible to do that. So, various converging lines of evidence with individual difference evidence among the forecasters and experimental evidence suggest that we're not dealing with a radically indeterminate phenomenon here. The is such a thing as good judgment. But there is certainly a significant element of luck as well. Russ: One of the challenges when you read Warren Buffett's or Charlie Munger, his partner's, analysis of the market--they are really smart. They are full of interesting insights. So, it reinforces your view that maybe it's not luck. The challenge of course is that you don't know whether those particular insights really matter. Guest: That's true. Russ: In the universe of things that matter. Guest: Absolutely. We are in complete agreement on this subject.