Russ Roberts

Campbell Harvey on Randomness, Skill, and Investment Strategies

EconTalk Episode with Campbell Harvey
Hosted by Russ Roberts

Campbell Harvey of Duke University talks with EconTalk host Russ Roberts about his research evaluating various investment and trading strategies and the challenge of measuring their effectiveness. Topics discussed include skill vs. luck, self-deception, the measures of statistical significance, skewness in investment returns, and the potential of big data.



Highlights

Time
Podcast Episode Highlights
0:33Intro. [Recording date: March 6, 2015.] Russ: Our topic for today is in some sense randomness, one of the deep ideas in thinking about complexity and causation. As a jumping off point, though, we're going to use a recent paper you wrote with Yan Liu, "Evaluating Trading Strategies," which was published in the Journal of Portfolio Management. And we may get into some additional issues along the way. Let's start by reviewing the standard way that we evaluate statistical significance in economics, for example, or other applications of regression analysis. You'll hear people talk about a t-statistic being greater than 2. And what does that represent? What are we trying to measure there? What are we trying to assess when we make a claim about significance of, say, one variable on another? Guest: So, the usual procedure--we actually think about trying to minimize the chance that a finding is actually a fluke. And it comes down to a concept called the p-value or probability value. And what we usually try to do is to have 95% confidence that the finding is actually a true finding. And by definition, then, there's a 5% chance that the finding is a fluke. And when you do that in standard sort of statistical analysis, that is a so-called 'two sigma' (2-σ) type of rule. And often this is quoted popularly, in surveys and things like that, [?], a confidence level of plus or minus a few percent. And that's the same 95% confidence interval that leads to this two sigma rule which is the same thing as a t-statistic of 2. Russ: And this is a convention in economics, that 2 standard deviations, two sigma is therefore probably not a fluke. The 95% level of significance. And I want to add one other important point before we go on: when we talk about significance, all we mean in this technical conversation is 'different from random': that there is some relationship. It doesn't mean what it means in everyday language, which means important. 
So a finding can show a relationship between two variables that's significant but quite small. So it's significant statistically but insignificant in real life. Correct? Guest: Yeah. There's two different concepts and both of them are important. We're talking about statistical significance by a two sigma rule. There's another concept that's equally important called economic significance: is this fact really a big deal or is it small in terms of the big picture of things? Russ: So, as I said, it's a convention that 95% means, Well, there's only a 5% chance. And for many people that sounds--and many economists accept, that that's like, well, if it's only 1 in 20 then it's probably real. We've ruled out the likelihood that this is just a fluke. But as you argue in your paper--and we're going to talk about some different examples of this--when the number of tests that we're making starts to increase, that statistical technique is not as convincing. So, to set that up I'd like you to talk about the Higgs boson. Which seems far away from finance, but I found that to be a fascinating example to help us think about it. Guest: Yeah, certainly. So the Higgs discovery was complicated. It was complicated for many reasons. One, they had to build a collider that cost $5 billion to construct. But once it was constructed, they knew what they needed to find. And this was a particular decay signature that would be consistent with the Higgs boson. But the problem was that that same signature could arise just by random chance. And the number of collisions that they were doing and signatures that were being yielded was on the order of 5 trillion. So, just a huge number of possible false findings for the Higgs. Russ: And we're looking for--we are trying to identify the Higgs--a particular subatomic particle. Guest: Exactly. 
So, what they had to do was, given the extreme number of tests, they had to have a very different sort of cutoff for establishing a discovery or establishing statistical significance. And instead of using the two sigma rule, they used a five sigma rule. So, way different from what we're used to. And this reflects just the number of tests that were actually being conducted. Russ: So the idea--try to give me the intuition of this. I'm going to collide a lot of things--I'm going to collide particles many, many times, trillions of times. And we know that's going to generate lots of false positives, decay signatures that look like the Higgs but are not. Correct? Guest: That is correct. So you have to be really[?] sure. Russ: So, shouldn't I just--isn't the 'really sureness' just the fact that this is easily confused rather than--what do you mean by the number of tests? Guest: Well, really what we're talking about in the Higgs example are the number of collisions that are taking place. And I'm simplifying what they actually did at the collider. There are many different tests actually going on. But the fact is that sometimes you would get a signature that looked like the Higgs but really wasn't the Higgs. So, in order to actually--and nobody had actually discovered the Higgs. This was the first opportunity. So, they had to be really sure that they were not being fooled by the random sort of occurrences of something that looked like the Higgs. So, to do this they had to be, as I say, five-sigma confident that it really existed. Russ: Now, it's a little bit, for non-statisticians, 5 versus 2, is actually a little bit misleading, right? That sounds like, okay, so it's a little more than twice as big; so we're requiring the result to be a little more comfortable. But as we move numbers of sigmas away from zero, it's a much smaller chance than say a little more than twice as likely that it's by chance? Right? Guest: Yeah. You mentioned the 5% or the 95% confidence. 
That means that 1 out of 20 will be a fluke. So, for a 5-sigma, it's 1 divided by 1.5 million. So, this is a very small-- Russ: But it could still be a fluke. Guest: Yes, it could be. Russ: It's a weird thing, because you'd think you either see it or you don't. But I guess it's elusive, and there are things that look like it but aren't it, is what you are really saying.
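The gap between two sigma and five sigma can be made concrete with a short calculation. This is a minimal sketch, assuming a normally distributed test statistic and a two-sided test; the function name is illustrative and does not come from the paper.

```python
import math

def two_sided_tail_probability(sigmas: float) -> float:
    """Chance that a standard normal statistic lands at least `sigmas` away from zero."""
    # erfc(z / sqrt(2)) is the two-sided tail area of the standard normal distribution.
    return math.erfc(sigmas / math.sqrt(2))

for sigmas in (2, 3, 5):
    p = two_sided_tail_probability(sigmas)
    print(f"{sigmas}-sigma: p = {p:.2e}, about 1 in {1 / p:,.0f}")
```

Under these assumptions the two-sigma tail comes out a little under 5% (exactly 5% corresponds to about 1.96 sigma), while the five-sigma tail is smaller than one in a million, the same order of magnitude as the figure quoted in the conversation. A one-sided test would halve each probability.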
8:40Guest: Yeah. So can I give another example that I think is going to further intuition? Russ: Sure. Guest: It's the famous Jellybean comic. Have you seen that before? Russ: I have; and we'll put a link up to that. It's one of my all-time favorites. So, yeah, describe it. Guest: So, this is a famous cartoon called "Significant." Russ: It's XKCD--the cartoon is in a series. Guest: Exactly. So, somebody makes a statement: I think that jelly beans cause acne. So, they said, okay, scientists: Go investigate. So the scientists go and do a trial. And the trial would involve I guess giving some people jelly beans and other people without the jelly beans. And then they would test to see if there was a significant difference between, I guess, the number of pimples for people that took the jelly beans and ones that didn't. And the test comes back, and there's no difference. There's no significant difference between the two. So basically the next frame of the comic is, well, maybe it's not the jelly bean itself, but the color of the jelly bean. So, then, the comic goes and the scientists test different colors of jelly beans. So, again, the trial would be, let's say a group of people get some red jelly beans and others don't get any jelly beans. And they go through all of the colors. So, red, there's no effects, there's no difference in the amount of acne. And orange, yellow, purple, brown, black, ... Russ: Fuchsia, mauve. Guest: Exactly. The 20th test is green. And they find that there's a relation with the green. So they declare that green jelly beans cause acne, and that's what actually gets into the headlines of the media the next day: Green Jelly Beans Cause Acne. Russ: With a 95% chance that it's not random. That is, only a 5% chance that it's random. Guest: That is true. So that's what the significance means in this particular case. 
So, everybody knows that there shouldn't be a significant effect, because it doesn't make any sense, what they're actually doing; the original test was the correct test: jelly beans versus no jelly beans. But, the more tests that you actually do, it's possible to get a result that is just a fluke. It's something random. Russ: And if you do 20, you expect one of them to, by chance, show that relationship. Guest: That's right. So that's why, when you do 20 tests, you can't use the 2-sigma rule. So by the 2-sigma rule, if you try 20 things, then the odds are that something is going to come up as a fluke, as a finding that really isn't a true finding. So, if you are doing one test--so the original test that they did, in the comic, where they tested a group of people; they gave them jelly beans and the other group, no jelly beans. In that test, two sigma is fine. That's a single test. But once you start doing multiple tests then you run the risk that something is going to show up as a fluke, and two sigma is not good enough.
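The arithmetic behind the jelly bean example can be sketched as follows. The sketch assumes the 20 tests are independent and that none of the colors has any real effect. The Bonferroni adjustment shown is one standard, conservative way to tighten the cutoff; it is included here as an illustration and is not named in the conversation.

```python
def chance_of_any_fluke(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` independent no-effect tests passes at level `alpha`."""
    return 1 - (1 - alpha) ** num_tests

def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test p-value cutoff that keeps the overall false-positive rate at or below `alpha`."""
    return alpha / num_tests

for n in (1, 20):
    print(f"{n:>2} tests: expected flukes = {n * 0.05:.2f}, "
          f"P(at least one) = {chance_of_any_fluke(n):.1%}, "
          f"Bonferroni cutoff = {bonferroni_threshold(n):.4f}")
```

With 20 colors the expected number of flukes is one, and the chance of at least one is roughly 64%, which is why a single two-sigma result out of 20 tries carries little weight.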
12:33Russ: So, let's take the example you give in your paper, which is really beautifully done. And although there are some technical things in the paper, I think the average person can get the idea of what you've done there, which is: You present, at the beginning, a particular trading strategy. Meaning a way to "beat the stock market," make a lot of money. And the strategy that you show, you of course tested over a long period of time, because people know that, in a short test, maybe by luck you would just do well. But it's over many years. And although it doesn't do so great in the first year, it then does very, very well consistently, including through the financial crisis of 2008, when many people lost their shirts and other pieces of clothing. And it looks like a fantastic strategy. And you evaluate that with the Sharpe Ratio. And talk in general terms if you can about what the Sharpe Ratio is trying to measure as a way of evaluating in particular stock trading--investment--strategy. Guest: So, the Sharpe Ratio is basically the excess return on the strategy. Just think of it as the average annual return on the strategy divided by the volatility of the strategy. So, the higher the Sharpe Ratio, the more attractive the strategy is. And, indeed, there's a direct link, a direct relationship between the Sharpe Ratio and the t-statistic that we were just talking about. So, they are mathematically linked, and a high Sharpe Ratio means a high t-statistic. Which means that the strategy is a strategy that generates a return that is significantly greater than zero. Russ: And that's actually relative to a so-called risk-free return--Treasuries? Guest: Yes. Usually you subtract out a benchmark, so just a risk-free. Russ: And I might want to be comparing it to, say, a different benchmark. Right? Say an index mutual fund, which a lot of people hold. 
I'm often interested in, Did this strategy, this manager or this different kind of investment pattern, did it outperform the S&P 500 (Standard and Poor's 500-stock index)? That's not risk free but it's relatively cheap, low cost, because it's automated. Is that correct? Guest: That is correct. So it's often used to look at excess performance. That is a strategy: you can think of investing in somebody who's got a certain return stream and then think of it as shorting the S&P 500 futures. And that's a strategy on its own. And the question is: Do you get a return, an average return on that strategy, that is significantly different from zero? Russ: Say that again? Guest: Is significantly different from zero. So, that's basically, if it's different from zero--if it's above zero, that is an indication of skill: that the strategy actually has something that the market doesn't recognize and leads to some positive return, on average. And that's what we all seek. We seek to beat the market. Russ: I'd say it in a different way. We are very--it's easy to be seduced. We do seek it, but we also, we desperately would love to have sort of that inside path, the secret strategy: 'The suckers, they're just, they're accepting that mediocre return; but I've got the genius advisor running my money, giving me financial advice, and so I'm making a premium.' We have a real urge to have that. Guest: Yes. We definitely want to be better than the average. And this is a prime target. And we want to allocate our money to managers that we believe are skilled. And skill means that you can outperform the market. Russ: And one of our lessons today, for me, thinking about these issues is how difficult it is to measure skill. So let's--in your particular example, this strategy which you start off the paper with, it has a significant Sharpe Ratio, right? It's much better than the average return. Guest: It looks really good. 
You mention that there was a bit of a drawdown in the first year, but it wasn't really that much--like 4%--and then it performs really well, all the way through the end of the sample. You look at it--it is significant. It is like 2.5-sigma. So it means that with the usual statistical test, you would declare this strategy to be true. And that this is something that actually does beat the market. Russ: And here I am, naively investing my portfolio in a lot of index funds, and I obviously should switch. I'm losing money. I'm a fool, because I should be doing this. Guest: That's what it looks like. Russ: Yeah. But explain how you generated that fabulous strategy and why it's a bad idea. Or at least not significantly proven to be a good idea, even though it's way more than 2 standard deviations above the likelihood that it's random. Guest: Sure. So the opening panel of my paper shows this great strategy. Very impressive, 2.5-sigma strategy. The second panel of the paper shows that strategy, plus 199 other strategies. And it turns out that what I did was I generated random numbers. Russ: Say that again? I'm sorry. You cut out there for me, anyway. Guest: I generated random numbers. Basically it's completely--there's no real data. I generated a series of random numbers, with an average return of zero and a volatility that mimicked the S&P 500. And on this graph, I plot the cumulative returns of these 200 strategies. And you can see that on average, the 200 deliver about a zero return over the 10-year period. But on the tails, you can see the original strategy that I presented, that had a 2.5 sigma, that did really well. And you can see on the other side the worst strategy, which had a 2.5 sigma below zero. So, basically, what appeared to be a great strategy, was purely generated by random numbers, had nothing to do with beating the market. And again, this is a situation where you've got 200 random strategies. 
Some of them are going to look significant when they are not significant. Every single one of these strategies by definition had zero skill. Because I fixed the return, when I'm simulating the numbers, to have an average of zero.
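The experiment described above can be approximated in a few lines. This is a sketch under stated assumptions, not the paper's code: monthly returns are drawn from a normal distribution with a mean of zero, the 15% annual volatility is a stand-in for "a volatility that mimicked the S&P 500", and the seed and function name are arbitrary.

```python
import math
import random
import statistics

def simulate_t_stats(num_strategies=200, years=10, annual_vol=0.15, seed=0):
    """t-statistic of the mean monthly return for each purely random, zero-skill strategy."""
    rng = random.Random(seed)
    months = years * 12
    monthly_vol = annual_vol / math.sqrt(12)  # assumed volatility, scaled to monthly
    t_stats = []
    for _ in range(num_strategies):
        returns = [rng.gauss(0.0, monthly_vol) for _ in range(months)]
        mean = statistics.fmean(returns)
        stdev = statistics.stdev(returns)
        # Equivalent to the annualized Sharpe Ratio times sqrt(years).
        t_stats.append(mean / (stdev / math.sqrt(months)))
    return t_stats

t_stats = simulate_t_stats()
print(f"best t-stat:  {max(t_stats):+.2f}")
print(f"worst t-stat: {min(t_stats):+.2f}")
print("strategies with |t| > 2:", sum(abs(t) > 2 for t in t_stats))
```

Because the benchmark is zero here, each t-statistic equals the strategy's annualized Sharpe Ratio multiplied by the square root of the number of years, which is the direct link between the two measures mentioned earlier. Every strategy has zero skill by construction, so any that clear the two-sigma bar do so by luck alone.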
20:35Russ: Let's do one more example, then we'll get to what the implications are. So, the other example you give, one of my favorites, is--I'm going to use the football example. I get an email from a football predictor who says, 'I know who is going to win Monday night. I know which team you should bet on for Monday night football.' And I get this email, and I think, well, these guys are just a bunch of hacks. I'm not going to pay any attention to it. But it turns out to be right; and of course who knows? It's got a 50-50 chance. But then, for the next 10 weeks he keeps sending me the picks, and I happen to notice that for 10 weeks in a row he gets it right every time. And I know that that can't be done by chance, 10 picks in a row. He must be a genius. And of course, I'm a sucker. Why? Guest: Yes. So this is a classic example. So let me set up what actually happens. So, let's say after those 10 weeks in a row you actually subscribe to this person's predictions. And then they don't do so well, after the 10 weeks. And the reason is that the original strategy was basically: Send an email to 100,000 people, and in 50,000 of those emails you say that Team A is going to win on Monday. And in 50,000 you say Team B is going to win on Monday. And then, if Team A wins, the next week you only send to the people that got the correct prediction. So, the next week you do the same thing. 25,000 for Team A, 25,000 for Team B. And you continue doing this. And the size of the number of emails decreases every single week, until after that 10th week, there are 97 people that got 10 picks in a row correct. So you harvest 97 suckers out of this. Russ: Who are willing to pay a huge amount of money, because you've got inside information, obviously. And I can make a fortune, all on your recommendation. Guest: That's what it looks like. And the fact is, that this is basically a strategy of no skill. Basically, 50-50 every single week. There's no skill whatsoever. But it looks like skill. 
So, again, when you realize what is actually going on, you can't use the same sort of statistical significance. Because, in the usual case, to get 10 in a row, that is highly significant. But, given what you know has happened, it can't be significant. It's exactly what you expect. Russ: So, that leads us to the deep question: Is Warren Buffett a smart man? I mean, he is called the 'Sage of Omaha.' He's done very, very well. He makes a lot of money relative to his competitors. Berkshire Hathaway, which is his stock version of his portfolio, is a wild success. So obviously he's a genius. True or false? Guest: So, I've not actually studied the Berkshire Hathaway data. So, I'm not going to make a judgment on it. But maybe I will. It's a good idea for my research. So, this is--when you've got 10,000 or more managers, some of them are going to look really good, year after year, purely by chance, just because you've got over 10,000 managers. So, they could be monkeys throwing darts at the Wall Street Journal. So, this is exactly the same situation that I'm highlighting in my paper. That many managers will, potentially for 10 years in a row, beat their benchmark. And it will be a result of luck, that there will not be any skill. And this is also important: That you could have many managers that are skilled, that are excellent managers-- Russ: Yeah, the flip side. The flip side is hard to remember. Guest: Yeah. This is very important. Because often your manager doesn't perform as well as you wanted, for, let's say, 2 or 3 years in a row. And then you ditch that manager. And that's a mistake. Because it's possible that that manager is a skilled manager and basically suffered from some bad luck. It's the randomness that can get you. So, mistakes are made on both sides. And that's really what I'm addressing in my research.
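The email scheme and the manager example both come down to repeated halving. The sketch below reproduces the 97 figure from the conversation and applies the same logic to 10,000 managers, assuming each one independently has a 50% chance of beating the benchmark in any year; that probability and the function names are illustrative assumptions.

```python
def surviving_recipients(initial: int, weeks: int) -> int:
    """Recipients who have seen only correct picks after `weeks` rounds of the mailing scheme."""
    remaining = initial
    for _ in range(weeks):
        remaining //= 2  # only the half that received the winning pick is mailed again
    return remaining

def expected_lucky_managers(num_managers: int, years: int, p_beat: float = 0.5) -> float:
    """Expected count of zero-skill managers who beat the benchmark every year by chance."""
    return num_managers * p_beat ** years

print(surviving_recipients(100_000, 10))      # 97
print(expected_lucky_managers(10_000, 10))    # just under 10
```

In both cases the survivors look highly significant when viewed alone, but they are exactly what pure chance predicts once the starting population is known.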
25:40Russ: So, this--it raises a very tough question, and it's a question that's pervasive. I may bring up some other sports examples in a minute. But just in case I don't, you know, people will talk about so-and-so is the greatest coach of all time; so-and-so is the greatest quarterback of all time. Or that Campbell Harvey is the best finance professor of all time. Because of a variety of factors. And of course there's a random element in every aspect of this. So it leaves you with the uneasy--a thoughtful person is left with the uneasy feeling that the normal ways that we assess quality are deeply flawed, because they are a mix of luck and skill. So, where does that leave us? Guest: Well, the first thing is you need to realize the impact of this randomness. And you are correct--that there is so much that goes on that we attribute to skill that isn't skill. And there might be a sports example where somebody is on a so-called hot streak. And again, it could be just purely random--that you've got connecting like 10 baskets in a row, or something like that. And it's really important to separate that out. So, how likely is it that something like this could happen? You'd be surprised at how likely it actually is. So, this is definitely the case. Yes it's true, even in scientific work: it is possible to make a discovery--like a real discovery--but it's random. You just get lucky in terms of what you actually do. In other fields of scientific research, to actually--and this is kind of interesting--the first person to publish, let's say a medical discovery, is often, they call it that there's a winner's curse. And basically what that means is that the first person to publish it, given all of the data mining that actually went on, it's likely that the effect is overstated, number one, or number two, doesn't exist. So when people replicate the study, they find that the effect is a lot smaller. Russ: try to-- Guest: Yeah. 
But this happens all the time, that it is so difficult to separate the skill from the luck. And I'm afraid that we don't really think it through. And we use these rules that are pretty naive.
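How likely a streak is can be worked out directly rather than guessed. The sketch below computes the probability of seeing at least one run of consecutive successes in a sequence of independent attempts. The 50% shooter and the 1,000 shots are hypothetical numbers chosen for illustration, and the independence of attempts is itself an assumption.

```python
def prob_streak(num_trials: int, streak_len: int, p_success: float) -> float:
    """Probability of at least one run of `streak_len` straight successes in `num_trials` attempts."""
    # state[j] = probability of a current run of j successes with no full streak so far
    state = [1.0] + [0.0] * (streak_len - 1)
    hit = 0.0  # probability that a full streak has already occurred
    for _ in range(num_trials):
        new_state = [0.0] * streak_len
        for run, prob in enumerate(state):
            new_state[0] += prob * (1 - p_success)   # a miss resets the run
            if run + 1 == streak_len:
                hit += prob * p_success              # the streak is completed
            else:
                new_state[run + 1] += prob * p_success
        state = new_state
    return hit

for shots in (10, 100, 1000):
    print(f"{shots:>4} shots: P(10 in a row) = {prob_streak(shots, 10, 0.5):.4f}")
```

The probability is tiny for exactly 10 shots but grows steadily with the number of attempts, which is the sense in which a long streak by someone with many attempts is less surprising than intuition suggests.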
28:21Russ: So, Nassim Taleb has been a guest on EconTalk a number of times. And of course he's associated with issues related to these questions in his book Fooled by Randomness and in The Black Swan. I want to talk about black swans for a second. We're talking about: Is this a good strategy relative to other strategies? Another question we might ask as investors or as decision-makers is: What's the downside risk? I want to be prudent. I don't want to take an excessive risk. I don't mind sometimes if I make a little bit less. But what I don't want to have is, I don't want to be wiped out. I don't want to have a catastrophic result. So, I worry, of course--I should worry also, not just about the average return but about that left-hand tail. Of a really catastrophic event. Can you talk a little bit about the role that assumptions play in assessing strategies, typically, in the financial literature? Taleb has argued that normality, the persistent presumption of normality, is a very destructive assumption because, although it makes things more tractable, it often ignores these left-hand tail events when they are so-called "fat"--when they are more likely than they would be in a normal distribution. Can you talk about that for a little bit? Guest: Sure. And this is actually part of my other research string, so it's very convenient that you mention this. So, we've been talking about something that is called the Sharpe Ratio. And that's basically the excess return divided by the volatility. So, it turns out that this is definitely not the sort of metric that I would recommend using for evaluation of a strategy. And the reason is the following: That it does not take into account the tail behavior. So, it kind of--it assumes, directly, that the things that happen on the downside look approximately the same as the things that happen on the upside. So, it's symmetric. 
And often, you get this situation where you look across different investment styles, and you see that some investment styles have very high Sharpe Ratios, and other investment styles have very low Sharpe Ratios. And the reason is not that one investment is any better than the other. It's that there is a different sort of tail behavior. So, for example, the low Sharpe Ratio strategy might have the possibility, once in a while, of a big positive outcome. Like a lottery sort of payoff. Whereas the high Sharpe Ratio strategy, on average it does really well; but then there's a possibility of a catastrophic downside. So, the Sharpe Ratio is only useful if you are evaluating, as you said, relatively normally-distributed returns. Because it does not take into account the downside or upside differential. So that's something that I do not recommend. Russ: But it's a fascinating thing, because as an investor--and of course this comes up in social science as well--I don't live forever. I don't get an infinite number of draws from the urn; I don't get to play roulette for a million years. So, I get the particular string that comes out from t0 to t14 in my time at the table. And of course that's a particular string. You can come back from Las Vegas and make a lot of money and think you are a great card player, when in fact you just happened to sit in on that string. And when you think about those asymmetric returns, I don't get the average return. I get whatever that happens to be in that time period when I'm holding that asset. I think one of Taleb's insights is that we think so often about normality, and so many things in life are normally distributed--height and weight and other things--that when we deal with these kind of problems we really don't have the apparatus, the intuition that we need to have. Guest: Yes. I totally agree. I think that Finance has done a particularly poor job. 
Indeed, if you look at the textbooks, the classic textbooks in Finance, there's very little mention of tail behavior. So, I'm always an advocate of holding a diversified portfolio, but that portfolio needs to be diversified over a number of dimensions. So, we usually think of getting that portfolio with the highest possible projected return for some level of volatility. And my research points to: You need to take this third dimension into account, which is, the technical term for it is skewness. So, something that's negatively skewed has got a downside tail that you don't like, and something that's positively skewed, like a lottery, has got a positive tail; and you need to take that into account when you form your portfolio. Because people have a preference. They prefer a strategy or a portfolio or an asset return that's got the positive tail. We want the big payoff. And we've got a distinct dislike for assets that have catastrophic downsides. So, that needs to be taken into account in designing an optimal portfolio strategy. And that is what I pushed in an article I published in the Journal of Finance a number of years ago. Russ: Yeah. If you put a dollar on red at the roulette wheel and you lose it, it's not a big deal. If you put your fortune on red--or a better way to say it is if you put a dollar on red and they can then go into your bank account, if red hasn't come up a certain number of times--you don't get that second chance. The problem with the catastrophic thing is you can't bounce back. By definition you are in a hole that you either literally can't come back from--you are bankrupt--or, you are going to require a very, very, very long period of time to make your money back. Guest: Yeah. That's true. That's exactly how it works. Can I tell you a true story, something that happened to me recently? Russ: Sure. 
Guest: I get a phone call from a Duke graduate, who actually went to the Business School, and he wanted me to basically endorse a product that he'd been running for a few years. He was managing $400 million, a simple strategy, that he was buying S&P 500 futures, so he was kind of holding the market. But then, he was also adding on some options where he was writing or selling options that were out of the money for calls and puts. And when you do that, you collect a premium. And for the 5 years that he was operating, that premium led to an extra return. So it looked like he was beating the market. So every year he had about 200-300 basis points, or 2-3% above the S&P. And he basically said, Well, this is a great strategy; are you willing to endorse it? And I basically said to him-- Russ: 'Are you out of your mind?' Guest: 'You didn't take my course. Because you would never ask me this question.' So, think about what that strategy is doing. So, that strategy, when the market goes up a lot, that means that those options are valuable and you give up your upside. So, you have to pay off the person that you sold the option to. So you give up your upside in extreme up movements. And on the downside, if the market goes down, then you need to pay. So your downside is magnified if there is a big move in the market. So think about, so basically this person has changed the payoffs of the S&P 500, cut off the upside, and magnified the downside. And this extra return that they are getting over the 5 years, well, they are lucky over the 5 years. You haven't seen a big move up or down. So this is a great example of tilting the portfolio more toward this negative skewness. And when you've got negative skew, that means that the expected return should be higher. Because people don't like this downside possibility. 
So this kind of brings it together, that anything like this, whether it's an option or an insurance policy--insurance policy you pay a premium for and that's kind of a negative return for you, but if the fire happens and burns your house down, then you get the payoff. So you are willing to protect that downside. And I think that we don't really think this through enough in the way that we approach our portfolios.
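The payoff of the option-writing strategy in the story can be sketched with a one-period model. Everything numeric here is a hypothetical simplification: strikes are expressed as market moves of plus and minus 10%, the combined premium is 2.5% (in the range quoted in the story), the options expire at the end of the period, and market returns are drawn from a symmetric normal distribution with 15% volatility.

```python
import random

def overlay_return(market_return, call_strike=0.10, put_strike=-0.10, premium=0.025):
    """Return from holding the market and selling one out-of-the-money call and put."""
    call_loss = max(market_return - call_strike, 0.0)  # upside owed to the call buyer
    put_loss = max(put_strike - market_return, 0.0)    # extra loss owed to the put buyer
    return market_return + premium - call_loss - put_loss

def sample_skewness(values):
    """Third standardized moment: negative when the left tail is the longer one."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5

for move in (-0.30, -0.05, 0.05, 0.30):
    print(f"market {move:+.0%} -> strategy {overlay_return(move):+.1%}")

rng = random.Random(0)
market = [rng.gauss(0.0, 0.15) for _ in range(10_000)]
strategy = [overlay_return(r) for r in market]
print(f"skewness: market {sample_skewness(market):+.2f}, "
      f"strategy {sample_skewness(strategy):+.2f}")
```

In quiet periods the strategy beats the market by the premium, which matches the five-year track record in the story. In a large move the upside is capped and the downside is magnified, so the simulated return distribution is negatively skewed even though the underlying market returns are symmetric.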
38:33Russ: Yeah, well, it's again, not just our portfolios. It's many, many things in life where we are evaluating the effectiveness of some strategy, and we don't like to think about it. It's a cherry-picking example: I want to think I'm doing the right thing and I look at how well it's going and then I can brag to my brother, hey, I made a killing. But I don't realize that the 5 years that I've been observing the data are not typical. And of course they never are. Almost by definition. There are things going on in any 5-year period that are distinctive, that you have to be careful about. So, to me, the lesson in this is you have to, when you are trying to evaluate quality of anything you have to have some intuition, what I would call wisdom, which is very hard, about the underlying logic of the strategy that the data itself don't speak. And I think this is a very dangerous problem in economics generally. So let me take it there and we'll look at some of the implications. So, Ed Leamer's critique of econometrics and statistical significance is very similar to yours. Which is, if you run a lot of regressions, if you do a lot of statistical analysis and try all these various combinations of variables that you hope might show some significance, and then you find it, the classical measure of the t-statistic greater than 2 is meaningless. Guest: Yes. Russ: And the temptation to say, 'I found something' is so powerful, because you want to get published. And as you point out from your previous example, a lot of the findings aren't true. They don't hold up. Guest: So, it's not just getting published. My critique applies to people that are designing these quantitative strategies. I was--again, this is a true story. A number of years ago I was shown some research, at a high-level meeting, at one of the top 3 investment banks in the world. 
And this person was presenting the research, and basically he had found a variable that looked highly significant in beating the market, with a regression analysis, as you said. And it turned out that this variable was the 17th monthly lag in U.S. Industrial Production. Russ: Yeah. I've always known that's an important factor. But that's the beauty of his approach: nobody knows it, but by his fabulous deep look at the data, he uncovered this secret relationship that no one else knows. Guest: So, 17th lag? That seems a little unusual. So, usually we think of maybe the 2nd because one month the data isn't available because of the publication delay. Maybe the 3rd. But the 17th--where's that coming from? And then he basically said, 'Well, that's the only one that worked.' So, it's the jelly bean example. There's a paper that's circulating that looks at the performance of stocks sorted by the first letter of the stock name. So, they'll look at the performance of all the companies that begin with the letter A, B, C, D. And one of them is significant. Russ: One of the 26. How shocking. Guest: Exactly. So this stuff happens all the time, whether it's in a very reputable investment bank or whether it's within academia. People basically are not adjusting for the data mining that is occurring.
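The "17th lag" story can be imitated with data that contain no relationship at all. In this sketch both series are pure noise, 24 lags are tried, and the best one is reported. The sample length, the number of lags, and the seed are arbitrary assumptions, and the t-statistic shown is the standard one for a single-variable regression slope.

```python
import math
import random

def correlation_t_stat(x, y):
    """t-statistic for the slope of a one-variable regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1 - r * r))

rng = random.Random(1)
months = 240
predictor = [rng.gauss(0, 1) for _ in range(months)]  # stand-in for a macro series
returns = [rng.gauss(0, 1) for _ in range(months)]    # unrelated to the predictor

t_by_lag = {}
for lag in range(1, 25):
    # Pair the predictor from `lag` months earlier with the current return.
    t_by_lag[lag] = correlation_t_stat(predictor[:months - lag], returns[lag:])

best_lag = max(t_by_lag, key=lambda k: abs(t_by_lag[k]))
print(f"best lag: {best_lag}, t-stat: {t_by_lag[best_lag]:+.2f}")
```

Reporting only the best lag hides the other 23 tries. Judging that one t-statistic against the single-test cutoff of 2 repeats the jelly bean mistake.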
42:36Russ: Yeah. So, Leamer's suggestion is to--he has more than one, but one of his early suggestions was to do some sensitivity analysis, basically, look at all the combinations of the variables that you might look at and if you find that under most or all of them there's a narrow band of effect of the one that you're trying to claim is the key variable--say, an example he gives in his paper, a beautiful example, is: Does capital punishment deter murder? And, does the threat of capital punishment induce people to stop killing other people? And he shows it's easy to find an analysis that shows that it does. And then of course it's easy to find one that shows that capital punishment increases murder--perhaps because, who knows why, more brutality in the culture, in our society. But it all depends on what you put on the right-hand side, what different variables you might include. But if after you did that, you found that it always deterred, or never did, then you'd feel more confident. So, you have a suggestion in Finance. What are your suggestions? Guest: Well, there are a number of suggestions that I explore in my research. One suggestion is to actually ditch the 2-sigma rule and move the cutoff higher just like we do in physics or genetic science and things like that. There are other approaches, too. The most popular approach is the so-called 'out of sample' approach, where you actually hold out some data; you fit your strategy to the past and then apply it to, let's say, the most recent 5 years of data to see if it actually works. And that is a long-established method. Russ: It sounds very good. Guest: It sounds good, but it's got problems. For example, you actually know what's happened in the past. So, if we hold out the last 5 years, well, we couldn't[?] remember that we had this major global financial crisis? 
So the researcher knows that and might actually stick in some variables in the early part of the data that they know are going to work in the other sample. So, that's one problem. The other problem is a flawed scientific procedure where somebody looks at a model in sample and then takes it out of sample; it doesn't work. Then they go and basically re-do the model, removing some variables perhaps or a different method and then try another sample. It fails. And then they just keep on iterating back and forth, back and forth, until something works. And of course that is--you are just asking for the fluke situation. So, the final problem with the so-called out-of-sample technique is that you might fit a strategy over a number of years and then test it in the last 5 years; and the strategy might fail, but it might be a true strategy and it's only because you have so few years, that just by bad luck the strategy fails. So, it is not a panacea to actually go to the other sample method.
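The last problem, a true strategy failing a short holdout by bad luck, can be put in numbers. This is a rough sketch, assuming annual excess returns are independent and normally distributed, and using a hypothetical annual Sharpe ratio of 0.5 (neither assumption comes from the conversation). Under that model the average return over T years is negative with probability Φ(-SR·√T):

```python
from math import sqrt
from statistics import NormalDist


def prob_true_strategy_looks_bad(annual_sharpe: float, years: float) -> float:
    """Probability that a strategy with a genuinely positive Sharpe ratio
    shows a negative average excess return over a holdout of `years` years,
    assuming i.i.d. normal annual returns (an illustrative assumption)."""
    return NormalDist().cdf(-annual_sharpe * sqrt(years))


# Hypothetical Sharpe ratio of 0.5, evaluated over holdouts of different lengths
for years in (5, 10, 20):
    print(years, round(prob_true_strategy_looks_bad(0.5, years), 3))
```

With these illustrative numbers, a real strategy still loses money over a 5-year holdout a bit more than one time in eight, which is why a failed out-of-sample test on a short window is weak evidence against it.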
46:16Russ: So, some listeners will remember an interview that I did with Brian Nosek, the psychologist, a few years back, where he and others in psychology have become worried that some of the more iconic results in psychology are not replicable. And they've tried to replicate them. And some fail, some don't, obviously. But it seems to me a huge, enormous problem. As we say, it's one thing to talk about some particular psychology theory. When we are talking about people losing money for their personal retirement or when we talk, more importantly even about epidemiology where some claim about the relationship between, say, alcohol consumption and health, positive or negative, is going to maybe cost people's lives or save lives. The fact that many of these results don't stand up to replication seems to me to be an enormous problem in our scientific literature. It's a social science problem; it's a physical science problem. Do you agree that we've got a big problem there? That much of this so-called science is not scientific? Guest: I agree that there are problems but let me just elaborate a little bit. Psychology is at the very bottom of the hierarchy of science in terms of publishing results that are not significant. So, what I mean here is that it's rare in psychology to actually publish a paper where you pose a hypothesis and the hypothesis garners no support. And you have a non-result. So, that's very difficult to publish in psychology. So the sort of papers that are published in psychology, over 92% of them are, 'Oh, here's a hypothesis; I did an experiment; and I get support.' So that is a problem. Because it leads to people essentially data mining to find a result and then getting it published. On top of that, when you data mine, it is possible to figure out that it's data mined if you replicate. And in psychology, there is a very poor culture of replication, people not that interested in replication of these experiments. 
And this contrasts, as you said, with medical science--epidemiology is a good example--where somebody actually might data mine the data and publish something, but then a half dozen other people replicate it and find that it isn't a fact. And we actually learn something when that actually happens. In Finance, there isn't a large culture in terms of replication, but there's a particular reason for that. We're not running experiments with human subjects. We're actually looking at data, and the data is a fact. So, if you are looking at the New York Stock Exchange data, you are--everybody's got that data. So if you tell me that this particular value strategy has an excess return over the last 50 years of 5%, well, anybody can go in and immediately, with one line of computer code, replicate that. So there isn't a large culture of replication because it really isn't necessary in terms of what we do. And psychology is a totally different game; and indeed they've had terrible trouble, with people inventing the data of these experiments-- Russ: Yeah, that's another-- Guest: and having to retract. Russ: Yeah, that's a separate. But it seems to me that-- Guest: [?] Russ: [?] problem in Finance and in epidemiology. So, let me lay that case out and you can answer it. It's true everybody has got the New York Stock Exchange data. But that person who runs the 17th lagged Industrial Production variable and proves, using statistical techniques, that it's important, has the issue that, well, is that going to work going forward? That to me is the replication, the equivalent of replication in that model. And similarly in macroeconomics--the cherry picking of sample size, of sample time period, of various variables to prove that Keynesianism works or doesn't work, that monetary policy is crucial or is irrelevant--to me it's just an intellectual cesspool. 
I hate to say it, but I don't see-- Guest: I totally don't agree here, because my paper basically says that you need to adjust the significance for the number of tests. So, that person that ran the regression that the 17th lag of Industrial Production came in as significant--if they adjusted the significance level, given that they ran 24 different tests, they tried 2 years of lags, then that 17th variable is not significant. You would reject that variable. So my paper actually provides a method to avoid some of these mistakes. And again, this is a big deal. It can be somebody's pension money. Russ: Yeah. Guest: Running on a strategy like this. What I'm saying, you need to take into account that we've got, in this particular situation, 24 different things that have been tried. Not 1. And if you do that, then you minimize the chance that some bogus strategy based on a fluke finding is basically allocated to in your pension and you lose your money. So there are ways to deal with this, and my paper actually provides a method to do that. Russ: Yeah, I understand that. But you happen to be sitting in a meeting that was an informal meeting and you were able to ask the question, 'How many times did you run that?' When I see--I'll give you just my favorite example. My favorite example was in epidemiology. There was a paper that showed that--front page of the New York Times; it was an enormous story--that drinking alcohol increased cancer among women. And that's a frightening thing. Obviously you don't want to fool around with that. And unlike many of the journalists, I actually went and got the paper and I read it. And I contacted the researcher. And there were two things in the paper that just didn't get mentioned. One was the fact that they had the cancer history of the population in the sample; they had, 'Did your mother have cancer?' They had that information. But they did not use that in the analysis. I don't know why--since we know there's a genetic relationship. 
I don't know why they didn't use that. But more importantly was how they defined drinking and not drinking. They threw out all the people who didn't drink, on the grounds that people who say they don't drink in a survey maybe used to drink. And then we'd be mismeasuring it. Well, that's true. It's also true that people who say they drink maybe had different drinking habits in the past. And of course once you throw out the non-drinkers, some of them--actually they had worse health. Not some of them--the average non-drinker had more cancer than the people who drank a little bit. So that was awkward. To throw those people seemed to be a rather unfortunate decision. They did it on the grounds that maybe those weren't measured accurately. Of course, others weren't measured accurately, either; they were all based on, say, memory, in this case, or whatever it was. It wasn't a lifetime sample or real-time observations. So, somebody publishes a paper in economics, and they don't tell you how many regressions they ran. Ever. Never. We don't get to see the video of what happened in the kitchen when they ran these tests and when they transfigured the variables and decided that the squared term was the right term. So, it seems to me that in the absence of that, we are really in, many, many of the things we find are not going to be replicated effectively. Guest: So, I 100% agree with you, with what you just said. That it is a cesspool. That, what I was talking about earlier was fixing something relatively straightforward, where you know 24 tests have taken place. And the 17th lag works. So you can adjust for that: it's not significant. But what you are talking about is a broader critique that again, I mention in my research: That it's not just the number of tests. So, the other problems that arise are the manipulation of the data. So, it might be that you start your analysis in 1971 versus 1970. That one year could make a huge difference in terms of your results. 
It might be that you trim outliers out of the data. It might be that you use an estimation method that has a higher chance of delivering a so-called significant result. So it's not just the number of tests, but it's all the choices that researchers make. And it is a very serious problem in academic research, because the editors of the scientific journals don't see all of the choices that have been made. It is also a problem in practical research in terms of the way that people are designing strategies for investors. However--and this is kind of, I think interesting. My paper has been very well received by investment bankers and people designing these strategies. And actually it's interesting because they actually don't want to market a strategy that turns out to be a fluke. Because that means that it hurts their reputation. It reduces the amount of fees that they get. And it really, basically it could reduce their bonus directly. So that actually has a strong incentive in terms of business practice to get it right. So, within the practitioner community, at least, there are strong incentives to reduce the impact of data mining, so that you can develop a good reputation. However, on the academic side, it's not as clear. As you said, there's minimal replication in some fields. And the editors don't see all of the hocus-pocus going on before the paper actually is submitted for scientific review. Russ: Yeah. When you were in that meeting at the investment bank and the person said it was significant and you said, 'Well, how many did you run?' and he said, 'Well, 26, 24', whatever it was, and you said, 'That's not significant': Nobody around the table said, 'So what? Doesn't matter. We'll be able to sell it because it's over 2.' Guest: No. People, I'm sure: They do not want to do this. So that damages the reputation hugely. So, everything is reputation in terms of kind of street finance. And you want to do the right thing. 
You want to have in place a protocol--an explicit protocol--where some investor asked the question, 'Well, how do I know that this isn't due to data mining?' And then what you can do is to point to your protocol, saying, 'We're well aware of data mining. And we actually take the following steps to minimize its impact. We obviously can't get rid of the chance that some findings could be a fluke, but we try our best to minimize that because we want to do the best thing for you, because that's how we make money.'
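One standard way to adjust a significance cutoff for the number of tests is the Bonferroni correction, which divides the 5% level by the number of tests tried. Whether this matches the exact method in the paper is not stated in this part of the conversation, so the sketch below is only an illustration. It also assumes a normal approximation to the t-statistic, and the function names are hypothetical:

```python
from statistics import NormalDist


def bonferroni_t_cutoff(num_tests: int, alpha: float = 0.05) -> float:
    """Two-sided cutoff for |t| after a Bonferroni adjustment, using a
    normal approximation to the t-distribution (an assumption)."""
    per_test_alpha = alpha / num_tests
    return NormalDist().inv_cdf(1.0 - per_test_alpha / 2.0)


def is_significant(t_stat: float, num_tests: int, alpha: float = 0.05) -> bool:
    """True if |t_stat| clears the Bonferroni-adjusted cutoff."""
    return abs(t_stat) > bonferroni_t_cutoff(num_tests, alpha)


print(round(bonferroni_t_cutoff(1), 2))   # single test: the familiar 'about 2'
print(round(bonferroni_t_cutoff(24), 2))  # 24 lags tried
print(is_significant(2.5, 24))            # hypothetical t-statistic of 2.5
```

With 24 lags tried, the cutoff under this correction rises from about 2 to a little above 3, so a t-statistic that clears the usual 2-sigma rule can easily fail once the 24 attempts are counted.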
58:37Russ: So, I was going to ask you--well we won't talk about it but I was going to raise it anyway. I was going to ask you whether you think Mike Krzyzewski, the coach at Duke University's basketball team is a good coach. He's considered one of the greatest coaches of all time. He's won over 1000 games. So, everyone knows he's a good coach. Or Bill Belichick of the New England Patriots. Everyone knows that person's a good coach. It seems to me--of course, there's a random element in those records, that deceives, that's complicated. And when we think about examples like finance or epidemiology, it seems like if we don't come down to the issue of what's really going on underneath the data, what's the model that you have in mind, that you are trying to measure--it seems like we're really lost. The reason I mention that is that, you know, in epidemiology, I don't really know the mechanism by which alcohol causes cancer. I hope some day we'll uncover it. But then just looking at statistical relationships without understanding the underlying biology seems to me to be dangerous. And similarly, if I don't understand why the 17th lagged measured level of Industrial Production is significant--it's not just that it's 26 tests and therefore not statistically significant. It's that: It doesn't make any sense. So, I think ultimately in all these cases when we are trying to assess the component of our outcomes that are due to randomness, we have to have some fundamental understanding of the causal mechanism or we are really at risk. Guest: I agree with that. So, I didn't need to adjust the significance level on the 17th lag of Industrial Production. That model is gone. It's history. And the employee that did it is probably history, also, after these comments. But it might be more complicated. It might be just that you see a strategy. I don't know really what's behind it. 
Because often--and then an investment banker or company might not be willing to reveal the inner workings of the model. Russ: Correct. Guest: So you need to have some sort of statistical method to actually do this. But I agree that the best thing you can do is to ask the question: 'What is the economic mechanism? Why does this work? Tell me the line of causality.' And try to minimize sort of spurious sort of relationships that are often put to the public where people are claiming causality when it's much more complicated. So I think that the bottom line here is that you do need to have a solid economic foundation. You need to have a story. Or you should be very suspicious of the performance of a particular strategy.
1:01:34Russ: So, let's close with a thought on big data. There's an enormous, I think, seductiveness to data-based solutions, to all kinds of things, whether it's portfolio analysis, whether it's health. All kinds of things. We're much better at measuring things and we have a lot more measurements going on, and so a lot of data is being produced. And, there's a golden ring being held out there for us to grab that just, 'If we just use more data we're going to be able to improve our lives tremendously.' And of course, without data, you're in trouble. You don't want to just rely on your intuition. You don't want to just rely on storytelling about which anecdotes or which narrative sounds convincing. You want to measure stuff. To be scientific. Almost by definition. And yet, so easily, we make these kind of mistakes, both in our personal reasoning and individual reasoning and also on a professional level. And we talk about these kinds of statistical analyses. So, what are your thoughts on this romance that we have for data and evidence? Do you think it's a legitimate one? Or is it a very mixed bag--which is my bias? Guest: Um, well, I think that there is an upside and a downside. So, I guess I concur that it is a bit of a mixed bag. I think that big data is not going to go away. That, it's here to stay. I think it's very important, indeed it's part of my paper, to emphasize that we need to kind of evaluate the findings from data mining in a different way than we've evaluated findings in the past. So, that's clear. Because some of these findings are going to be random. And we want to get rid of the flukes. What worries me is that some of these findings that are found with the big data and the data miners, that people look at the findings and then only after the finding, they concoct a story or a theory, develop ex post some intuition, as to why this should work. And I think that's dangerous. I prefer to work on the basis of first principles: what is reasonable. 
So, you can concoct a ridiculous story about the 17th lag of Industrial Production--that is, something to do with some semi-annual, seasonal that exists and the peaks and the third semi-annual period. That's not what we need. That doesn't advance us in terms of scientific knowledge. So, I worry about the cart being put in front of the horse.

Comments and Sharing





COMMENTS (41 to date)
Buzz writes:

How can you study this field and *not* have an opinion on Warren Buffett ?

Mark K. writes:

Good discussion on a number of issues.

Campbell's protocol to increase the significance needed would certainly be a step forward, but as Russ said, academics often fail to use anything close to best practices.

While I normally agree with Russ's point that we should understand the underlying mechanisms in order to decide if a hypothesis is true, there are exceptions.

Taleb actually provided one himself, I think in the Black Swan. Apparently, epidemiology found out about the benefits of hand-washing well before medical science explained it. A lot of lives could have been saved by going with the (good) data before we had an explanation.

Second, my understanding is that the hedge fund Renaissance Technologies has made returns clearly beating the market by detecting patterns no one else has found, for which they have no story - though I'm sure there's a mathematical reason they think they're significant.

I think both Renaissance Technologies and Buffett are 9-11 sigma events. Since we haven't had trillions of investors competing, it doesn't seem likely to be luck.
It goes back to the old debate - how are markets efficient if everyone buys the market? To make markets efficient, you need arbitrage; to get people to do arbitrage, you need to offer returns above the market.
So some people do make excess returns from skill, but it's still often next to impossible to tell who they are in advance.

David R. writes:

Most introductory statistics texts call the chapter "hypothesis testing", but don't say too much about what a hypothesis is. Perhaps we need more clarity that a hypothesis has to be something more than 'if I do enough regressions I'll find some relationships'. Alternatively, or in addition, there might usefully be chapters on 'how to identify interesting relationships when you don't have a hypothesis' - i.e., let's admit folk WILL mine the data, so let's give them better thinking tools to avoid fooling themselves, or being fooled by others. Big data and its associated tools can uncover more subtle signals in the noise, but it can also generate spurious results faster than ever before.

rhhardin writes:

I've long complained about 1 in 20 scientific studies being wrong, in the thousands of discoveries each year. That's quite a few incorrect contributions to human knowledge.

But there's a deeper problem as well, that the uncertainty probability tries to run an implication backwards.

Only if the computed statistic follows the law that we assume it follows on this data does the probability equal .95 etc.

But that's not known.

Trying to prove it does runs into the same difficulty with another statistic, and so forth to any level.

At bottom, it's talking through your hat.

Even rank statistics have the problem, say owing to nonindependence.

Duncan Earley writes:

Warren Buffett's strategy is by definition long term. That makes it hard to evaluate in the short term. For example he holds around 9% of IBM's stock. For the last 40 years IBM has been doing great (Buffett bought in 2011 I believe), but right now I wouldn't bet on them existing in 10 years. At what point he decides to sell his IBM stock is key to evaluating final returns. I'm sure many other of his holdings are in the same position.

So basically it could just be luck up until this point.

econfan writes:

I find it quite strange that the guest states that the p value is the chance that the finding is a fluke.

This is exactly the interpretation that intro statistics tries to avoid students learning.

A frequentist p-value is rather the chance that one would get an observation appearing as strong, or stronger, that it isn't a fluke, when it actually is.

Example: p=0.05 found for men being more fat than women means that, IF men in fact are no more fat than women, we would only 5% of the time expect to find a sample which indicates the opposite as strongly as the given sample.

Robert Morris writes:

I am marking my ballot for one of the best episodes of the year right now.

Entertaining, useful, and enlightening can all be used to describe this episode. It was so good it triggered this, my first ever comment, after being an eager listener for years!

I think it the best layman's explanation of what can go wrong in everyday statistics/performance comparisons I have heard.

Russ and EconTalk, you remain at the top of your game.

Wayne P. writes:

Great episode, I've already used the 17th month lagging indicator as an illustration of potential folly. David R's point on big data is well taken. My concern is that we are going to get more bad decisions, faster, as we democratize analysis via "big data". This risk increases if people don't understand what they are looking at.

gwern writes:

Just so, econfan. Lines like "And for many people that sounds--and many economists accept, that that's like, well, if it's only 1 in 20 then it's probably real." reveal a profound misunderstanding of what p-values are (and prove again Cohen's observation that p-values show the opposite of what we want to know, but we want to know it so much that we will pretend that p-values are it).

The more relevant way would be a Bayesian observation: p-values with cutoffs of 0.05 or so can be interpreted as roughly Bayes factors of 3, that is, if we get a hit for a hypothesis we can triple the prior odds.

What are the prior odds of any particular trading strategy? Well, there are hundreds of thousands of active traders and analysts and algorithms running trying to find profitable trading strategies, who expect to try out hundreds or thousands or even millions of strategies before finding a good one, and the market is known to be highly efficient, so the prior odds must be extremely low; let's say 1/1000 to be generous.

Then upon hitting p=0.05 for strategy X, we do a quick update and triple the odds to 3/1000 or 0.003 or 0.3% probability of X being a real strategy. 0.3% is not very likely.

Looked at this way, it's not remotely surprising that most Xes will fail abysmally out of sample!
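
The odds update described above can be written out in a few lines. This is a sketch using the comment's own illustrative numbers (prior odds of 1/1000 and a Bayes factor of about 3); the function name is hypothetical. It also shows the small difference between posterior odds and posterior probability, which is negligible at odds this low:

```python
def posterior_probability(prior_odds: float, bayes_factor: float) -> float:
    """Multiply prior odds by the Bayes factor, then convert odds to a probability."""
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


# Numbers from the comment: prior odds 1/1000, Bayes factor of roughly 3
p = posterior_probability(prior_odds=1 / 1000, bayes_factor=3)
print(f"{p:.4%}")
```

The result is just under 0.3%, in line with the figure quoted in the comment.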

Greg G writes:

This episode was a great example of EconTalk taking some important technical research and making it very accessible to the layman.

My understanding is that it has been well known to academic specialists in finance for some time that active management dramatically and consistently underperforms index funds after fees. When compounded over decades, as retirement funds often are, the difference can be truly staggering.

We normally expect markets to drive out sellers who overcharge and they normally do. It is a mystery why so many active managers are able to overcharge for underperformance. At this point, that aspect is a question for behavioral economists not finance specialists. I would love to hear a podcast on that topic at some point.

Scott Packard writes:

By the way, you should probably do a mouse-hover over the three green jelly beans in the newspaper headline in the xkcd cartoon mentioned in the podcast. It contains an easter-egg. Enjoy.

https://xkcd.com/882/

Regards, Scott

Richard Fulmer writes:

The investment strategies discussed were all statistically based. Venture capitalists, by contrast, research individual companies, talk to the principals, assess the products and services, and review the business plans. How do their results compare with an "optimal portfolio strategy"?

Daniel Barkalow writes:

Since neural imaging isn't very close to economics, I suspect you may have missed the highly relevant paper titled "Neural Correlates of Interspecies Perspective Taking in the Post-Mortem Atlantic Salmon: An Argument For Multiple Comparisons Correction".

Aside from the entertainment value, it seems to have actually gotten people doing that kind of research to pay attention to the statistical problems with their methods, since there's no more awkward question to get about a research finding than "Have you checked if it's also true of a dead fish?"

jw writes:

On VC returns:

The last study I saw was a few years ago and everything is better after a bull market (a LOT more on this later when I have more time), but long term, VC's lag the market by a small amount. The usual effect of everyone remembering the Googles and forgetting the 20 failures applies in spades (and certainly helps their fundraising...).

Yusko claims that without the internet bubble and the mid 80's, long term returns would be near zero (again, a few years ago).

Of course, this doesn't mean that VC's don't provide value, as it depends on how their return metrics might diversify your entire portfolio. (Whether that value justifies their fees is an entirely different subject...)

Dallas Weaver Ph.D. writes:

Glad to see this issue, which is common to all areas that are doing "data mining" to get the results they want without valid and universal "mechanisms", becoming better addressed.

A good example would be all the "cancer clusters" that were discovered around everything from dumps to industrial facilities to Long Island itself. As researchers varied the diameter and shape of the circle around the target, they would get their "statistical significance" and publish the results, feeding the lawsuits.

The government itself often plays these games in studies used for regulatory purposes, without even having to run them through "peer review" with truly outside experts, who may question "missing relevant variables" or how many hypotheses were tested or what viable mechanism explains the result and whether that mechanism is universal (biochemistry is universal).

This is how you end up with studies showing that sub-parts per trillion exposures to chemical X cause Y,Z problems with some hypothesized mechanism, but the same mechanism must not have applied to the tens of thousands of workers exposed to thousands of times higher concentrations for decades longer exposures that are still around and very much alive.

Mike Gomez writes:

I've noticed some confusion both within myself and in the comments from other listeners about the nature of fat tails. See notes in the latest Taleb episode at 38 min in for when Russ brings up the topic. Here is how I make sense of it:

Fat tails are about how the values of outliers affect our picture of the distribution of a random variable.

As Taleb says, take wealth. Figure out the average wealth per person worldwide without including the wealth of the richest person in the world. Imagine that distribution. Now include the richest person and calculate the average. The new average will be substantially higher than the previous one because wealth is a fat-tailed distribution, meaning that the outliers move the average and raise the standard deviation when they're included. With weight, on the other hand, you could include or exclude the fattest person and it wouldn't change our notion of the variance or central tendency of weights for the human population.

Why does this matter? Because if a distribution is fat-tailed, then the true expected value can't be accurately determined unless you have an idea of how big the tail events can be and how likely they are to occur. Economic theory suggests we should take expected values into consideration when making decisions, but ignorance of fat tails will lead to mistakes if we try to take that advice.

Lio writes:

About the football predictor example: there's a similar example given by John Allen Paulos in his book "a mathematician plays the stock market". This is in the chapter "a stock newsletter scam".

jw writes:

As usual, an interesting discussion.

Prof. Harvey outlined many pitfalls that entrap naïve trading system developers and also most of the lay public when it comes to evaluating trading strategies. All of the topics that he described have been detailed in many trading software user forums and in other financial journals but are largely unknown to the public and unfortunately, many asset managers.

The fact that his ex-student could raise $400M for an option writing overlay strategy is a perfect example. These are classic blow-ups in waiting and it only takes a baby black swan to light a very short fuse. Hopefully, his ex-student returned his capital or changed his strategy before it was too late.

In any event, in this day and age someone allocated a lot of money based on a very dangerous and little understood strategy.

In my experience, few Wall Street professionals understand risk, and even fewer understand systematic trading. In Harvey’s blog, his “Evaluating Trading Strategies” goes over a few more examples, but the primary systematic concerns remain:

- Focusing on returns instead of risk adjusted returns.

- Defining risk as volatility. It isn’t. The Nobel prize committee says it is, but it isn’t.

- Not understanding that returns are non-normal. Sure, academics and a few of the better read asset managers will talk of it and say that they understand it perfectly well, but the math to account for it is really, really hard, so they simply assume it away. Copulas and jump diffusions turned out to be fingers in dikes.

- Not understanding Out Of Sample (OOS) testing. Harvey discusses it, but it is an art. Very few people take into account the potential biases of just glancing at an OOS chart, let alone over testing until OOS becomes In Sample, so kudos to him for bringing this up. (As an aside, try explaining this to the global warming crowd. Their OOS performance has been horrible, so they are “adjusting” data, adding filters, recalculating including the newly In Sample data, and making every naïve system development mistake possible. Unfortunately, someday they will discover that separating the signal from the noise on a geological timescale and then testing it on OOS data – assuming it works the first time – they will have a verified model in a century or two. In the meantime, every dollar spent on GW and green initiatives is being wasted on an unevaluated system not unlike what Harvey describes.)

- Degrees of freedom. A system must limit the number of variables to as few (but no fewer) as possible.

- Drawdown (DD). For some reason, the CTA community has been (rightfully) burdened with this metric forever while equity and bond managers get off scot-free. I have asked multi-billion dollar equity managers what their DD was and they had absolutely no idea. For one I had to define the metric. This is an essential component of risk. Your asset manager knows his 1/3/5/10 year performance by heart and will tell you that “long term, the stock market returns 8%”. But once or twice a decade you wake up and your account is half (or less) of what it was a few months or years ago. The asset manager’s response? “Don’t worry, it will come back, trust the market”. How is that working out for Japan? One day the US will not come back as well (inflation adjusted – another peeve…)

- Tax policies – Believe it or not, modern portfolio theory tells you to put your IRA into the riskiest assets possible. They are more likely to be short term and have larger swings but hopefully, they will return more. If you did that in a taxable account, your highs would be highly taxed and your lows will be tax disfavored by the loss limit. Unintended consequences indeed.

- Bull/bear market testing. Someone doing a short term, bullish strategy that started five years ago and did half IS and half OOS testing would be looking at a world beater strategy he is now ready to put $1B into. Big mistake. Bull and bear markets last a long time, have different characteristics and reduce your data set (maybe your test has three long upswings and three long downswings over a decade; how will it do in a range-bound market if it has never seen one?)
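For readers who, like the equity managers mentioned in the drawdown item above, have not met the metric: a minimal Python sketch of maximum drawdown, the largest peak-to-trough decline of an account. The function name `max_drawdown` and the account values are hypothetical and chosen only to mirror the "your account is half" scenario.

```python
def max_drawdown(equity):
    """Return the largest peak-to-trough decline as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)                    # highest value seen so far
        worst = max(worst, (peak - value) / peak)  # decline from that high
    return worst

# Hypothetical account values: a steady climb, then the account halves.
curve = [100, 110, 121, 133, 66.5, 80, 95]
print(f"max drawdown: {max_drawdown(curve):.0%}")
```

A long-run average return can look identical for two accounts whose drawdowns differ enormously, which is why the two numbers are usually quoted together.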

I am too long winded as it is but will leave you with one last concern. No matter how much testing you do, fundamental market changes MAY ruin it:

- A thirty year bull market in bonds cannot continue forever.

- HFT algos and dark pools (and the sly order types and news peeks that they have exclusive use of for a price) have fundamentally changed the market. It is rigged against individual investors. It is still possible to win, but it is harder and the wins are smaller as they take their tolls.

- Finally, the biggest rigger of all is the Fed. When they start participating in the markets to the extent that they have (LSAP, twists, etc), they fundamentally change the nature of the market itself. This has resulted in abnormally low volatility and completely distorts the markets (what happens when you put a negative risk free rate into a Black-Scholes formula?) When the piper is finally paid, at least there will be plenty of volatility to trade upon…

- (…and for GW, the “assumed to be a constant” sun may decide to go quiet for a few decades or centuries.)
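The parenthetical question above about a negative risk-free rate can at least be checked mechanically. The sketch below is the textbook Black-Scholes formula for a European call and put on a non-dividend stock; the function names and the inputs (spot 100, strike 100, 20% volatility, one year) are assumptions picked for illustration, not figures from the comment.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def bs_call(spot, strike, rate, vol, years):
    """Textbook Black-Scholes price of a European call (no dividends)."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * years) / (vol * math.sqrt(years))
    d2 = d1 - vol * math.sqrt(years)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)

def bs_put(spot, strike, rate, vol, years):
    """European put obtained from the call via put-call parity."""
    return bs_call(spot, strike, rate, vol, years) - spot + strike * math.exp(-rate * years)

# Compare a small positive rate with a small negative one (hypothetical inputs).
for rate in (0.01, -0.01):
    call = bs_call(100, 100, rate, 0.20, 1.0)
    put = bs_put(100, 100, rate, 0.20, 1.0)
    print(f"rate {rate:+.2%}: call {call:.2f}, put {put:.2f}")
```

The formula still returns a price: the discount factor simply rises above 1, which lowers the call value and raises the put value. Whether the model's assumptions mean anything in that regime is a separate question.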

So I am off to de-mathify Prof. Harvey’s “Lucky Factors” formal paper. On scanning it, it doesn’t seem like demeaning is a panacea, but we shall see…

Kudos again for bringing up a topic that few people knew that they needed to care about until you presented it. Prof. Harvey just reinforces your other guests who have debunked the quality of the “studies” that appear in the near infinite number of academic journals now (and inspire the ridiculous headlines we see every day).

steve hardy writes:

Thanks for a good presentation. In the early part of my investment career (about 40 years ago) I developed a number of trading strategies where I tested many variables until I found some combination that worked. Of course when I implemented them with real money they failed. That was the out of sample test. If there are a large number of signals I still believe that OOS testing is a valuable tool. Also, there are performance statistics that take in the higher moments of a distribution. See:

http://www.isda.org/c_and_a/pdf/GammaPub.pdf

[url for pdf file edited per commenter--Econlib Ed.]

Joel McDade writes:

I don't know anything about fractals but I would be real curious what the author's opinion is of Benoit Mandelbrot's book The Misbehavior of Markets.

He claims stock returns are a Levy distribution -- not normal or log-normal.

I completely disagree with him about selling options. Yeah, when I have a loss it can be huge, but at least I'm trading the probabilities and not a random walk.

David Hurwitz writes:

Thank you for another thought provoking episode.

While it might be hard to determine if individual named investors are lucky or skilled, it might be a lot easier to determine if there are skilled investors at all. Let’s go with the assumption that 1 out of 20 investors will get noteworthy results by luck. Thus, out of 10,000 money managers, 500 would be expected to get noteworthy results just by chance. Of course, by random chance you might get 550 or 450, but if it turns out that 1,000 actually got noteworthy results it would be extremely unlikely that they all got those results by luck.
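A rough check of the numbers in the paragraph above, written as a minimal Python sketch. It assumes each manager is an independent 1-in-20 draw (an assumption; real managers hold overlapping portfolios), so the count of lucky managers is binomial. The function name is hypothetical.

```python
import math

def binomial_mean_sd(n, p):
    """Mean and standard deviation of a binomial count with n trials."""
    return n * p, math.sqrt(n * p * (1 - p))

n_managers, p_lucky = 10_000, 0.05
mean, sd = binomial_mean_sd(n_managers, p_lucky)
print(f"expected lucky managers: {mean:.0f} (sd {sd:.1f})")

# How many standard deviations from the mean is each observed count?
for observed in (450, 550, 1000):
    z = (observed - mean) / sd
    print(f"{observed} noteworthy managers is {z:+.1f} sd from the mean")
```

Under these assumptions the standard deviation is only about 22 managers, so 450 or 550 would already be a little over two standard deviations out, and 1,000 would be far beyond anything luck alone could plausibly produce.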

If someone wanted to create a superior research-based academic journal they could have a policy where only studies that had filed with them at the beginning of the research would be published. At the beginning the researchers would state their hypothesis, and the type of results they would expect based on those hypotheses. Of course, they would also disclose funding sources and potential or real conflicts of interest. They would also initially state the types of statistical analysis they would plan to use. They would agree that if they abandoned the study along the way they would have to submit explanations in order to remain in good standing with the journal. They would also submit their data as it is generated. The journal would also have its own statisticians to crunch the numbers for themselves. The journal would publish positive as well as negative outcomes. Also, instead of anonymous peer review, there would be transparent (i.e. available before the public) methodical questioning/cross-examination by qualified experts (in a manner along the lines of a web start-up I am trying to develop whose primary purpose is to create the first effective debates [i.e. capable of disarming easily provable falsehood] on critical controversial issues in a public friendly manner).

--David Hurwitz
twitter: @DavidWonderland

gwern writes:

Mike Gomez:

> As Taleb says, take wealth. Figure out the average wealth per person worldwide without including the wealth of the richest person in the world. Imagine that distribution. Now include the richest person and calculate the average

Yes, let's take wealth...

There's ~7 billion people. Credit Suisse estimates global wealth at $241 trillion+. Forbes 2015 says the richest man is Bill Gates at $79b.

(241000000000000 - 79000000000) / 7000000000 = 34417
(241000000000000) / 7000000000 = 34428
(34417 / 34428) * 100 = 99.96%

Some fat-tailed distribution!

(And besides, as Shalizi likes to point out, most 'power law distributions' actually fit better to log-normal distributions; wealth is no exception.)

jw writes:

In response to the two posts above:

Misbehavior is one of the best books ever written on the markets. Distributions of returns are a lot closer to Levy than Gaussian. It also has a LOT of other good insights.

I believe Taleb's example is Gates walking into a football stadium, not the world, where the effect is much more pronounced (and is also a great example of mean vs median).
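To put the two versions side by side, here is a minimal Python sketch. The world figures are the ones quoted in the comment above; the stadium (70,000 fans, each assumed to hold $100,000) is entirely hypothetical and only illustrates the mean versus median point.

```python
import statistics

WORLD_POPULATION = 7_000_000_000
GLOBAL_WEALTH = 241_000_000_000_000  # Credit Suisse estimate quoted above
GATES_WEALTH = 79_000_000_000        # Forbes 2015 figure quoted above

# Same arithmetic as in the comment above: world average without and with Gates.
mean_without = (GLOBAL_WEALTH - GATES_WEALTH) / WORLD_POPULATION
mean_with = GLOBAL_WEALTH / WORLD_POPULATION
print(f"world mean without / with Gates: {mean_without:,.0f} / {mean_with:,.0f}")

# Hypothetical stadium: 70,000 fans with identical wealth, then Gates walks in.
stadium = [100_000] * 70_000
stadium_with_gates = stadium + [GATES_WEALTH]
print(f"stadium mean before:   {statistics.mean(stadium):,.0f}")
print(f"stadium mean after:    {statistics.mean(stadium_with_gates):,.0f}")
print(f"stadium median after:  {statistics.median(stadium_with_gates):,.0f}")
```

With seven billion people one outlier barely moves the mean; with seventy thousand it multiplies the mean while leaving the median untouched.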


Igor F writes:

The following episode came to mind as I was digesting this excellent discussion:

A colleague recently described watching a film made by a Christian preacher where he states that certain Astrological phenomena have throughout history coincided with significant events in the Jewish calendar. Namely, three "blood moons" (lunar eclipses) that are followed by a solar eclipse and another blood moon have happened during the Exodus, during the Holocaust, during the 1967 war, etc. The preacher claimed that another one of these Astrological patterns will be occurring shortly.

The folks at the theater were just enthralled by this seemingly unlikely but significant finding, which appeared to show that God exists and is signalling us about biblical prophecies.

This is quite analogous to the Jelly Bean fluke. The non-significant Blue and Red jelly beans here are the other spurious combinations of Astrological patterns (i.e. three blood moons and a meteor shower). So of course, if you test and test the models until you find one that's significant, you'll get a fluke.
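The arithmetic behind "test and test until you find one" is short enough to write down. A minimal Python sketch, assuming 20 independent tests (as in the jelly bean cartoon) each run at the 5% level; the function names are hypothetical, and the Bonferroni line is one standard correction rather than anything proposed in this comment.

```python
def prob_at_least_one_fluke(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests looks significant."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the overall error rate near alpha."""
    return alpha / n_tests

for n in (1, 20, 100):
    print(f"{n:>3} tests: P(at least one fluke) = {prob_at_least_one_fluke(n):.2f}, "
          f"Bonferroni per-test level = {bonferroni_threshold(n):.4f}")
```

With 20 patterns to choose from, a "significant" one turns up by chance roughly two times out of three.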

It seems to me that the challenge of our entire human existence has been to understand our environment without falling into statistical pitfalls. This is as true of the 2,000-year-old Big Data mining of Astrology as it is in contemporary problems.

George writes:

FYI here was Buffett's response to those who thought he was "just lucky": The Superinvestors of Graham-and-Doddsville

jw writes:

George,

Maybe in 1984 when that was written, but these days Buffett has much easier ways of making money - rent seeking.

Lend critical name and financial support to a presidential candidate, candidate wins, buy one of two railroads with access to fracking fields, president refuses to allow pipeline, oil has to ship via rail (despite lives lost and environmental damage due to accidents), Buffett declares that oil on his RR has to go on new tankers built by Buffett's oil tanker subsidiary, pocket billions.

As old as government itself. Someday, a history will be written with Buffett as the 21st century's robber baron, but not by today's media.

JakeFoxe writes:

Great episode. One element that I wish had been discussed more is that the need for certainty is situational. The statistical methodology, significance, and physical mechanism for a medical trial are of the utmost importance, and must be understood and established before a decision can be made. When trying to figure out whether a blue button or a red button gets more conversions on an eCommerce website, you are willing to make a decision on the scantest of evidence. These are somewhat different types of statistical analysis in my mind.

I also get concerned when there is a push back to first principles that we may miss truly interesting novel results. Data mining can result in great avenues for investigation, and novel results should be communicated to the community even if they aren't understood, because they can inform others' research or explorations. They just need to be advertised properly: as the start of research, not the end. And, as the previous poster mentioned, a result like hand washing saves lives is worth respecting if it has a relatively clear statistical upside and a hard to see downside risk.

David Hurwitz writes:

Mark K., Jake Foxe,

There is an incredible, tragic story behind the man who did figure out that washing hands saves lives.

Described as the "savior of mothers", [Ignaz Philipp] Semmelweis discovered that the incidence of puerperal fever could be drastically cut by the use of hand disinfection in obstetrical clinics. Puerperal fever was common in mid-19th-century hospitals and often fatal, with mortality at 10%–35%. Semmelweis proposed the practice of washing with chlorinated lime solutions in 1847 while working in Vienna General Hospital's First Obstetrical Clinic, where doctors' wards had three times the mortality of midwives' wards.
http://en.wikipedia.org/wiki/Ignaz_Semmelweis

In those days doctors would go straight from dissecting a cadaver to examining pregnant women without washing their hands or changing their bloody gown! Semmelweis didn't need sophisticated statistics but got immediate, profound resolution. You'd think he would have been hailed as a hero, but instead was ignored, ridiculed, and ruined. Profound arrogance, and perhaps the cognitive dissonance created with the suggestion that the doctors were killing their own patients, meant the doctors in the hospital continued their previous practice despite the overwhelming evidence.

[See also the related EconTalk podcast episode with an extended discussion of Semmelweis at http://www.econtalk.org/archives/2009/03/klein_on_truth.html. --Econlib Ed.]

George writes:

jw,

Thanks for your comment regarding the 1984 article...as well as your other very insightful comments above. I do not disagree about Buffett's recent rent-seeking phase. That said, it seems his performance, even through 1984, does require some explanation (luck vs. skill) in order to reconcile it with the Efficient Market Hypothesis.

jw writes:

So the example above was to show that Buffett is not just a gifted stock picker, that he is not above using blatant rent seeking to enhance his returns (at some cost to his reputation). He clearly has talent (or did when he started, again, markets have evolved drastically since then).

That being said, your premise is in error. EMH is nonsense. Sure, given premises and conditions, hypothetically some math works and you get a Nobel prize, but none of those premises exist in the real world.

In April of 2000 or in May of 2007 there were plenty of market analysts detailing how overvalued markets were and pointing out that we were in a bubble. The information existed and was widely disseminated. It was obvious to everyone in hindsight. So the premise that information is rapidly disseminated and provides no advantage is obviously untrue.

Besides, there are many examples other than Buffett showing that some very good managers exist and have outlasted the "Lucky" equation. Unfortunately, by the time that they are proven to be the superstars, they are locked up or proprietary or exceedingly expensive (which requires even more faith that their "luck" will continue...).

Then there is the question of fundamental market change discussed above. How will their past techniques be affected by it?

Evaluating systematic trading strategies is hard. Evaluating discretionary managers is even harder given the "Lucky" factor and lack of transparency.

Go into a high net worth broker and they will give you poorly modeled, purely hypothetical portfolio allocations "based on your risk tolerance" (in truth they plug numbers into SW and you get a canned portfolio in less than a second). Ask questions about average client performance over ten years and you will get obfuscation and dodges.

Anyway, on too long again.

(FWIW, we are in a bubble now. It will NOT end well. And I have been wrong for two years and underperformed the stock market...)

David Hurwitz writes:

I think the main point about Warren Buffett's "The Superinvestors of Graham-and-Doddsville" (and Benjamin Graham's book, The Intelligent Investor) is not that there are a few super-genius outliers with great skill, but that with value investing an "intelligent investor" who engages in investing as a full time pursuit can do far better than the average market returns.

"It is extraordinary to me that the idea of buying dollar bills for 40 cents takes immediately with people or it doesn't take at all."
--Warren Buffett

The article (based on a 1984 talk) ends with:

"In conclusion, some of the more commercially minded among you may wonder why I am writing this article. Adding many converts to the value approach will perforce narrow the spreads between price and value. I can only tell you that the secret has been out for 50 years, ever since Ben Graham and Dave Dodd wrote Security Analysis, yet I have seen no trend toward value investing in the 35 years that I've practiced it..."
Since then, has the even greater fame of Buffett indeed narrowed the spreads between price and value? This question can possibly be answered by following the winnowing process recommended by Graham in The Intelligent Investor (i.e. start with a stock guide and begin by looking at stocks with a PE ratio of 9 or less,...). Are there a smaller percentage of stocks that meet Graham's first-pass criteria now than in 1970 (the year Graham used for examining in the last edition of his book) ?

Also, I'm wondering if anyone has calculated what the actual (then) present values were for historical stocks with earnings data for the subsequent 30 years or so (after which there is diminishing contribution to the present value calculation). Would there be a relationship between the retrospectively calculated present values divided by the then prices and the then PE ratios?

jw writes:

"Are there a smaller percentage of stocks that meet Graham's first-pass criteria now than in 1970 (the year Graham used for examining in the last edition of his book) ? "

Yes. The strict G&D methodology is long gone. You can't find (or it is very rare to find) a stock with any long term potential for less than their cash or even book value. There were probably some in the depths of the last crash, but they were fleeting.

A quick stock screen of large caps shows just seven US companies with PE<10, and positive profits and growth. The cheapest wrt cash is Voya, which is priced at 4x cash.

So now it is all "relative value", and that is NOT G&D.

jw writes:

Edit: Apparently the editor doesn't like the "less than" sign.

- with PE's less than 10 and positive earnings and growth. The cheapest with regards to cash is Voya at four times cash.

[In html, the less-than sign on the keyboard begins a hidden command. Instead, use &lt; for a visible less-than symbol. I've fixed your previous comment. --Econlib Ed.]

Josh Marvel writes:

There was one thing I didn't understand that I'm hoping to get some insight into here. Dr. Harvey stated "I generated a series of random numbers, with an average return of zero and a volatility that mimicked the S&P 500." I don't understand why the average return needs to be, or should be, zero. Doesn't that mean that companies aren't adding value? This becomes even more interesting when Buffett is discussed because of his stance on value investing.

In my mind, I think of Apple before it releases the iPhone vs Microsoft before Longhorn. Over the long term, one added value while the other didn't. There may have been small changes in the stock day to day, but over the long term, one added value to the company while the other didn't. That's why I don't understand why the average return was based on zero, and doesn't that also provide evidence for investing where you see value?

jw writes:

The example was with respect to finding outliers in strategies that might APPEAR to be valid but are not.

So he generated 200 useless strategies (by design), but one of them turned out to APPEAR useful if someone didn't understand the limitations of the testing strategy.

The designed return of zero was only for this experiment and the results are only applicable to the point that the Prof was making about finding errors in strategy development, not the general market.
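A minimal Python sketch of that kind of experiment, not Prof. Harvey's actual code. It assumes 200 strategies, 10 years of monthly returns, a true mean of zero and 15% annual volatility as a stand-in for the S&P 500; all of those settings, the seed and the function name are assumptions made for illustration.

```python
import math
import random
import statistics

def simulate_t_stats(n_strategies=200, n_months=120, annual_vol=0.15, seed=0):
    """t-statistics of the mean return for purely random, zero-mean strategies."""
    rng = random.Random(seed)
    monthly_vol = annual_vol / math.sqrt(12)
    t_stats = []
    for _ in range(n_strategies):
        returns = [rng.gauss(0.0, monthly_vol) for _ in range(n_months)]
        mean = statistics.mean(returns)
        std_err = statistics.stdev(returns) / math.sqrt(n_months)
        t_stats.append(mean / std_err)
    return t_stats

t_stats = simulate_t_stats()
passed = sum(1 for t in t_stats if abs(t) > 2.0)
print(f"best t-statistic among useless strategies: {max(t_stats):.2f}")
print(f"strategies clearing the usual |t| > 2 bar: {passed} of {len(t_stats)}")
```

Every strategy here is worthless by construction, yet picking the best of 200 after the fact will usually produce something that clears the conventional bar, which is the selection effect the experiment was built to show.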

David Hurwitz writes:

Thanks JW for that information! After reading your reply I found the great Econtalk interview with Professor Fama himself!
I also read this link to Fama & French’s article on their website entitled “Luck versus Skill in Mutual Fund Performance.”

I have what seems to me a valid way to estimate how many lucky outcomes we would expect from random chance. It is just based on the assumption that the distribution of returns from 10,000 money managers would be normal if there was no skill, only luck. I'm wondering if it makes sense.

Since we presumably have over 2000 data points, each of which represents the returns of a given fund, the sample standard deviation (standard error) and the population standard deviation are the same thing. That means that z-scores can be calculated (which represent the number of standard deviations from the mean). If I had to estimate the standard deviation by the standard error I would need to use the t-statistic. With so many degrees of freedom as with a distribution of around 10,000 data points, the t-statistic gets close to looking like a normal distribution.

The link to the Fama and French website in turn has a link to a more detailed article version but I really couldn't discern answers to such questions as 1) the actual number of lucky outcomes that was predicted by chance for that population size, and 2) how many combination lucky/skillful outcomes were actually observed?

[Review: Normal Distribution 68-95-99.7 rule means that 68% of the population is within +/-1 standard deviation of the mean, 95% of the population is within 2 standard deviations of the mean, and 99.7% of the population is within 3 standard deviations of the mean].

Let’s define lucky outcomes as those that did better than 99.7% + 0.15% (one half of the remaining 0.3% outside of the 99.7%, being the lucky one of the two tails of the normal distribution) = 99.85% of the population (i.e. they are 3 or more standard deviations above the mean, assuming a perfect normal distribution of the returns of 10,000 or so funds).

I’m thinking the distribution should be close to a normal distribution as a consequence of the Central Limit Theorem, because the distribution comes from summing of random variables (the returns) formed from the diversified portfolios of each fund, and the sum of random variables tends toward a normal distribution.

Thus, out of a population of 10,000 with a normal distribution we would expect 15 lucky outcomes.
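The 15 comes from the rounded 68-95-99.7 rule. A minimal Python sketch of the same count using the exact normal tail; it keeps the assumption made above that fund returns are exactly normal, and the function name is hypothetical.

```python
import math

def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

n_funds = 10_000
tail = normal_upper_tail(3.0)
expected_lucky = n_funds * tail
print(f"P(Z > 3) = {tail:.5f}")
print(f"expected funds 3+ sd above the mean: {expected_lucky:.1f}")
```

The exact tail is about 0.135% rather than 0.15%, so the expected count is closer to 13 or 14 than 15; the conclusion is unchanged.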

I was surprised to see a term in the “Fama-French Three-Factor Model” for the “book-to-market factor.” This is from Fama and French’s web site:


When we use the three-factor model to explain the monthly percent returns of the aggregate fund portfolio for 1984-2006, we get,
$R_{Pt} - R_{ft} = -0.07 + 0.96\,(R_{Mt} - R_{ft}) + 0.07\,SMB_t - 0.03\,HML_t + e_{it}$,
where $R_{Pt}$ is the return (net of costs) on the aggregate mutual fund portfolio for month t, $R_{ft}$ is the riskfree rate of interest (the one-month T-bill return for month t), $R_{Mt}$ is the cap-weighted NYSE-Amex-Nasdaq market return, and $SMB_t$ and $HML_t$ are the size and value/growth returns of the three-factor model.
The regression says that the aggregate mutual fund portfolio has almost full exposure to the market portfolio (a 0.96 dose, which is close to 1.0), but almost no exposure to the size and value/growth returns (0.07 and -0.03, which are close to zero).

At seeing the book-market effect seemingly marginalized as "close to zero" I found a more pronounced contribution of the term attributed in Principles of Corporate Finance, Brealey and Myers, 7ed, 2002:

“Since 1928 the average annual difference between the returns on value and growth stocks has been 4.4 percent.”

4.4% is a rather large difference, so I am confused why the aggregate mutual fund portfolio was described as having “close to zero” exposure to the “value/growth returns".

Also low book to market price doesn’t really meet all the Graham & Dodd criteria (reference: page 209, 4th revised edition of The Intelligent Investor under the section entitled: “A Winnowing of the Stock Guide”). By including “earnings stability, some current dividend, and earnings growth” as additional criteria, Graham seems to be buying such a bargain that even if the stock is held a long time, the present value of the future earnings stream will justify continuing to hold the stock as long as the G&D conditions hold, even if the price of the stock remains "underpriced". Thus, you can buy more future earnings per investment dollar, and there is a rational physical reason behind G&D value investing.

Graham would exclude a low PE stock that was not paying dividends out, though such a stock might be counted in Fama and French as a “value” stock based on book-to-market price. The Fama and French draft lists Oct 2007 as the first draft date, and the updated date is given as Dec. 2009. As JW said, by then the G&D value opportunities were gone, so one wouldn’t expect to see active G&D value funds then. If there were at some point, for instance, 10 G&D value funds out of 10,000 funds, then it would be unlikely that even one would be in the top 0.15% according to the efficient market hypothesis.

At some point between Buffett’s 1984 talk when he said value investing was still relatively uncommon, and when there were few value opportunities left, there would have been strong demand pushing up the prices of the value stocks until they were no longer value stocks, as there was more significant competition for what value opportunities remained. If the average G&D stock was purchased in 1984 at an assumed 40 cents on the dollar, and sold in 2000 for 80 cents on the dollar, there would have been an extra 4.4% annualized return for those that held G&D value stocks since 1984. It might be interesting to calculate what the return of a fund of the 150 G&D stocks (as estimated by Graham in 1970), meeting the six G&D “criteria of selection,” would have achieved if purchased in 1970 and sold, in say, 1984.
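The 4.4% in the 40-cents-to-80-cents example is a compound annual rate over the 16 years from 1984 to 2000. A minimal Python sketch of that one calculation; it assumes the only thing changing is the price-to-value ratio, and the function name is hypothetical.

```python
def annualized_return(start_value, end_value, years):
    """Compound annual growth rate between two values."""
    return (end_value / start_value) ** (1.0 / years) - 1.0

# Price rises from 40 to 80 cents per dollar of value between 1984 and 2000.
extra = annualized_return(0.40, 0.80, 2000 - 1984)
print(f"extra annualized return from the re-rating: {extra:.1%}")
```

Doubling over 16 years works out to roughly 4.4% a year, on top of whatever the underlying value itself earned.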

jw writes:

I haven't done the analysis, but my assumption would be that the manager returns would be anything but normal. They would probably be highly clustered close to the median return (leptokurtic), but a little lower as they do incur costs. Professional managers are typically incentivized to slightly outperform the indexes as it puts them in the optimal position to gather assets and not lose their jobs. They don't always accomplish this.

Also, Fama's work on Three Factors was entirely in sample. This is a common problem with investment theory (and climatology) research.

As an example, the Dogs of the Dow is a well known strategy with a couple of decades of out of sample testing and it has generally held up well with respect to returns with slightly higher volatility than the Dow but less than the S&P. It is also based on a valuation concept. (Again, this is not a recommendation, but an example...)

Gav G writes:

jw,

Thanks for your in-depth 2nd comment regarding the various areas that need to be considered when testing the validity of a trading/investing strategy. I'm guessing you've got personal experience in testing/developing trading systems too.

Of the ones you listed, the mistake I see most often amongst trader friends is related to "degrees of freedom". In my experience, many seem to wrongly think they need to include more and more variables, as having just a few is deemed "too simple" or has "too much noise".

One tool you didn't mention is Monte Carlo simulation. Monte Carlo simulations are used to approximate the probability of certain outcomes by running multiple trial runs (simulations) using variables that are similar to (or differ only slightly from) those in your original test. Basically, it is asking the question, "What if the past had been slightly different?"
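One common way to ask "what if the past had been slightly different?" is to resample the strategy's own trade returns with replacement (a bootstrap). The Python sketch below assumes that variant, treats trades as independent, and uses made-up per-trade returns; the function name, the seed and the figures are all hypothetical.

```python
import random

def bootstrap_total_returns(trade_returns, n_sims=1000, seed=0):
    """Total compounded return of each resampled trade history."""
    rng = random.Random(seed)
    n = len(trade_returns)
    totals = []
    for _ in range(n_sims):
        sample = rng.choices(trade_returns, k=n)  # draw trades with replacement
        growth = 1.0
        for r in sample:
            growth *= 1.0 + r
        totals.append(growth - 1.0)
    return totals

# Hypothetical per-trade returns from a backtest.
trades = [0.04, -0.02, 0.03, -0.05, 0.06, 0.01, -0.03, 0.02, -0.01, 0.05]
totals = bootstrap_total_returns(trades)
losing = sum(1 for t in totals if t < 0) / len(totals)
print(f"share of simulated histories that lose money: {losing:.0%}")
print(f"worst simulated total return: {min(totals):.1%}")
```

The spread of outcomes across the simulated histories gives a feel for how much of the original backtest result depended on the particular order and mix of trades that happened to occur.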

jw writes:

Gav,

Since you asked, I would start with:

Misbehavior of Markets - Mandelbrot
Evidence based Technical Analysis - Aronson
Trading Risk - Grant
Schwager on Futures - Schwager

In addition, there are thousands of research papers and reports on the web for free. After reading a number of them, you will quickly be able to decide which are useful and which are not.

I would start with anything by Cliff Asness, David Harding, Kyle Bass, Albert Edwards and John Hussman's weekly column.

Robert Ferrell writes:

The announcement of the Higgs "discovery" (http://cms.web.cern.ch/news/observation-new-particle-mass-125-gev) was really an announcement that nobody had a better explanation for the observed data:
"The range of 122.5–127 GeV cannot be excluded because we see an excess of events in three of the five channels analysed:"

Financial strategists who claim their system is desirable have it backwards. A valid statement might be: "Given the tools, knowledge and data available, we cannot exclude the possibility that returns from our investment strategy will not exceed returns from the S&P over the next 5 years."

Dr Harvey suggests that by including non-zero skew his probability estimates are more valid. To continue the analogy with the Higgs Boson, in that case the distribution of energies (experimental results) was predicted and "known". (Hence "5 sigma" has a precise meaning) In financial modeling, the distribution of returns is rarely known, so even including skew in the parameterization probably does not provide a good model (except by luck). Insisting on more sigmas from the wrong distribution does not necessarily lead to a more robust result.
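A small Python sketch of why "more sigmas" depends on the distribution. It compares the chance of exceeding k standard deviations under a normal distribution with the same chance under a Laplace (double-exponential) distribution of equal variance. The Laplace is only a convenient fat-tailed stand-in chosen here because its tail has a closed form; nothing in the discussion claims returns follow it.

```python
import math

def normal_tail(k):
    """P(X > k standard deviations) for a normal distribution."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))

def laplace_tail(k):
    """Same probability for a Laplace distribution with equal variance."""
    # Laplace scale b gives standard deviation b * sqrt(2), so k sd = k * sqrt(2) * b.
    return 0.5 * math.exp(-k * math.sqrt(2.0))

for k in (2, 3, 5):
    ratio = laplace_tail(k) / normal_tail(k)
    print(f"{k} sigma: normal {normal_tail(k):.2e}, Laplace {laplace_tail(k):.2e}, "
          f"ratio {ratio:,.0f}x")
```

At 5 sigma the two tail probabilities differ by a factor of more than a thousand, so a threshold that is very strict under one assumed distribution can be far looser under another.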

David Hurwitz writes:

I was intrigued after reading JW’s comment that his “assumption would be the manager returns would be anything but normal,” so I went back and did some homework trying to make sense of a topic I am unfamiliar with (but got frustrated with my lack of progress). I had a hard time trying to find sources for what the distribution of stock returns over time looks like, but this article includes some of what I was looking for.

Admittedly, I still was not able to understand “Luck versus Skill in the Cross-Section of Mutual Fund Returns.” It would be great if Professor Fama returns to EconTalk and advance questions were considered from the listeners! In addition to listening to Professor Fama on EconTalk, I also found the videos for a Yale course given by Economics Nobelist Robert Shiller. Professor Shiller expressed his skepticism for the Efficient Market Hypothesis in this lecture: http://oyc.yale.edu/economics/econ-252-08/lecture-6

It seems to me that "information" alone isn’t enough of a criterion for an efficient market. There may be infinite sources of information and yet most investors probably act on just a few signals, and disagree about which are most important, and how to weigh them. Furthermore, the combinations of the signals (such as the criteria used by Graham) lead to even more possibilities. If pricing is based on “information”, why do the buyer and seller disagree on the value and future prospects of a stock? Would there be reasons other than EMH to explain why managed mutual funds don’t seem to do better than index funds (for example the need to show short term results, and of course lots of dumb luck confused with skill)? If we went back and looked at the 150 companies that Graham said met his first pass criteria, and looked at their subsequent performance, would there be evidence against EMH if those stocks as a whole performed far better than the market? Would significantly non-normal probability density distributions of individual stock returns be evidence against EMH?...

Comments for this podcast episode have been closed