Why Most Of The Studies You Read About Are Wrong
Stephen Ziliak of Roosevelt University and Deirdre McCloskey of the University of Illinois at Chicago have done the world of academic research the greatest, but least welcome, of favors. In their book, The Cult of Statistical Significance, in their own journal articles, and in the forthcoming Oxford Handbook of Professional Economic Ethics, they’ve meticulously gone through many thousands of journal articles and scrutinized the soundness of their statistical methods. Their survey uncovered an epidemic. Depending on the science of the journal (economics, medicine, psychology, or other) and the type of statistical abuse (use of sampling techniques where no samples have been taken, omission of the definition of the unit of the metric tested, misinterpretation of the definition of ‘statistically significant’, absence of any information describing the magnitude of the effects, etc.), somewhere between 8 and 9 out of every 10 articles in the leading journals of science have serious flaws in their use of statistics. In that sense, the results of articles published in the likes of the American Economic Review and New England Journal of Medicine are not just wrong in their statistical logic; they are worse than wrong, lacking enough statistical coherence and common-sense interpretation of causes and effect sizes for anyone to even know what the reasonable conclusions are and how to evaluate them.
A lot of this is non-debatable. Statisticians know the rules: sampling methods are for situations in which there is sampling. P values, the most commonly used index for determining the “statistical significance” of a result, do not measure the probability that the theory is right, nor do they tell us the practical importance of the result, not even close. Researchers should tell us the actual effects of a drug or fertilizer or economic policy in a clearly defined unit. If a researcher tells us that unemployment goes down when the government increases stimulus spending, they should start by telling us how they are measuring unemployment and how much it changes with X% more stimulus: in hours worked? Minutes worked? Millions of people employed? Thousands…hundreds…tens? How much? When they tell us that some particular stock or bond characteristic causes higher returns, how much higher? When they say they are adding or reducing risk, how much risk, measured how? Statistics professors know that business and medical research students often get just enough information from their classes to be dangerous, enough to calculate ‘statistical significance at the 95% level’, but not enough to know what that actually means, what it proves and what it never can prove. FiveThirtyEight recently asked several prominent academics, on camera, to explain the P value. The results were less than confidence-inspiring, no matter what one’s confidence interval.
That’s a pretty big problem measured in any reasonable unit. P values rule our world. They govern anything that’s based on journal articles: medicines, weight loss studies, longevity data, public policy, class-action lawsuits. Every pitch I’ve ever seen from a money manager or consultant who is pushing an investment solution includes the claim that the results are statistically significant to a 95% level, or even a 99% level. Fortunes and even lives rise and fall with these things called P values. Ziliak and McCloskey persuasively argue they should not. P values do not and cannot provide the information we actually want, which is a sound set of interpretations of effect sizes and their meaning to actual decision making.
Here’s my shot at defining this variable upon which we all depend:
When a researcher does a study, he or she is usually dealing with two theories. One is their real theory (for example, momentum drives stock returns or this diet pill helps with weight loss) and the other is ‘the null hypothesis’ (momentum has nothing to do with stock returns or this diet pill has no effect on weight loss). In the diet pill example, the scientists give some people the real compound and some people get a placebo. Since it’s not practical to divide mankind into two groups of 3.5 billion each and give the diet pill to one group and the placebo to another, then some sample must be used. There are two schools of thought on the sample question: ‘randomistas’ (as they’re called by this year’s Nobel Laureate in economics, Angus Deaton) and balanced designers. Randomistas have been gaining influence in the policy arena for the past twenty years—particularly in poor and developing nations, where they conduct most of their randomized trials. Randomized controlled trials have been treated by journal editors as the ‘gold standard’ of publishability. ‘Double-blind, random-sample’ is a VIP pass to getting in print. The minority report (but growing) has been that random sampling allows biases to creep into the testing unexamined and that it’s better to design experiments so that variables such as age and gender and class are balanced. In other words, we can choose people randomly, in which case we’re bound to randomly get more boys than girls or vice versa (as happened in a famous study of Chinese students’ grades with and without eye glasses). Or, instead of letting the random number generator choose people, you choose people in ways designed to balance various characteristics. Otherwise you can get a situation in which girls turn down glasses and boys take them and then we’re really studying differences in grades between girls and boys when we think we’re studying the difference between glasses and the naked eye. This debate is still going on.
The problems are even worse in econometrics and finance, in which it is common to not sample at all. If a researcher has a data set, say Shiller’s PE values for the past century, and those are compared to stock market returns, then there isn’t any real sampling. The whole universe of available data is used. In a case like that P values are not really applicable at all, and yet there they are, plastered all over articles and prospectuses. Prospects sit through PowerPoint purgatory and what holds them there is the all-powerful P, which allegedly makes it all worthwhile.
Back to defining the elusive and dictatorial P values. Since we can’t test 7 billion people, we need some idea whether the 70 people we sampled give us results which are reasonably close to the 7 billion. So we average our samples and we look at the results and we ask this question: if the null hypothesis were true (if the diet pill has no effect or if eyeglasses do not affect grades or if GDP has no effect on stock market returns) how likely is it that we would get a result (a difference from the null) larger than the result we’ve actually got? The P value is a numerical answer to that question. The usual standard is a P value of 5% or less. If the calculated P value is considered low by some bright-line standard, below 5% (0.05), for example, then the result is called “statistically significant”; if not, not. That is, if P > 0.05, the result is said to be too likely attributable to random chance, and is neglected, perhaps not even published, no matter the real-world effect (what Ziliak and McCloskey call the “oomph”) of the result. Ironically, this way of measuring randomness was chosen randomly. Way back in 1925 R.A. Fisher, the father of the P value, simply decided that 5% was the threshold which henceforth and forevermore would determine statistical significance and thus (however erroneously) “importance,” so let it be written in the textbooks, so let it be done in society.
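To make that question concrete, here is a minimal sketch of one way to put a number on it: a permutation test, which shuffles the pill and placebo labels to mimic a world in which the null hypothesis is true. This is only one of several ways to compute a P value and is not taken from Ziliak and McCloskey; the function name and the weight-loss figures are invented for illustration.

```python
import random


def permutation_p_value(treated, placebo, n_shuffles=10_000, seed=0):
    """Estimate a two-sided P value for a difference in group means.

    If the null hypothesis is true, the group labels are arbitrary, so we
    shuffle them and count how often the shuffled difference is at least
    as large as the difference actually observed.
    """
    rng = random.Random(seed)
    n_treated = len(treated)
    n_placebo = len(placebo)
    observed = abs(sum(treated) / n_treated - sum(placebo) / n_placebo)

    pooled = list(treated) + list(placebo)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled_treated = pooled[:n_treated]
        shuffled_placebo = pooled[n_treated:]
        diff = abs(sum(shuffled_treated) / n_treated
                   - sum(shuffled_placebo) / n_placebo)
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles


# Hypothetical weight loss in kilograms (made-up numbers, not real trial data).
pill = [2.1, 3.4, 1.8, 4.0, 2.9, 3.1, 2.2, 3.8]
sugar_pill = [1.0, 2.0, 0.5, 1.9, 1.2, 2.4, 0.8, 1.6]

p = permutation_p_value(pill, sugar_pill)
effect = sum(pill) / len(pill) - sum(sugar_pill) / len(sugar_pill)
print(f"average extra weight lost: {effect:.2f} kg, estimated P value: {p:.4f}")
```

Note that the script prints the effect in kilograms next to the P value. The P value alone says nothing about how much weight anyone lost.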
The P value doesn’t do the job it’s been charged with. Most people want to know: Does the diet pill work? How much will I lose, and with what probability? How sure are you that it will work? P values answer none of those questions. Instead researchers test the thing which we’re not really testing and we use a statistic which doesn’t tell us how big the effects are and our final vindication is a single number which is usually being used the wrong way. Many people, some of whom write journal articles, think this is the way it’s supposed to work, but it isn’t.
And it seems like the vast majority of general interest journalists who write articles which summarize journal studies get it wrong. A P value is not a test of whether wine and chocolate help people live longer or whether minimum wages raise unemployment rates. It doesn’t tell you whether increases in money supply increase gold prices. It doesn’t test the theory. It tests the non-theory. At best, the P value tells the likelihood of observing a larger difference in the sample between diet pills A and B, assuming there is no difference between the pills. But that is not saying very much at all, even when P falls below the arbitrary 5 or 1 percent threshold. Small, unimportant differences will appear statistically significant, and large and important differences might escape detection on grounds that they are “insignificant” statistically speaking according to some censorious rule cooked up by a statistician or bureaucrat.
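The last point can be checked with a few lines of arithmetic. The sketch below assumes a simple two-group comparison with a known, common standard deviation and uses the normal approximation; the helper name and all the numbers are hypothetical, chosen only to show how sample size, rather than oomph, can drive the P value.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided P value for a difference in means (normal approximation).

    Assumes two groups of equal size with the same known standard deviation.
    """
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# A trivial effect (0.1 kg) measured on a huge sample.
tiny_effect_huge_sample = two_sided_p(mean_diff=0.1, sd=5.0, n_per_group=100_000)

# A large effect (5 kg) measured on a handful of people.
large_effect_small_sample = two_sided_p(mean_diff=5.0, sd=5.0, n_per_group=5)

print(f"0.1 kg effect, 100,000 per group: P = {tiny_effect_huge_sample:.6f}")
print(f"5 kg effect, 5 per group:         P = {large_effect_small_sample:.3f}")
```

Under these assumptions the 0.1 kg difference clears the 5% bar easily while the 5 kg difference does not. With only five people per group a t distribution would be the more careful choice, and it would make the second P value larger still.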
If the null hypothesis is tested with a P value less than 5%, it might mean that chocolate adds longevity or it might mean that something else added longevity, something which was not controlled for because the people were randomly chosen and randomness must rule in the land of the randomistas. It might even mean that the null hypothesis is actually correct. A 5% threshold means that, if the chocolate had no effect and the sample was unbiased, you would erroneously declare “significance” one out of twenty times if the experiment were repeated (Ziliak and McCloskey and others have found that experiments rarely are). In other words, you would get results like these one time out of twenty even if the null hypothesis were true! You might just be unlucky enough to be reading the study which says that loads of alcohol are no problem for your liver, based on the 1 sample in 20 which would erroneously show that result. A P value threshold of 5% basically says that, when there is no real effect, one time out of twenty a study using this method will give you the wrong conclusion. Be unlucky enough to read a study before 2008 and the data will show you that real estate is a very stable asset class and it will do it with an undeniably compelling P value.
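The one-in-twenty figure can be checked by simulation. The sketch below is a made-up setup and not anything from Ziliak and McCloskey’s surveys: it runs many imaginary studies in which the treatment truly does nothing, using normally distributed outcomes with a known standard deviation of 1, and counts how often a study still comes out “significant” at the 5% level.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(n_studies=2_000, n_per_group=50, alpha=0.05, seed=1):
    """Share of no-effect studies that still reach P < alpha."""
    rng = random.Random(seed)
    standard_normal = NormalDist()
    significant = 0
    for _ in range(n_studies):
        # Both groups are drawn from the same distribution: the null is true.
        chocolate = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        no_chocolate = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = sum(chocolate) / n_per_group - sum(no_chocolate) / n_per_group
        z = diff / sqrt(2.0 / n_per_group)  # standard deviation assumed known (1.0)
        p = 2.0 * (1.0 - standard_normal.cdf(abs(z)))
        if p < alpha:
            significant += 1
    return significant / n_studies


rate = false_positive_rate()
print(f"share of no-effect studies declared significant: {rate:.3f}")
```

The printed share should land near 0.05. It describes how often the method cries wolf when nothing is there; it is not the probability that any particular published finding is wrong.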
Most of these criticisms of studies are non-controversial among professional statisticians. They know that statistical significance and P values are routinely abused. People who write these studies are not usually statisticians, though. They are medical researchers or economists or financial analysts who are using (or quite often misusing) statistical theory. The American Statistical Association is about to release a public statement from its sub-committee on P values (Ziliak is on the committee) to address the problem. But in the meantime, our lives, our laws, our planners, our doctors, our health insurers, our portfolio managers are governed by studies that are giving them a false sense of confidence about their conclusions, and then betting our assets and even our lives on that unstable foundation.
I sat down across a Skype line with Professor Ziliak recently to talk about nerdy statistics, beer, Guinnessometrics, weight loss, randomistas, economists keeping glasses from children who need them, race and IQ and how bad statistical theory empowered eugenics. A partial transcript (edited for clarity) can be found below:
BOWYER: I suppose if somebody were to ask me about this McCloskey-Ziliak school of statistical revision, and they wanted me to sum it up in a sentence that would sort of capture attention, it would be: here are the reasons why most of the studies you read are wrong.
BOWYER: So let me ask it to you that way. Are most of the studies that we read in academic journals, studies that are done by economists, or in medical journals, are most of them in some significant way, when it deals with their statistical methodology, wrong?
ZILIAK: They might be ‑‑ a large fraction of them might be quite wrong in their conclusions. What we can document and have documented already is that eighty to ninety percent of all published articles — and it varies by field from economics to medicine, but what we’ve shown from our decades-long surveys of the major journals of science from economics to medicine is that eighty to ninety percent of them have no idea if they’re right or wrong. Reason: they’re focusing on the wrong object. They’re focusing on statistical significance rather than on what they should focus on, which is economic, or medical, or psychological, or wellbeing significance.
So if I was going to give you the one-liner, the one sentence to tell people who ask you in the elevator, I would say, as we say in the book, “Precision is nice, but oomph is the bomb.” What you really want to know is about the oomph of your coefficient, the oomph of your economic inference, not how much you’ve controlled one kind of error ‑‑ random error measured by statistical significance.
BOWYER: So what you’re saying is we don’t even know enough to know whether they’re wrong. Their methodology is such that it’s difficult to say whether they’re right or wrong because they’re looking at the wrong thing in order to support their conclusions.
ZILIAK: Exactly. So, for example, one thing you need to know in order to make a quantitative assertion, and to defend that assertion in a company of peers, scientific and otherwise, is basic measurement of descriptive and summary statistics. And in our surveys of the economics literature, McCloskey and I were alarmed to find that a large fraction of papers don’t even tell you in what units the variables are being measured, let alone give averages and standard deviations so you can begin to figure out the real economic magnitude of their correlation coefficients or their regression parameters, and that kind of thing. So without that basic information from units of measurement to means, medians, standard deviations and other looks at the distribution of the data, we just have no idea, Jerry. We don’t know if they’re right or wrong.
BOWYER: So if the studies say ‘if we follow this policy we’ll get a reduction in unemployment’, we don’t know whether this reduction in unemployment is in unemployment rate, unemployment hours, unemployment seconds, unemployment nanoseconds. Right? When we say ‘spend this much on a stimulus project, we have statistically significant evidence that it will decrease unemployment’, and the study doesn’t even tell you whether we’re decreasing it by minutes, seconds, or millions, or ones or twos. We might be putting one person on the payroll, or we might be putting ten million people on the payroll; without putting your output in a unit, without saying what the unit is, we have no idea what the oomph is.
ZILIAK: Absolutely. Absolutely. It would be as if you’re taking your mother for Sunday brunch and the car runs out of gas, and you ask mom to push the car. She might actually move it a small centimeter or two, but it’s probably not the best decision.
BOWYER: She’s never going to get you there.
BOWYER: Right. My mom’s tough; she goes for long walks every day, but she’s not going to push my car ten miles.
ZILIAK: I think my mom would go back to hitting me with the shoe rather than do that.
BOWYER: All right. So this is the whole debate that we see in your book with Deirdre McCloskey, “The Cult of Statistical Significance”, and some of this material is also in your contribution to “The Oxford Handbook on Professional Economic Ethics”, which is just coming out now.
There’s this ongoing debate that actually goes back more than a century between something called statistical significance — that’s a kind of term of art — and oomph, which is your own term of art. What are those two things? What is statistical significance and what is oomph?
ZILIAK: Sure, yeah, good question.
Statistical significance, let’s start with that first. So we all want to distinguish random from real error. You know sometimes people make deliberate or undeliberate, that is, unconscious, blunders. This could be measurement error, using incorrect units of accounting or incongruous units of accounting; that’s especially common. In experiments on beer, you might be using very different types of barley malt or of hops that interact differently with each other. And some of the errors that you make will be real, and some will be random. And the idea of the test of statistical significance is to try to estimate to what extent you’re looking at random variation. And of course, the idea is to try to minimize that.
Statistical significance is the most widely used technique for saying whether or not a finding is important. The problem with that is that statistical significance is neither necessary nor sufficient for proving commercial, financial, spiritual, psychological, medical or other kinds of significance. Statistical significance cannot do the job.
So what is statistical significance? Let’s say that you have a well-designed experiment or survey or historical study that generates data on some observations that you care about. This could be regarding unemployment and inflation, minimum wage and employment. It could be about the way one drug versus another drug affects people’s psychological well-being or their body weights. Now, you do some basic measurements and you compute averages and medians and standard deviations, that is the variance of the data. And you also calculate correlation coefficients between variables such as inflation and unemployment. And that correlation coefficient is going to have a certain magnitude. The closer it is to one the more the variables, such as unemployment and inflation, move together; the closer it is to zero, the less they move together, the less you know about their relationship.
Well, the test of statistical significance tries to tell you something about the likelihood of seeing the correlation that you’re actually looking at. And so the idea is to imagine — and that’s unfortunately what most economists and other scientists do — they imagine a repetition of their experiment or of their survey, or whatever; they don’t actually repeat the study. That’s something that’s really come out in this reproducibility crisis.
But, anyway, the point is that some of that correlation coefficient that you’re looking at is real, and some of it occurred for random reasons. And the test of statistical significance tries to put a numerical figure on the amount of it that would be due to merely chance, or merely random reasons.
Now, you know, so one can think about many different examples. But in general, what you’re trying to do with the P‑value is to say that object A is different somehow from object B. And the way it gets worked out on the technical side goes something like this. A null hypothesis is proposed. And the null hypothesis means, in ordinary English, that there is no average difference, no difference in average or median effects between let’s say diet pill A and diet pill B, or between Irish barley Archer and English barley Archer. So the null hypothesis stipulates in advance that there’s no difference in those two things, let’s say in the amount of weight loss ability from the pill or in yield of barley corn out in the fields across the two different barleys. Okay.
ZILIAK: Then the p-value does what? It says what’s the probability of seeing a bigger difference in effect sizes between the two things being compared? What’s the probability of seeing a bigger effect size than the one that I’ve actually seen, that effect size being divided by the variation of the data.
So let me back up and say it one other way. Historically speaking, the test of statistical significance was called Student’s test of significance, Student’s t-test of statistical significance. And the t-test is measuring a thing that a lot of us can appreciate, and that is it measures the difference between object A and object B, divided by the variation within object A and object B. And of course under the strictures of the sample or the way that the data were experimentally designed. So given the experimental design, what’s the variation within those two objects, those two things being compared?
The p-value is saying something else. It says ‘assuming that there’s no difference between the two weight loss pills or the two barley yields, what’s the probability of seeing a difference larger than the difference we’ve actually observed in the data?’
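As a rough illustration of the ratio Ziliak describes (the difference between A and B divided by the variation within A and B), the sketch below computes a t-statistic for two small groups. It uses the unequal-variance (Welch) form as an assumption, since the interview does not say which version is meant, and the barley yields are invented numbers in unspecified units.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Difference in means divided by its estimated standard error.

    The numerator is the gap between the two groups; the denominator is
    built from the sample variance within each group.
    """
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical yields for two barley plots (made-up figures).
irish_archer = [48.2, 51.0, 49.5, 50.3, 47.9]
english_archer = [46.1, 48.7, 47.2, 49.0, 45.8]

t = welch_t(irish_archer, english_archer)
gap = mean(irish_archer) - mean(english_archer)
print(f"difference in average yield: {gap:.2f}, t-statistic: {t:.2f}")
```

The same gap in yield produces a bigger t when the plots vary less, which is why a t-statistic measures precision and not the size of the gap.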
BOWYER: So basically you assume what you’re pulling for not to be the case in doing a journal article, right?
ZILIAK: Absolutely. Which sounds kind of crazy, why are you investing all that money and time in testing this thing if you don’t already believe it?
BOWYER: Why are you testing the hypothesis that you’re not trying to prove? It’s almost like you’re not really testing the hypothesis that you want to be testing.
ZILIAK: Exactly. That’s exactly correct, and that’s what Bayesian statisticians have been arguing for two centuries.
BOWYER: That we should be testing the hypothesis that these diet pills make a difference. It’s interesting you’ve chosen two examples: diet pills and beer brewing, Guinness brewing. Those are in conflict with one another, you understand, right?
ZILIAK: A lot of people think so but it’s — there’s no conflict of interest here, because I don’t take money from Guinness. But it turns out that a pint of Guinness has fewer calories than a pint of orange juice.
BOWYER: All right, I’m sold.
And having bought your hypothesis I’m likely to need in the future to buy diet pills.
So let’s talk about the diet pills example. So the statistical testing is set up so that the theory being tested is that the diet pills make no difference, that there’s no effect from the diet pills. And then you run a bunch of tests, or maybe you run one test, and then you take the data from that test and use it to run imaginary tests in a Monte Carlo simulation or something like that. But it’s not actually more tests. Maybe you just ran one set of tests; that’s a very good point you make. So let’s say maybe you even do it better, you do a bunch of tests; you take a bunch of samples. And then you get data. You look at that data and you say ‘if the diet pill made no difference, how likely is it that I would have gotten these results?’ And if you find that it is five percent likely, a one in twenty chance, that you would have gotten these results or results that are more extreme, then the null hypothesis fails the test.
BOWYER: The it-doesn’t-make-any-difference hypothesis fails the test.
BOWYER: And then, here’s a leap in logic, therefore, your theory is true, because the alternative to your theory, the it doesn’t make any difference theory, is false.
BOWYER: Do I have that right?
ZILIAK: Yeah. That is precisely the line of reasoning behind the cult of statistical significance; that is exactly right.
BOWYER: All right. Let’s vivisect it for a moment here. The p-value is the probability of getting that result or a result that’s more extreme, that is, further from the expected value. And it can be extreme on one end, or it can be extreme on either end, right? It can be off on the right of the bell curve or it can be ‑‑
BOWYER: Right. So it can be two-and-a-half percent on either end of the bell curve, or it can be five percent on the upper end of the bell curve.
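The two ways of spending that five percent can be sketched with the standard normal bell curve. The snippet assumes a normally distributed test statistic, and the variable and function names are illustrative only.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1
alpha = 0.05

# Two-sided test: 2.5% in each tail of the bell curve.
two_sided_cutoff = standard_normal.inv_cdf(1 - alpha / 2)

# One-sided test: the full 5% in the upper tail.
one_sided_cutoff = standard_normal.inv_cdf(1 - alpha)


def tail_p_values(z):
    """Return (upper-tail P value, two-sided P value) for a z statistic."""
    upper = 1 - standard_normal.cdf(z)
    two_sided = 2 * (1 - standard_normal.cdf(abs(z)))
    return upper, two_sided


upper_p, two_sided_p_value = tail_p_values(1.8)
print(f"two-sided cutoff: {two_sided_cutoff:.3f}, one-sided cutoff: {one_sided_cutoff:.3f}")
print(f"z = 1.8: upper-tail P = {upper_p:.3f}, two-sided P = {two_sided_p_value:.3f}")
```

A statistic of 1.8 falls between the two cutoffs, so the same data count as “significant” on the one-sided test and not on the two-sided one.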
BOWYER: Let’s say that you run a study and it’s not actually a random or a balanced sample; you just run a study. Is the p-value valid then even in p-value terms? You understand what I’m asking?
ZILIAK: In my opinion, it’s not very valid for any purpose at all. It’s not logical, you know. If P implies Q you can’t turn around and say that Q implies P unless you’ve provided some reasons that that’s exactly the case.
BOWYER: I agree. But I’m going to ask you to defer the logical critique for a moment.
BOWYER: I understand that you’ve got a deformed form of modus tollens, right? You’ve got a fallacy of the transposed conditional. Right? If A, therefore B, does not imply: B, therefore A, right? Asserting the consequent, I think is what the old textbooks call it.
ZILIAK: Yes, that’s right.
BOWYER: So that’s a big problem. But I’m not even there yet. I’m just saying…let’s use an example that you see a lot in economics: There’s twenty years of data available on something. I go and run my twenty years of data. And I get a p-value of less than five percent. The p-value seems to me to be assuming that you’ve got a large enough sample, that you’ve got a sample that’s either random or otherwise balanced. But if you take your whole universe of available data, which is what almost everybody does that I see, then in what sense is it even meaningful to be looking at a theoretical bell curve distribution that then you compare your results to? You see what I’m driving at?
ZILIAK: Yeah. What you’re talking about is a familiar problem. It happens all the time in a lot of different areas of science, including economics. But people seem blatantly unaware of it. If you have the whole population, you have the whole universe, you don’t actually have a sampling problem. If you don’t have a sampling problem, then you don’t have any use for a test of statistical significance, which is designed with the idea in place that one has not only a sample of a larger population, but one has taken multiple samples, enough in order to justify using distribution theory of probability and, also, a test of statistical significance for making the inference.
Originally posted on Forbes.