About Rationally Speaking

Rationally Speaking is a blog maintained by Prof. Massimo Pigliucci, a philosopher at the City University of New York. The blog reflects the Enlightenment figure Marquis de Condorcet's idea of what a public intellectual (yes, we know, that's such a bad word) ought to be: someone who devotes himself to "the tracking down of prejudices in the hiding places where priests, the schools, the government, and all long-established institutions had gathered and protected them." You're welcome. Please note that the contents of this blog may be reprinted under the standard Creative Commons license.

Wednesday, January 05, 2011

The problem of replicability in science

from xkcd

by Massimo Pigliucci
In recent months much has been written about the apparent fact that a surprising, indeed disturbing, number of scientific findings cannot be replicated, or when replicated the effect size turns out to be much smaller than previously thought.
Arguably, the recent streak of articles on this topic began with one penned by David Freedman in The Atlantic, and provocatively entitled “Lies, Damned Lies, and Medical Science.” In it, the major character was John Ioannidis, the author of some influential meta-studies about the low degree of replicability and high number of technical flaws in a significant portion of published papers in the biomedical literature. As Freedman put it in The Atlantic: “80 percent of non-randomized studies (by far the most common type) turn out to be wrong, as do 25 percent of supposedly gold-standard randomized trials, and as much as 10 percent of the platinum-standard large randomized trials.” Ioannidis himself was quoted uttering some sobering words for the medical community (and the public at large): “Science is a noble endeavor, but it’s also a low-yield endeavor. I’m not sure that more than a very small percentage of medical research is ever likely to lead to major improvements in clinical outcomes and quality of life. We should be very comfortable with that fact.” Ouch.
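Ioannidis's broader point can be illustrated with a back-of-the-envelope calculation (the numbers below are illustrative assumptions, not figures from his papers): if only a small fraction of the hypotheses being tested are actually true, then even well-powered studies using the conventional significance threshold will yield a large share of false positive "findings."

```python
# Positive predictive value (PPV) of a "significant" result,
# in the spirit of Ioannidis's argument. All numbers are illustrative.
prior = 0.1   # fraction of tested hypotheses that are actually true
power = 0.8   # chance a study detects a true effect
alpha = 0.05  # chance a study on a null effect yields a false positive

true_pos = prior * power          # true hypotheses flagged significant
false_pos = (1 - prior) * alpha   # null hypotheses flagged significant
ppv = true_pos / (true_pos + false_pos)
print(f"PPV = {ppv:.2f}")  # 0.64: over a third of "findings" are false
```

Under these assumptions, more than one in three published positive results would be wrong even before any questionable research practices enter the picture.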
Julia and I actually addressed this topic during a Rationally Speaking podcast, featuring as guest our friend Steve Novella, of Skeptics’ Guide to the Universe and Science-Based Medicine fame. But while Steve did quibble with the tone of the Atlantic article, he agreed that Ioannidis’ results are well known and accepted by the medical research community. Steve did point out that it should not be surprising that results get better and better as one moves toward more stringent protocols like large randomized trials, but it seems to me that one should be surprised (actually, appalled) by the fact that even there the percentage of flawed studies is high — not to mention the fact that most studies are in fact neither large nor properly randomized.
The second big recent blow to public perception of the reliability of scientific results is an article published in The New Yorker by Jonah Lehrer, entitled “The truth wears off.” Lehrer also mentions Ioannidis, but the bulk of his essay is about findings in psychiatry, psychology and evolutionary biology (and even in research on the paranormal!). In these disciplines there are now several documented cases of results that were initially spectacularly positive — for instance the effects of second generation antipsychotic drugs, or the hypothesized relationship between a male’s body symmetry and the quality of his genes — that turned out to be increasingly difficult to replicate over time, with the original effect sizes being cut down dramatically, or even disappearing altogether. Again, the implications are disturbing. As Lehrer concludes at the end of his article: “Such anomalies demonstrate the slipperiness of empiricism. Although many scientific ideas generate conflicting results and suffer from falling effect sizes, they continue to get cited in the textbooks and drive standard medical practice. Why? Because these ideas seem true. Because they make sense. Because we can’t bear to let them go. And this is why the decline effect is so troubling.”
None of this should actually be particularly surprising to any practicing scientist. If you have spent a significant part of your life in labs and reading the technical literature, you will appreciate the difficulties posed by empirical research, not to mention a number of issues such as the fact that few scientists ever actually bother to replicate someone else’s results, for the simple reason that there is no Nobel (or even funded grant, or tenured position) waiting for the guy who arrived second. Still, in the midst of this I was directed by a tweet by my colleague Neil deGrasse Tyson (who has also appeared on the RS podcast, though in a different context) to a recent ABC News article penned by John Allen Paulos, which set out to explain the decline effect in science.
Paulos’ article is indeed concise and on the mark (though several of the explanations he proposes were already brought up in both the Atlantic and New Yorker essays), but it doesn’t really make things much better. For instance, Paulos suggests that one explanation for the decline effect is the well-known statistical phenomenon of regression toward the mean. This phenomenon is responsible, among other things, for a fair number of superstitions: you’ve probably heard of some athletes’ and other celebrities’ fear of being featured on the cover of a magazine after a particularly impressive series of accomplishments, because this brings “bad luck,” meaning that the following year one will not be able to repeat the performance at the same level. This is actually true, not because of magical reasons, but simply as a result of regression to the mean: extraordinary performances are the result of a large number of factors that have to line up just right for the spectacular result to be achieved. The statistical chances of such an alignment repeating itself are low, so next year’s performance will likely fall back toward the average. Paulos correctly argues that this also explains some of the decline effect of scientific results: the first discovery might have been the result of a number of factors that are unlikely to repeat themselves in exactly the same way, thus reducing the effect size when the study is replicated.
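Regression to the mean is easy to see in a small simulation (a minimal sketch, assuming performance is fixed skill plus independent year-to-year luck): pick the top performers of one year, and as a group they will do noticeably worse the next year, without anything "wearing off."

```python
import random
import statistics

random.seed(42)

N = 100_000
# Each "athlete" has a fixed skill level; each year adds independent luck.
skill = [random.gauss(0, 1) for _ in range(N)]
year1 = [s + random.gauss(0, 1) for s in skill]
year2 = [s + random.gauss(0, 1) for s in skill]

# Select the top 1% of year-1 performers (the "magazine cover" group).
cutoff = sorted(year1, reverse=True)[N // 100]
top = [i for i in range(N) if year1[i] > cutoff]

mean_y1 = statistics.mean(year1[i] for i in top)
mean_y2 = statistics.mean(year2[i] for i in top)
print(f"top group, year 1: {mean_y1:.2f}")  # far above average
print(f"top group, year 2: {mean_y2:.2f}")  # closer to the mean
```

The group's skill is real, so it stays above average in year 2 — but the lucky half of its year-1 score does not repeat, which is exactly the pattern a replicated study with a shrunken effect size shows.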
But that’s not all. Another major determinant of the unreliability of scientific results mentioned by Paulos is the well-known problem of publication bias: crudely put, science journals (particularly the high-profile ones, like Nature and Science) are interested only in positive, spectacular, “sexy” results, which creates a powerful filter against negative or marginally significant results. What you see in science journals, in other words, isn’t a statistically representative sample of scientific results, but a highly biased one, in favor of positive outcomes. No wonder that when people try to repeat the feat they often come up empty handed.
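Publication bias can be simulated in a few lines (a sketch under the simplifying assumption that a study is "published" only if its estimate clears the conventional significance cutoff): even when every individual study is honest and unbiased, the published subset systematically exaggerates the true effect.

```python
import random
import statistics

random.seed(0)

true_effect = 0.1   # small real effect
se = 0.2            # standard error of each study's estimate
n_studies = 50_000

# Each study produces an unbiased but noisy estimate of the effect.
estimates = [random.gauss(true_effect, se) for _ in range(n_studies)]

# A study gets published only if the estimate is "significant",
# i.e. roughly estimate / se > 1.96.
published = [e for e in estimates if e > 1.96 * se]

print(f"true effect:         {true_effect:.2f}")
print(f"mean of all studies: {statistics.mean(estimates):.3f}")   # near 0.10
print(f"mean of published:   {statistics.mean(published):.3f}")   # much larger
```

The published mean comes out several times larger than the true effect, because only the studies whose noise happened to inflate the estimate make it through the filter — and replications, drawn from the full distribution, then look like a "decline."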
A third cause for the problem, not mentioned by Paulos but addressed in the New Yorker article, is the selective reporting of results by scientists themselves. This is essentially the same phenomenon as the publication bias, except that this time it is scientists themselves, not editors and reviewers, who don’t bother to submit for publication results that are either negative or not strongly conclusive. Again, the outcome is that what we see in the literature isn’t all the science that we ought to see. And it’s no good to argue that it is the “best” science, because the quality of scientific research is measured by the appropriateness of the experimental protocols (including the use of large samples) and of the data analyses — not by whether the results happen to confirm the scientist’s favorite theory.
The conclusion of all this is not, of course, that we should throw the baby (science) out with the bath water (bad or unreliable results). But scientists should also be under no illusion that these are rare anomalies that do not affect scientific research at large. Too much emphasis is being put on the “publish or perish” culture of modern academia, with the result that graduate students are explicitly instructed to go for the SPU’s — Smallest Publishable Units — when they have to decide how much of their work to submit to a journal. That way they maximize the number of their publications, which maximizes the chances of landing a postdoc position, and then a tenure track one, and then of getting grants funded, and finally of getting tenure. The result is that, according to statistics published by Nature, about one third of published studies are never cited (not to mention replicated!).
John Platt, in a classical article published in Science in 1964 (the year I was born), famously wrote: “Scientists these days tend to keep up the polite fiction that all science is equal. Except for the work of the misguided opponent whose arguments we happen to be refuting at the time, we speak as though every scientist’s field and methods of study are as good as every other scientist’s, and perhaps a little better. This keeps us all cordial when it comes to recommending each other for government grants. ... We speak piously of taking measurements and making small studies that will ‘add another brick to the temple of science.’ Most such bricks lie around the brickyard.”
Most damning of all, however, is the potential effect that all of this may have on science’s already dubious reputation with the general public (think evolution-creation, vaccine-autism, or climate change). This sentiment was expressed cogently by Ioannidis, again in the Atlantic article: “If we don’t tell the public about these problems, then we’re no better than non-scientists who falsely claim they can heal. If the drugs don’t work and we’re not sure how to treat something, why should we claim differently? Some fear that there may be less funding because we stop claiming we can prove we have miraculous treatments. But if we can’t really provide those miracles, how long will we be able to fool the public anyway? The scientific enterprise is probably the most fantastic achievement in human history, but that doesn’t mean we have a right to overstate what we’re accomplishing.” Good questions, and good point.


  1. But is any of this new for science? Perhaps science has operated this way all along, full of fits and starts, mostly duds. How do we know that this isn't the optimal way for science to operate?

    My issues are with the understanding of science that high school graduates have, and with the reporting of science.

    Students only learn about the science that stuck. They don't learn -- or understand -- that the majority of science doesn't stick. They see science as a body of facts and not as a messy process. Leaving school, they see all these reports contradicting each other and conclude that science today is lousy. They learn to pick their truths.

    Reporters mean to titillate. They compete for audience. Reporting science accurately, qualifying everything, would undermine their primary objective of stimulating. They reinforce misunderstandings of science, and by providing unqualified contrary reports, take advantage of the public's lack of understanding to further erode their trust in science.

  2. Besides natural issues such as regression to the mean, other more substantive processes may be happening.
The various currents of "natural epistemology" converge to the idea that in a certain environment (free debate, criticism, competition of differing views, replication, and so on) scientific research will produce growth of knowledge. The mechanism may be phrased in sociological, natural selection, or economic terminology, but the general idea is that such a setting would generate healthy science.

    What seems to have emerged in recent decades is a change in the institutional setting that had kept science advancing spectacularly since the establishment of the Royal Society. Flaws in the system such as corporate-funded research, pal-review instead of peer-review, publication bias, science entangled with policy advocacy, and suchlike, may be distorting the environment, making it less suitable for the production of good science, especially in some fields.

    Remedies should exist, but they should evolve rather than being imposed on a reluctant sociological-economic science establishment driven by powerful motives such as professional advance or funding. After all, who or what would have the authority to impose those rules, other than the scientific establishment itself?

    These are just random thoughts. I do not have an answer to these questions.

  3. Science (at least in abstract) is not seen as dubious by the American public: http://pewresearch.org/pubs/1276/science-survey

  4. @ Massimo

    I had the same thought as Michael Madson, and found the same survey results from 2009.

    In fact:

    "scientists are very highly rated compared with members of other professions: Only members of the military and teachers are more likely to be viewed as contributing a lot to society's well-being."

    I'd be interested in hearing more about how your opinion of the way scientists are viewed interacts with the survey data.

  5. JJE, I've seen the survey data, but this seems like a clear case of disconnect between the abstract and the actual. Generic scientists are highly rated, but the public keeps behaving in a way contrary to scientific advice. The latter, unfortunately, is far more important than a generic endorsement of the profession.

  6. This comment has been removed by the author.

  7. @ Massimo

    I tend to agree. People tend to resist conclusions that run contrary to their beliefs. As a result, in specific cases, scientists are distrusted regardless of their professional standards of behavior. And in others, they are admired without skepticism. (Few people would distrust a chemist or a crystallographer while many distrust cosmologists and evolutionary biologists.) I think the toxic environment is an issue of the people evaluating scientists, not the scientists.

    Which is not to say (as Ioannidis points out) that science doesn't have massive areas for improvement. It certainly does. But I differ quite a bit on interpretation. In fact, I don't see these so-called problems as flaws at all. I see them as problems that remain to be solved in the honing of science as a tool. Scientific practice has generally improved as our understanding in various fields has matured, including instrumentation, mathematics, statistics, electronics, etc. Ioannidis, coming from a statistical perspective, is using a very mature perspective to criticize results on a very high level. And he should! But without the vast improvement of statistics in the last century, we likely wouldn't even be making the observations that are being criticized to begin with, much less have the tools to call into question those very same observations.

    With each new vista in science, new challenges arrive, not just on the scientific level, but on the social and philosophical levels. And they can be big and can stand in the way of further progress. But is this a reason to feel bad for science? I don't feel that way. It just sounds like the scientific tool needs even more sharpening. Yay for science.

  8. JJE, I agree. As I take it, the problem isn't (necessarily) with science, but with the overinflated claims that some scientists (and certainly a lot of science journalists) make on behalf of science. That isn't going to help public understanding of science, and eventually may dramatically undermine public trust.

  9. I think that while the claims are not new, many scientists are not fully aware of all these biases, or do not take them into consideration. One way to help would be to have them take at least one class on the philosophy of science as part of their training.

    One other problem is the culture of science. As mentioned, scientists are pushed to publish almost anything. On one hand, a lot of garbage is published, never cited or replicated; but if we start telling people what to research, aren't we limiting the possibility of a new, very innovative breakthrough once in a while?

  10. As someone who, as a student, struggled at science (and who sometimes barely passed the course), I did not truly come to appreciate the subject until well into my adulthood - after having experienced just how unreliable and/or implausible "alternative" (read: non-science-based) claims to knowledge can be. Still, I understand that it's a flawed human endeavor, like any other.

    I'm tempted to adapt a famous Churchill quote here: Science is the worst knowledge-seeking enterprise (or whatever-you-want-to-call-it) -- except for all of the others.

  11. Massimo says, and I fully agree, that "the problem isn't (necessarily) with science, but with the overinflated claims that some scientists (and certainly a lot of science journalists) make on behalf of science."

    These overinflated claims are often associated with scientific conclusions transmitted in a summary way, as if no uncertainty exists. Also, some propositions announced as "scientific conclusions" are sometimes not the fruit of observation and experimentation, but just the logical or mechanical outcome of some theoretical or computerized model, not yet corroborated empirically. In yet other instances, enormous potential implications are mentioned for a modest study of a few cases, probably not yet confirmed by replication.

  12. Voltaire's quote comes to mind: The perfect is the enemy of the good.

    While the news borne by Ioannidis, Paulos, Lehrer and Freedman is unsettling, the fact that the information is being sincerely engaged with and discussed testifies to the existence of intellectual honesty in the science community. When otherwise intelligent people avoid uncomfortable truths, fudge test results, exaggerate claims and cherry pick data, what separates scientists from pseudoscientists is this: scientists are prepared to admit of systemic flaws and subsequently correct them to the best of their abilities. Quacks and peddlers of woo simply cry persecution when they're called out.

    Bad Science author and doughty quack-buster Ben Goldacre argues that journalists should have a basic grasp of how science works if they are to avoid the sensationalism trap. Actually, when it comes to the often negative role the media has played in the public perception of science, Goldacre has a lot to say about that. I address some of his arguments in this blog post.

  13. Science is not a fountain of 'Truth' with a capital 'T'; science is instead the source of the best (and most recent) guess, and the alternative is something less than the best guess. I don't really expect more of it than that. And when you make that 'best guess' into entertainment and the grease for a tenure track position then the best guess gets a little worse. No surprise there.

    As a frequent reader of science news I have pretty much learned to tune out the word 'breakthrough'. That wolf has been cried a little too much. I am still waiting (but not holding my breath) on the jetpacks, flying cars and electricity too cheap to meter.

  14. I wonder how much of the decline effect might be explained by the hypothesis that scientists in "hot" new fields are sloppier than the more stolid sorts who try to reproduce things.

    I have certainly been the victim of the decline effect in my own work. But on that occasion, we were never satisfied enough to publish, and the result we saw eventually went away before we did. I am not convinced that all of my colleagues would have done the same thing. I recently read a paper with a positive result for the exact same relationship as we (and others) had previously tested as an unpublishable negative some time ago. When I scratched their methods, there was insufficient detail described, but just enough to tell that they seemed to have controlled for no confounding factors, which was certainly a bad start.

  15. As a vegan commentator here, I have to draw attention to the millions of tiny lives and huge suffering involved in the "small studies that will ‘add another brick to the temple of science.’", of which "a very small percentage (of medical research) is ever likely to lead to major improvements in clinical outcomes and quality of life". I would maintain that the real issue is whether "doing evil that good may come", but this sort of thing does lend great force to those who protest against animal testing, vivisection and so on as actually not very useful - or at very least, not proportionately useful to the horrors involved.

  16. I agree with Massimo when he says that "the problem isn't (necessarily) with science, but with the overinflated claims that some scientists (and certainly a lot of science journalists) make on behalf of science. That isn't going to help public understanding of science, and eventually may dramatically undermine public trust". But whilst overinflated claims may not be inevitable in science practised in the abstract, I think they are pretty much unavoidable when competition for funding is structured into the practice of modern science. Scientists launching big projects seem to inevitably talk of “opening the book of life” or “delving into the secrets of the universe” - sweeping statements which seem bound to disappoint.

  17. "I agree with Massimo when he says that "the problem isn't (necessarily) with science,"

    heh, the problem isn't with socialism, it's with the subjects... :)

  18. I realize I'm very late to this discussion, but I've had a question since your podcast with Dr. Novella. I'd love it if you could discuss it in a future podcast or blog entry.

    You quote:

    “80 percent of non-randomized studies (by far the most common type) turn out to be wrong, as do 25 percent of supposedly gold-standard randomized trials, and as much as 10 percent of the platinum-standard large randomized trials.”

    This may be a naive question, but ... how do these studies "turn out to be wrong"?

    More specifically, what method is used to figure out what's right so that we can compare? Isn't the right conclusion just the result of these (and other) studies?

    Isn't that quite tautological?

    If science determines what we eventually decide is right, isn't this an argument in favor of science rather than against?

    I understand that, by definition, we never have an absolute "right," and we're always just approximating it as best we can. But that still doesn't make me any less uncomfortable with the idea of criticizing a method by comparing it to further results gleaned by that same method. This seems like a trick.

  19. dwayne, the studies are wrong in the simple statistical sense that they claim significant results when in fact those are not backed by the available data.

  20. Oh, so the studies are not later shown to be wrong; they can be shown as wrong -- or, at least, unsupported -- even at the time they're published?

    OK, that makes more sense. They're just overhyped.

    I see your reply above basically said this. Thanks.

