Machinations


The Decline Effect
March 3, 2011, 11:15 pm
Filed under: Uncategorized

In the most recent issue of Nature there is a short piece on the decline effect, which was discussed in a longer article in the New Yorker a few months back [1].  Succinctly, the decline effect is the phenomenon that the empirical support for certain scientific hypotheses declines over time.  The New Yorker article gives several examples of this effect.  One important example is the decline in empirical support for the effectiveness of certain medical drugs.  In particular, a study mentioned in the article shows that the demonstrated effectiveness of anti-depressants has decreased by as much as three-fold in recent decades.  A more frivolous example is the decline in empirical support for E.S.P. – this peaked in the 1930’s, with several research papers showing empirical support, but declined markedly in the next decade [2].

Several explanations have been proposed for the effect, and there are even scientists who are now doing research specifically on the decline effect!  To me, the most reasonable explanation proposed consists of two parts: 1) scientists like hypotheses that are surprising but true; and 2) the criterion for the “truth” of a hypothesis (in many publication venues) is empirical support with 95% confidence.  The second fact suggests that, when empirically testing a hypothesis that, a priori, has a 50% chance of being true, there is a 2.5% chance of getting a false positive (a 50% chance that the hypothesis is false, times a 5% chance that it clears the threshold anyway).  The first fact suggests that many hypotheses that are tested have an a priori chance of being true that is not much more than 50%.  After all, if a hypothesis has, a priori, close to a 100% chance of being true, then it’s not very surprising, is it?  Thus, these two facts together suggest there will be a non-negligible fraction of hypotheses that scientists investigate for which empirical testing will give false positives.
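
To make the arithmetic concrete, here is a minimal sketch in Python using the numbers above (a 50% a priori chance that the hypothesis is true and a 5% significance level); these are the illustrative figures from this paragraph, not estimates from any actual study.

```python
# Sketch of the false-positive arithmetic from the paragraph above.
prior_true = 0.5   # a priori chance the hypothesis is true
alpha = 0.05       # chance of clearing the 95% confidence bar when it is false

# Fraction of all tested hypotheses that yield a false positive:
false_positive_fraction = (1 - prior_true) * alpha
print(false_positive_fraction)  # 0.025 -- the 2.5% figure above
```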

One solution for the decline effect, proposed in both articles, is the creation of a kind of “Journal of Negative Results”, which would publish empirical results that contradict certain hypotheses.  I heard this idea bandied about a lot during my grad student days, but I don’t think it will ever go very far.  Where’s the glory in finding facts that are boring but true?  Of course, there’s also the possibility of either 1) raising the 95% empirical support threshold necessary for publication; or 2) requiring empirical verification by many independent studies before allowing publication.  Both of these ideas seem reasonable, but could slow down the scientific process.

Another idea (that was not proposed in either of the articles) is to raise the a priori support for a hypothesis.  In Computer Science, we have a mathematical methodology that frequently allows us to prove at least part of the hypothesis we want to establish.  I think the decline effect is a nice justification for including at least some theoretical analysis in any paper in Computer Science.  Sometimes experiments can let us reach further than a purely theoretical study would.  However, it seems important, in almost all cases, to provide at least some theoretical support for a hypothesis.

[1] The New Yorker rarely publishes articles on science, but when they do, the articles generally seem to be among the very best science writing out there.

[2] Believe it or not, according to the New York Times, a paper in “one of psychology’s most respected journals” recently claimed support for the existence of a certain type of ESP in college students.  It remains to be seen whether or not there will be a decline effect in the support for this new hypothesis (hint: if I were a betting man, I wouldn’t bet on the hypothesis!)


6 Comments so far

Part of the problem with the “95% confidence” is that it means P(data | null hypothesis) < .05, not that P(hypothesis | data) > .95.

Comment by Dave B

In many ways we are in a much worse situation in computer science regarding experiments. In other fields there are reasonably well-established notions of what constitutes a good sample for one’s statistics. Even though a large majority of health studies have very substantial selection biases and samples that are too small, everyone knows what the ideal is: something like the Framingham study on heart disease or the recent large-scale study that showed the risks of estrogen supplementation. In physics the statistics predicted by quantum theory are clear and well-defined and the goal of careful experimentation is to eliminate alternative explanations. In computer science, since many of the inputs are themselves based on human artifacts we do not seem to have a good notion of the ideal. Except for some artificial distributions that we do not have any particular reason to believe are relevant, we rely on suites of benchmarks over which there is no natural probability distribution. Without these we can’t associate confidence levels like 95% and we can’t put error bars on our experimental claims (unless they concern the runtime of our randomized algorithms on specific instances). Moreover, even asymptotically, the input spaces are so heterogeneous that we must compare algorithmic solutions that do not uniformly dominate each other and for which any optimality could only be in the Pareto sense.
It is no wonder that we are reduced to trying to provide proofs of claims.

Comment by Paul Beame

@Paul: This is a good point about the lack of good samples. Another unique issue with experiments in CS, I think, is that there is often a greater possibility for error in running experiments, because of either 1) programming error, or 2) “algorithmic” error. One example of the latter is the use of traceroutes to measure the degree distribution of the Internet. Several early experiments using traceroute sampling suggested that the Internet’s degree distribution follows a power law. But a STOC paper a few years ago (“On the Bias of Traceroute Sampling”), by Achlioptas et al., shows that traceroute sampling gives a power-law distribution even when the underlying degree distribution of the graph is Poisson! In the initial experiments, a mistake was made in devising the right algorithm to perform the measurements, and this invalidated the empirical results. In CS, I think it is often pretty challenging to code up a completely correct algorithm to run the experiment.

@Dave: Agreed. Unfortunately, it seems difficult to incentivize raising the 95% confidence threshold, since it is pretty firmly entrenched in many disciplines.

Comment by Jared
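
As a toy illustration of the traceroute-sampling bias Jared describes above, here is a rough sketch in Python. It is not the model analyzed in the Achlioptas et al. paper: the graph size, the mean degree, and the use of a single-source BFS tree as a stand-in for traceroute paths are all simplifying assumptions, and it requires the networkx library.

```python
# Toy sketch: sample an Erdos-Renyi graph (true degrees roughly Poisson)
# by keeping only the edges of a shortest-path (BFS) tree from one source,
# a crude stand-in for what single-source traceroute measurements observe.
import collections
import networkx as nx

n, avg_deg = 10000, 10
G = nx.gnp_random_graph(n, avg_deg / n, seed=0)

observed = nx.bfs_tree(G, 0).to_undirected()  # one shortest path per reachable node

def degree_histogram(graph):
    counts = collections.Counter(d for _, d in graph.degree())
    return dict(sorted(counts.items()))

print("true degrees:    ", degree_histogram(G))         # concentrated near avg_deg
print("observed degrees:", degree_histogram(observed))  # mostly degree-1 nodes
```

With these assumed parameters, the true histogram clusters tightly around the mean degree, while the observed one is dominated by degree-1 nodes and falls off gradually, looking much more like a power law than a Poisson distribution.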

It seems that part of my previous comment got garbled. What I meant to say is that 95% confidence is found by showing that the probability of the data is low given the null hypothesis. This is then presented as if it means that the probability of the hypothesis being tested is high given the data. This isn’t a problem that can be fixed just by tightening the threshold.

Comment by Dave B
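
As a rough numerical illustration of Dave B’s point: the quantity controlled by the 95% threshold is P(significant result | null), which is not the same as P(hypothesis | significant result). The prior and the statistical power used below are assumed values for illustration, not figures from any study.

```python
# Why P(data | null) < 0.05 does not imply P(hypothesis | data) > 0.95.
# alpha and power are assumed, illustrative values.
alpha = 0.05   # P(significant result | null is true)
power = 0.50   # P(significant result | tested hypothesis is true)

def prob_null_given_significant(prior_h1):
    """Bayes' rule: chance the null is true despite a significant result."""
    p_significant = prior_h1 * power + (1 - prior_h1) * alpha
    return (1 - prior_h1) * alpha / p_significant

for prior_h1 in (0.5, 0.1):
    print(prior_h1, round(prob_null_given_significant(prior_h1), 2))
# 0.5 -> 0.09: with an even prior, ~9% of "significant" findings are false
# 0.1 -> 0.47: for a surprising hypothesis, nearly half of them are false
```

Tightening the threshold shrinks these numbers, but, as noted above, it does not change the fact that the p-value measures the probability of the data given the null, not the probability of the hypothesis given the data.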

Dave,

Now I think I understand your comment. In the ESP experiments, for example, a good null hypothesis would be “humans can’t guess better than chance”. A bad null hypothesis is “ESP doesn’t exist”. The negation of the first null hypothesis allows for many possibilities: people pick up cues from the experimenters, people learn patterns in the computer’s random number generator, etc. Picking the wrong null hypothesis is clearly a problem that can occur even with a high confidence threshold.

Comment by Jared

Jared,

I think the problem is worse than that. From what I’ve read of the ESP studies, the strongest result was a ~52% success rate on what should be a 50/50 guess. Even assuming that they got the null hypothesis right, this data seems unlikely given ESP as a real phenomenon. Why would people who are able to see into or change the future be so remarkably poor at predicting it?

If you hypothesize that ESP gives a 60% chance of guessing right, then even taking the results at face value, the null hypothesis is better supported. The results only support ESP under a model in which ESP would give something close to a 52% success rate.

The problem with this work is that what they were essentially testing was “on some of these tests, the data will differ from chance in some way.” At best, this was exploratory work to generate hypotheses for future testing.

Of course, all this still leaves us with problems of experimental error and poor design as you rightly pointed out.

Comment by Dave B
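
A quick sketch of the likelihood comparison Dave B describes above; the number of trials is an arbitrary assumption, and only the 50% / ~52% / 60% figures come from the comments.

```python
# Does an observed ~52% hit rate support "ESP gives a 60% hit rate"
# over plain 50/50 chance?  The trial count n is an assumed value.
from math import comb

n = 1000                  # assumed number of guesses
k = round(0.52 * n)       # ~52% correct, as in the comment

def binom_likelihood(p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

ratio = binom_likelihood(0.50) / binom_likelihood(0.60)
print(f"chance is favored over the 60% ESP model by a factor of {ratio:.1e}")
# ~2e+05: taking the data at face value, plain chance fits it far better
# than an ESP effect of the size one might have hypothesized in advance.
```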



