Statistics and coincidence: When good science loses its way

Odd data pairings - such as the idea that eating cheese before bed can cause nightmares - compiled by Tyler Vigen show that correlations are a good starting point for more sophisticated analysis. (Thinkstock)

You've heard the advice before: Go easy on the cheese before bedtime to avoid bad dreams.

What you may not know is that if you compare U.S. Department of Agriculture data on per capita cheese consumption since 2000 with the number of people who die each year from getting tangled in their bedsheets (more than 800 in 2008, according to the Centers for Disease Control and Prevention), you get an almost perfect match.

Apparently the more mozzarella we scarf, the more people meet this ignoble end. The correlation between the two data sets is 95 per cent, which indicates that they rise and fall in near-perfect sync.

The cheese and bedsheet-death link is one of tens of thousands of such pairings churned out by an algorithm programmed earlier this month by Tyler Vigen, a Harvard law student – and the point is that correlation, a statistical measure that assesses how closely two data sets match, isn't always what it seems.
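For readers who want to see what that measure looks like in practice, here is a minimal sketch of the standard Pearson correlation coefficient, the usual way a figure like "95 per cent" (a coefficient of 0.95) is computed. The article does not say which formula Vigen's program uses, so treat that as an assumption; the two short series below are made-up illustrative numbers, not the USDA or CDC data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance, up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical yearly values, invented for illustration only.
series_a = [29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1]
series_b = [330, 400, 410, 430, 520, 510, 600, 660]

r = pearson_r(series_a, series_b)
print(f"r = {r:.2f}")
```

A value near 1 means the two series rise and fall together, a value near -1 means one rises as the other falls, and a value near 0 means no linear relationship. Nothing in the calculation knows or cares whether the two series have anything to do with each other.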

Vigen says his idea was sparked by an image showing the surprisingly good match between the crime rate in New York and a photo of a mountain range.

He decided to look for other random correlations by uploading vast swaths of freely available data from places such as U.S. government websites and searching for matching sets.
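The reason such a search turns up striking matches can be sketched with pure noise. The example below is not Vigen's code or data: it assumes 2,000 hypothetical 10-year series, each a random walk generated by the computer, and simply picks the one that best matches an equally random target.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def random_walk(rng, length):
    """A series that drifts up or down by a random step each year."""
    level, walk = 0.0, []
    for _ in range(length):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk

rng = random.Random(42)   # fixed seed so the run is repeatable
N_YEARS = 10              # assumed length of each series
N_CANDIDATES = 2000       # assumed size of the data pile being searched

target = random_walk(rng, N_YEARS)
candidates = [random_walk(rng, N_YEARS) for _ in range(N_CANDIDATES)]

# Score every candidate against the target and keep the best match.
scores = [pearson_r(target, c) for c in candidates]
best_r = max(scores)
typical = sorted(abs(s) for s in scores)[len(scores) // 2]  # median |r|

print(f"best match out of {N_CANDIDATES}: r = {best_r:.2f}")
print(f"typical (median) |r|: {typical:.2f}")
```

Every series here is unrelated to every other by construction, yet the best match still looks impressive. Search enough pairs and a near-perfect correlation is all but guaranteed to turn up somewhere.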

These deliberately odd results, displayed on his website, are mostly ludicrous enough that no one would be tempted to believe that they're anything but coincidental.

As egg consumption rises and falls, for example, so too does the number of non-collision road deaths. And the number of non-commercial space launches around the world seems to depend on the number of sociology doctorates awarded in the United States.

In the real world, misleading correlations are much harder to spot, because they show up in situations that actually make sense.

For example, a 2009 Archives of Internal Medicine study of 500,000 people found a correlation between meat consumption and death during the 10-year study period, even when possible confounding factors like age, education, and exercise habits were taken into account.

This is a plausible finding – but if you look more closely at the study results, you find that red meat consumption also seemingly raises your risk of sudden accidental death from causes like car crashes and gunshots. This is clearly ridiculous, and indicates that there are other underlying lifestyle factors that affect both meat consumption and mortality.

Another place where suggestive correlations often show up is in attempts to explain the rise in obesity rates over the past half-century.

Fat and carbohydrate consumption have both risen in lockstep with obesity rates – but then again, so have protein and total calorie consumption, along with countless other factors like the processing speed of computers. The mere fact that two variables have both increased over time isn't enough to draw any conclusions.
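The shared-trend problem is easy to reproduce. The sketch below builds two synthetic 50-year series that have nothing in common except that both climb over time (the names, slopes and noise levels are invented for illustration), then compares the correlation of the raw levels with the correlation of the year-to-year changes, one common way to take the trend out.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def trending_series(rng, years, slope, noise):
    """Steady upward trend plus independent random noise each year."""
    return [slope * t + rng.gauss(0, noise) for t in range(years)]

def yearly_changes(series):
    """Year-over-year differences, which strip out a steady trend."""
    return [b - a for a, b in zip(series, series[1:])]

rng = random.Random(7)
YEARS = 50

# Two synthetic, unrelated series that both happen to rise over time.
series_one = trending_series(rng, YEARS, slope=1.0, noise=2.0)
series_two = trending_series(rng, YEARS, slope=3.0, noise=6.0)

r_levels = pearson_r(series_one, series_two)
r_changes = pearson_r(yearly_changes(series_one), yearly_changes(series_two))

print(f"correlation of raw levels:     {r_levels:.2f}")
print(f"correlation of yearly changes: {r_changes:.2f}")
```

The raw levels correlate strongly because both lines go up. Once the trend is removed, what remains is independent noise, and the apparent link largely disappears.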

So is looking for correlations a blatant misuse of statistics that should be disregarded entirely? Not so fast, Vigen says: "Correlations are an important starting place because they can influence the way we research."

In other words, they offer a useful starting point to generate hypotheses or test ideas. Vigen cites Randall Munroe, the author of the popular science Web-comic xkcd, who says: "Correlation does not imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing 'look over there.'"

Once you know where to look, there are several ways to bolster a case built initially on correlation.

Statisticians have more sophisticated techniques for assessing the likelihood of coincidence when two variables show an apparent match: for example, looking at how often the two lines cross and recross each other on a graph. Coming up with a reasonable explanation for why and how the variables influence each other is also important.

Still, the best way to confirm a causal relationship between two variables is to change one of them and see how the other responds.
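A small simulation shows why intervening works. In this toy model, which is an assumption for illustration and not drawn from any study, a hidden lifestyle factor drives both a habit and an outcome, while the habit itself has no effect on the outcome at all. Observed in the wild, the two are clearly correlated; when the habit is assigned at random, as in an experiment, the link vanishes.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def outcome(lifestyle, rng):
    """Outcome depends only on the hidden lifestyle factor, never on the habit."""
    return 2.0 * lifestyle + rng.gauss(0, 1)

rng = random.Random(0)
N = 5000  # hypothetical number of participants

# Observational data: people choose their own habit, and the hidden
# lifestyle factor nudges that choice.
lifestyle_obs = [rng.gauss(0, 1) for _ in range(N)]
habit_obs = [z + rng.gauss(0, 1) for z in lifestyle_obs]
result_obs = [outcome(z, rng) for z in lifestyle_obs]
r_observed = pearson_r(habit_obs, result_obs)

# Experiment: the researcher sets the habit at random, so it can no
# longer track lifestyle. Same spread as the observed habit (variance 2).
lifestyle_exp = [rng.gauss(0, 1) for _ in range(N)]
habit_exp = [rng.gauss(0, math.sqrt(2)) for _ in range(N)]
result_exp = [outcome(z, rng) for z in lifestyle_exp]
r_experiment = pearson_r(habit_exp, result_exp)

print(f"observed correlation:     {r_observed:.2f}")
print(f"experimental correlation: {r_experiment:.2f}")
```

Random assignment cuts the tie between the habit and everything else about a person, so any correlation that survives has to come from the habit itself. Here none does, because none was built in.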

Knowing that the link between cheese and bedsheet deaths was semi-randomly generated by a computer should make you much less likely to believe it. But if anyone wants to run a study on dream patterns following pre-bedtime cheese consumption, I volunteer to be in the lots-of-cheese group.

Alex Hutchinson blogs about exercise research at sweatscience.runnersworld.com.