The ‘birthday paradox’ and why polls can’t tell us everything

The Statistics Canada offices in Ottawa are seen on Tuesday, May 1, 2013. (Sean Kilpatrick/The Canadian Press)

The pervasive tone of dissatisfaction among many of Canada's economists with last week's National Household Survey rubbed some readers the wrong way.

They wondered why we need to bother with such incredibly expensive probing of our national populace anyway. Hasn't the science of polling proven that we can take a relatively small sample of Canadians at a relatively small cost, and get very close to the same answers? If polls are good enough for gauging the likely outcomes of something as important as an election, shouldn't they be good enough for, say, telling us how much an average Canadian earns in a year?

These are good questions after the government has just spent a reported $650-million on a massive poll that covered nearly one-third of all Canadian households, and produced a massive amount of data that Statistics Canada considers incredibly valuable – but that many economists still don't think is good enough. Maybe, just maybe, economists are a bunch of ivory-tower grumblers who won't accept a good thing when they see it and aren't spending enough time in the real world.

Or maybe they are well aware that polls, as much as we have come to rely on them, are very fallible. (Yes, even really big ones.)

Relatively small-sample polling works pretty well on simple questions, but when you seek a lot of increasingly complex and nuanced information (like, say, that contained in the old long-form census, or the NHS), its reliability starts to break down. There are also well-understood biases surrounding who is willing to respond versus who isn't; for example, the top and bottom of the country's socioeconomic spectrum are routinely under-represented in voluntary surveys.

And that value in predicting an election? Lately, not so much. In the Alberta and British Columbia provincial elections of the past two years, the polls famously diverged from the actual results by massive, head-scratching, what-the-heck-were-the-pollsters-drinking margins.

And those election polls – which ask a pretty straightforward question – have errors typically of plus-or-minus 2 to 5 percentage points (meaning a range of 4 to 10 percentage points), usually 19 times out of 20. Which means that once out of every 20 polls, the results are somewhere outside that range. Where? We have no idea. Statistically speaking, one out of every 20 political polls is just wrong.

Now consider the margin of error in one of Statistics Canada's best-known and closely followed "polls" – its monthly Labour Force Survey, where the country's headlines on job creation and unemployment come from.

Statscan reported last week that Canada had added 59,200 jobs in August – but it also identified its "standard" margin of error as 28,900. What that means is that the job-gain figure is plus-or-minus 28,900, TWO TIMES OUT OF THREE. Which means one in three times, the survey is outside this sizable range. You want to ramp that up to 19 times out of 20, like in an election poll? Then you DOUBLE the standard error. Which means that when you see "Canadian jobs rose 59,200 in August," what the polling error is actually telling us is that there's a 95-per-cent likelihood that Canada gained somewhere between 1,400 and 117,000 jobs last month. And there's a 5-per-cent chance that it was some other completely different number outside that range – perhaps even a big job loss.
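For readers who want to check that arithmetic, here is a minimal sketch in Python. It uses only the two figures quoted above; the function name `interval` and the multipliers of 1 and 2 are illustrative assumptions (a multiplier of 2 is the usual rounding of 1.96 for a roughly normal sampling error), not Statscan's published method.

```python
# Figures quoted in the column.
estimate = 59_200        # reported August job gain
standard_error = 28_900  # Statscan's "standard" margin of error


def interval(estimate, standard_error, multiplier):
    """Return (low, high) for estimate plus-or-minus multiplier * standard_error.

    Hypothetical helper for illustration only.
    """
    half_width = multiplier * standard_error
    return estimate - half_width, estimate + half_width


# One standard error: roughly two times out of three.
two_in_three = interval(estimate, standard_error, 1)

# Two standard errors: roughly 19 times out of 20.
nineteen_in_twenty = interval(estimate, standard_error, 2)

print(two_in_three)        # (30300, 88100)
print(nineteen_in_twenty)  # (1400, 117000)
```

The wider range is the one that matters for the headline: at the 19-times-out-of-20 standard, the low end sits barely above zero.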

This is why we bother with a national census, or the National Household Survey. Every few years, it's important that policy makers get as full and complete a set of data on the country as they possibly can; they don't want to be making long-term, multi-billion-dollar government program decisions based on potentially flawed, skewed, inaccurate or flat-out wrong data.

Of course, Statscan also does many smaller samplings to gauge the state of many elements of Canadian society and the economy as we go along, and without doubt, they are useful and relatively cost-effective. But small samples are imperfect.

As an illustration, I present a statistical argument that one reader raised as evidence that we don't need big polls: It's known as "The Birthday Paradox."

You hire a pollster to start calling people in search of two people with the same birthday. You might think it would take, on average, 366 calls to hit on a match (that's the total number of possible birthdays, and we figure on a pretty even distribution). However, we discover that statistically, there's a 50-per-cent chance that the pollster will find two people with the same birthday after just 23 calls; once he has talked with 57 people, the probability of a match goes up to 99 per cent. So, it turns out, it's a complete waste of time and resources to talk with 366 people. (There are some quirky statistical reasons why this finding is misleading – after all, it's not called a "paradox" for nothing – but I'm assured that it does work.)
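The 23 and 57 figures can be checked with a few lines of Python. This is a sketch under simplifying assumptions that are mine, not the reader's: 365 equally likely birthdays, no leap day, and no twins. The function name `prob_shared_birthday` is made up for the example.

```python
def prob_shared_birthday(people, days=365):
    """Chance that at least two of `people` share a birthday.

    Assumes every one of `days` birthdays is equally likely.
    """
    # Work out the chance that everyone's birthday is different,
    # one person at a time, then take the complement.
    prob_all_different = 1.0
    for already_called in range(people):
        prob_all_different *= (days - already_called) / days
    return 1.0 - prob_all_different


for calls in (23, 57):
    print(calls, round(prob_shared_birthday(calls), 3))
```

Under those assumptions the chance of a match comes out at just over 50 per cent for 23 people and about 99 per cent for 57.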

So my pollster stops making calls after 57 people (or fewer), since he almost certainly found a match, and saved me a tonne of money. But what do my statistics suggest? That 3.5 per cent (or more) of Canadians have the same birthday. Intuitively, I know that's wrong; simple logic suggests that it's more like 0.3 per cent (i.e., one out of 365). Yet the data from my small-sample poll tells me something wildly different.

That's not so useful, is it?
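For the record, the two percentages above come from very simple division. A short sketch, with variable names chosen only for illustration and the same 365-day assumption as before:

```python
calls = 57           # people the pollster reached before stopping
matched_people = 2   # the pair who turned out to share a birthday

# What the small poll appears to show.
sample_share = matched_people / calls

# What simple logic suggests: the chance that any one person
# has a particular birthday, assuming 365 equally likely days.
expected_share = 1 / 365

print(f"{sample_share:.1%}")    # 3.5%
print(f"{expected_share:.1%}")  # 0.3%
```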
