How to Read a Poll: Why "Margin of Error" Probably Doesn't Mean What You Think
Tue Jul 01, 2008 at 05:19:01 PM PDT
I'm a social scientist. I study all kinds of things and use lots of different methods in my research, but mostly I make my living analyzing large amounts of data collected through surveys. By necessity, that means I'm pretty good at using advanced statistical techniques to figure out what large populations think and how they behave.
A political poll is essentially nothing more than a public opinion survey. Lots of people think conducting a survey is easy. Lots of people are mistaken.
Below the jump, I'll explain what I mean, and I'll also explain (without resorting to the technical jargon characteristic of my field) why margin of error is important, how it works, and why so many people misinterpret it.
So you want to conduct a survey.
There are those who would tell you that all you have to do is walk up to any ten, one hundred, or one thousand people, ask them your questions, record their answers, and report the results. When those people are my students, I call them "people who receive a failing grade on the assignment." When they're professional pollsters, I call them "incompetent" or "frauds." You see, the whole point of conducting a survey is to find out what your target population thinks about some issue, or how they behave. By extension, the whole point of conducting a political poll is to find out who would win an election if it were held today (or some variation on that question).
But suppose I ask ten people who they're going to vote for in this year's presidential election. Suppose seven tell me they're going to vote for Obama and three tell me they're going to vote for McCain. Does that mean Obama has a 70% to 30% lead?
All together now: NO!
Suppose I ask 1,000 people, and 700 say they'll vote for Obama, while 300 say they'll vote for McCain. Does that mean that Obama leads 70% to 30%?
Once again, the answer is NO! Why?
The biggest reason is that we haven't said anything about who constitutes the sample. If you want to know how the general population is going to vote, you need to survey the general population. If the 1,000 people you just surveyed are people you met at a fundraiser for Obama's campaign, Obama may actually be in a lot of trouble -- if he can't win more than 70% of the vote from people who actually contribute to his campaign, how is he going to win 50% of the vote from the nation as a whole?
So your first issue is you need a representative sample. You need your sample to have the same proportions of women, evangelicals, Jews, Midwesterners, Southerners, Republicans, Democrats, Independents, rich, poor, Hispanics, blacks, whites, purples-with-green-polka-dots, and any other demographic you can think of as you would expect to find in the general electorate. Now, you're never going to get it exactly right -- in part because you're not going to know the composition of the actual electorate until after the election, and even then it's going to be a best guess based on statistical models -- but you need to try to get as close as you can. If you really know what you're doing, you'll use a technique called "weighting" to fix your sample as close as possible to the composition of the actual population, but that's a topic for another time.
In any case, let's say you now have a representative sample, one that matches the overall electorate as closely as you can manage. Let's say that your sample consists of 1,000 people. Suppose of those 1,000 people, 500 say they're going to vote for Barack Obama and 450 say they're going to vote for the guy who wants the third term of the worst president in US history.
500/1000 = 50%
450/1000 = 45%
So Obama must lead McCain 50% to 45%, right?
Not exactly.
"But wiscmass," you say, "50% of the sample say they're voting for Obama and 45% say they're voting for McCain. Surely it must be 50% to 45%!"
Ah, but you see, this is the difference between inferential statistics and counting every vote. Yes, if your 1,000-person sample actually was the entire voting population, it would mean that Obama leads 50% to 45%. But -- and this is the key -- your sample, representative though it may be, isn't actually the entire population.
This is where "margin of error" comes into play. The margin of error is defined as the statistic measuring the random sampling error in a survey. You see, no matter how perfect your technique in composing your sample, you will never create a sample that exactly represents the population as a whole. There is always some error. The greater the error -- the greater the margin of error -- the less confident you should be in the results of the survey, because the greater the error, the less likely it is that your results will reflect the population parameter, the true figure for the population as a whole.
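To see that random sampling error in action, here is a minimal sketch in Python. It assumes an imaginary electorate in which exactly 50% support a candidate, then runs 1,000 pretend polls of 1,000 randomly chosen voters each. The function name, the 50% figure, the number of polls, and the random seed are all made up for illustration; nothing here comes from a real poll.

```python
import random

def simulate_poll_counts(true_share, sample_size, num_polls, seed=0):
    """Run num_polls pretend polls; return the supporter count from each one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = []
    for _ in range(num_polls):
        # Each respondent supports the candidate with probability true_share.
        supporters = sum(1 for _ in range(sample_size) if rng.random() < true_share)
        counts.append(supporters)
    return counts

SAMPLE_SIZE = 1000
counts = simulate_poll_counts(true_share=0.50, sample_size=SAMPLE_SIZE, num_polls=1000)
shares = [c / SAMPLE_SIZE for c in counts]

# How many polls landed within 31 respondents (3.1 points) of the true 500?
close = sum(1 for c in counts if abs(c - 500) <= 31)
share_close = close / len(counts)

print(f"lowest poll result:  {min(shares):.1%}")
print(f"highest poll result: {max(shares):.1%}")
print(f"polls within 3.1 points of the truth: {share_close:.1%}")
```

Every one of these pretend polls is a perfectly random sample of the same electorate, and they still disagree with one another. Most land within about three points of the true 50%, but a few miss by more. The exact numbers depend on the seed.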
In any survey, the margin of error is typically expressed in terms of what we call a "confidence interval." This essentially means we can be certain to a specific degree (typically 95%) that the true figure for the population as a whole is within a certain specified amount of the statistic that emerges from the survey. Let's use a recent example to illustrate.
A recent poll you may have read about here indicates that, as in our example above, Obama leads McCain 50% to 45%. The poll in question cites a margin of error of 3.5%.
"So wiscmass," you might ask, "if Obama leads by 5% and the margin of error is 3.5%, doesn't that mean that Obama's lead is outside the margin of error?"
That's a reasonable assumption, but it's incorrect. The margin of error reported with the poll does not refer to the magnitude of the lead, but rather the magnitude of each candidate's support. It means that based on the survey, Obama has the support of 50% of respondents and McCain has the support of 45% of respondents, but if you want to extrapolate from the results of the survey to the voting population as a whole, the true figure for each candidate's support could be anywhere from 3.5 percentage points below the cited figure to 3.5 percentage points above it. In other words:
- Obama's support may be as low as 50%-3.5% = 46.5%.
- Obama's support may be as high as 50%+3.5% = 53.5%.
- McCain's support may be as low as 45%-3.5% = 41.5%.
- McCain's support may be as high as 45%+3.5% = 48.5%.
So to a 95% certainty, Obama's true level of support in the electorate as a whole is between 46.5% and 53.5%, while McCain's true level of support is between 41.5% and 48.5%.
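The arithmetic above can be sketched in a few lines of Python. The `margin_of_error` function uses the standard textbook formula for a simple random sample at 95% confidence (z = 1.96). That formula is my assumption about how a margin like this is typically computed, not something the poll in question states, and both function names are made up for illustration. The poll's sample size isn't given here, so the intervals use its reported 3.5-point margin directly.

```python
import math

def margin_of_error(share, sample_size, z=1.96):
    """Textbook margin of error for a proportion from a simple random sample.

    share is a fraction (0.50 for 50%); z=1.96 corresponds to 95% confidence.
    """
    return z * math.sqrt(share * (1 - share) / sample_size)

def confidence_interval(percent, moe_points):
    """Return (low, high) in percentage points for a reported figure."""
    return (percent - moe_points, percent + moe_points)

# For comparison: a 1,000-person simple random sample at 50% support.
print(f"n=1000 margin: {margin_of_error(0.50, 1000) * 100:.1f} points")

# The poll discussed above, using its reported 3.5-point margin.
obama_low, obama_high = confidence_interval(50, 3.5)
mccain_low, mccain_high = confidence_interval(45, 3.5)
print(f"Obama:  {obama_low}% to {obama_high}%")
print(f"McCain: {mccain_low}% to {mccain_high}%")
```

Note that the formula depends on the sample size: the bigger the sample, the smaller the margin. A 1,000-person sample works out to roughly 3.1 points, so a poll reporting 3.5 points presumably had a somewhat smaller sample or a more complicated design.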
Therein lies the problem with saying that Obama has a five-point lead -- statistically speaking, this is a dead heat! If the true figure for Obama's support is at the low end of our confidence interval and the true figure for McCain's is at the high end, McCain could actually be in the lead.
This is the most important thing to understand if you want to know what a poll really means, largely because it's the one thing far too many people don't understand. If Obama's lead isn't more than double the margin of error, his lead in the poll is not statistically significant. If his lead is more than double the margin of error -- if, for instance, his lead in the poll cited above were more than 7 points -- then his lead is statistically significant and we can be 95% certain he is actually winning. Note, however, that while we can be 95% certain he is winning under such circumstances, there would still be a 5% possibility of disaster. And that possibility will always remain when we rely on inferential statistics to determine who is leading a race.
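The "more than double the margin of error" rule of thumb is easy to write down as a quick check. This is only a sketch of the rule as I've described it, with a made-up function name. It is not a formal significance test, and the 53.6/46.4 split in the second example is invented just to show a lead that clears the bar.

```python
def lead_is_significant(leader_percent, trailer_percent, moe_points):
    """Rule of thumb: the lead must exceed twice the reported margin of error."""
    lead = leader_percent - trailer_percent
    return lead > 2 * moe_points

# The poll discussed above: a 5-point lead with a 3.5-point margin.
print(lead_is_significant(50, 45, 3.5))      # lead of 5 vs. a threshold of 7

# A hypothetical poll with the same margin but a wider gap.
print(lead_is_significant(53.6, 46.4, 3.5))  # lead of about 7.2 vs. a threshold of 7
```

The first check fails and the second passes, which is just another way of saying that the two candidates' confidence intervals overlap in the first poll and don't in the second.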
Update [2008-7-1 20:52:12 by wiscmass]: At Jyrinx's suggestion (and a good one at that), I'm adding a link to FiveThirtyEight.com, where poblano (and others?) explain how all this works (and lots of other things people ought to know about polls) in very easy-to-understand, jargon-free language.