Saturday, January 7, 2012

Things I wish would die: The phrase "statistical dead heat"

You hear this phrase a lot any time there's an important election approaching. It refers to a situation in which the difference between the percentages of respondents declaring they'll vote for candidate A and candidate B is smaller than the margin of sampling error. For example, suppose there will soon be a Republican presidential primary in Florida where voters will be choosing between two candidates, Mitt Romney and Ann Coulter. A public opinion poll comes out, showing that 51% of respondents say they'll vote for Romney, and 49% of them say they'll vote for Coulter. The polling company says that the margin of sampling error in this poll is three percentage points. The media declare that Romney and Coulter are locked in a "statistical dead heat" or "statistical tie" because, given the margin of error, Romney's true vote share could be as low as 48%, and Coulter's could be as high as 52%.

But to represent this situation as a tie is highly misleading, for several reasons. I'll concentrate here on two of those reasons, to show just how misleading it can be. In what follows, I'm making a simplifying assumption that sampling error is the only source of uncertainty in my example poll. This is of course unrealistic, but completely justified, since sampling error is the only type of uncertainty that is reported by the media.

First, the size of the margin of error depends on the confidence level chosen for the particular poll. Most polls choose to report a 95% confidence interval. Suppose that's the case with our fictional Romney v. Coulter situation. What this means is that, if this poll were to be redone a large number of times, with the same sample size, then about 95% of the intervals constructed this way (the estimate plus or minus three points) would contain Romney's true vote share; for this particular poll, that interval runs from 48% to 54%. Put another way, it means that the difference between Romney's and Coulter's vote shares is not statistically significant at a 5% level (but if we chose, say, a 68% confidence interval, then the margin of error would be approximately 1.53 percentage points, smaller than the spread). So the fact that the difference between Coulter and Romney is smaller than the margin of error doesn't mean Romney and Coulter are tied; it means that Romney is most likely ahead but, if we hold ourselves to a 5% significance level, we can't say exactly by how much. Even when your point estimate is not significant at the level you chose, it is still the best guess you have.
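If you want to see how the margin of error shrinks as the confidence level drops, here is a minimal sketch in Python. It assumes a normal sampling distribution and backs the standard error out of the reported three-point margin; the function name `margin_of_error` is mine, not anything a polling company publishes.

```python
from statistics import NormalDist

def margin_of_error(standard_error, confidence):
    """Half-width of a two-sided normal confidence interval (assumed model)."""
    # Two-sided interval: put (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * standard_error

# Standard error implied by a 3-point margin at 95% confidence.
se = 3 / NormalDist().inv_cdf(0.975)

for level in (0.68, 0.90, 0.95):
    moe = margin_of_error(se, level)
    print(f"{level:.0%} confidence: +/- {moe:.2f} points")
```

The 68% line comes out at roughly one standard error, which is where the approximate 1.53 figure comes from.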

Second, when we look at polls, we don't really care about spread; we care about who's more likely to win. The reason we pay attention to spread at all is that we treat it as a proxy for the probability of winning. So let's think about this in those terms (again, assuming sampling error is the only source of uncertainty). At a 5% significance level, the margin of sampling error of a statistic is 1.96 times the standard error of that statistic (if you want to know how the standard error of a proportion is calculated, look below the fold). Thus, the sampling distribution of Romney's vote share is normal with mean 51 and standard deviation 1.53 (= 3/1.96). Below is a plot of that distribution. The ratio of the area shaded in red to the total area underneath the curve is the probability that it's actually Coulter who's ahead (i.e. it's the probability that the true percentage of voters who intend to support Romney is less than 50). That probability is about 26%. The odds of Romney being ahead of Coulter are roughly 3 to 1; that doesn't sound like a dead heat to me.
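The shaded-area calculation is easy to reproduce. This is a sketch under the same simplifying assumptions as the post (normal sampling distribution, sampling error only); the variable names are my own.

```python
from statistics import NormalDist

margin = 3.0          # reported margin of error, in percentage points
se = margin / 1.96    # implied standard error, about 1.53 points

# Sampling distribution of Romney's vote share: normal, centered on the poll result.
romney = NormalDist(mu=51.0, sigma=se)

# Probability that Romney's true share is below 50, i.e. that Coulter is ahead.
p_coulter_ahead = romney.cdf(50.0)

# Odds of Romney being ahead versus Coulter being ahead.
odds_romney = (1 - p_coulter_ahead) / p_coulter_ahead

print(f"P(Coulter ahead) = {p_coulter_ahead:.3f}")
print(f"Odds for Romney  = {odds_romney:.2f} to 1")
```

Changing `mu` or `margin` shows how quickly a small lead turns into lopsided odds.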



The standard error of a proportion p is given by

$$\mathrm{SE}(p) = \sqrt{\frac{p(1 - p)}{n}}$$

where n is sample size.
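As a quick check, here is a sketch using the usual formula for the standard error of a proportion, sqrt(p(1 - p)/n). It also works backwards to the sample size that a three-point margin would imply; the poll's real sample size is not given anywhere, so that number is an inference from the formula, and the helper names are hypothetical.

```python
from math import sqrt

def standard_error(p, n):
    """Standard error of a sample proportion p with sample size n."""
    return sqrt(p * (1 - p) / n)

def sample_size_for_margin(p, margin, z=1.96):
    """Sample size at which z * standard_error(p, n) equals the given margin."""
    return p * (1 - p) * (z / margin) ** 2

# Proportions here are fractions, so a 3-point margin is 0.03.
n_implied = sample_size_for_margin(0.51, 0.03)
se_points = 100 * standard_error(0.51, round(n_implied))

print(f"Implied sample size: about {n_implied:.0f}")
print(f"Standard error at that size: {se_points:.2f} points")
```

A sample of a bit over a thousand respondents is what a three-point margin at 95% confidence corresponds to under these assumptions.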
