Sunday, January 29, 2012

Good solution vs. "the" solution

Most interesting problems have more than one solution; it's also usually the case that you can do pairwise comparisons and determine whether one solution is better (faster, or simpler, or more reliable, or what have you) than some other one. For some problems, though, there exists something that is the solution. It's a solution that does much more than solve the problem: it makes the problem go away. It dissolves the problem. When you find the solution to some problem, you'll wonder why anyone ever thought it was a problem in the first place. For example, the imaginary unit first appeared as part of a piecemeal solution to the problem of finding general roots of cubic equations. But if you imagine someone who grew up in a reality where complex numbers were discovered and used before real numbers, you'll probably agree that such a person couldn't even understand why finding roots of cubic equations should be any more difficult than finding roots of quadratic ones.
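To make the cubic example concrete, here is the textbook illustration (Bombelli's, not something from this post): the equation x^3 = 15x + 4. Cardano's formula hands you a root in the form x = cbrt(2 + 11i) + cbrt(2 - 11i), even though the equation has the perfectly ordinary real root x = 4; since 2 + 11i = (2 + i)^3 and 2 - 11i = (2 - i)^3, the two cube roots are 2 + i and 2 - i, and they sum to 4. The formula reaches a real answer only by passing through complex intermediate values, which is exactly the sort of thing that stops looking mysterious once complex numbers are taken for granted.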

Now I'll do something all of us do but rarely admit to: I'll talk about something I really don't have the first clue about. Quantum mechanics! I think that the "many worlds" meta-theory of quantum mechanics is the solution to what's known in other interpretations as the "measurement problem" (i.e., the empirical fact that apparently identical quantum systems behave differently depending on whether they are observed or not). The many worlds interpretation dissolves the problem completely: it shows why, given that quantum systems don't really behave differently when they're measured than when they're not, it appears to us that they do. If, by some historical accident, the many worlds interpretation had been the first meta-theory of quantum mechanics to be developed, no one would even have coined the phrase "measurement problem".

Tuesday, January 24, 2012

Majority, average, what's the difference

So, I was just listening to this here podcast. It started off with a discussion of James Surowiecki's The Wisdom of Crowds, which talks about the fact that when you're trying to estimate some quantity, an average over a large number of individual guesses by people picked at random will be closer to the truth than an expert's opinion. The interviewed guest gives the example of Francis Galton's observation that the crowd at a county fair accurately guessed the weight of an ox when their individual guesses were averaged (the average was closer to the ox's true butchered weight than the separate estimates of any of the cattle experts). He then goes on to say that the reason behind this and similar, seemingly magical, phenomena is Condorcet's Jury Theorem: take a group of people, each of whom is more likely to get the right answer than the wrong answer, and ask them a question. As the size of the group increases, the probability that the majority gets the right answer approaches 1 in the limit; the same holds for pluralities. This, he says, is also why surveys are accurate.

OK, something's not quite right here. Lots, actually. In the context of the ox example, what does it mean that each group member is "more likely to get the right answer than the wrong answer"? Weight is a continuous variable, so for each group member, the probability of guessing exactly the right answer is precisely zero. Grad school was a long time ago, but I seem to remember something about Condorcet's Jury Theorem being applicable only to situations of binary choice (hence the word "Jury" in the name). The average of individual guesses of a continuous quantity and a majority pick between two alternatives are very different things. Also, the reason surveys work is the Central Limit Theorem, not Condorcet's Jury Theorem. The only thing the two have in common is the word "Theorem" in the name.
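The difference is easy to see in a quick simulation. The numbers below are made up for illustration, not taken from Galton or the podcast: continuous guesses are modeled as unbiased noise around the true weight, and the binary "jurors" are each right with probability 0.6.

    import random

    random.seed(1)
    TRUE_WEIGHT = 1198   # pounds; roughly the figure usually quoted for Galton's ox
    N = 800              # crowd size

    # Wisdom-of-crowds / Central Limit Theorem flavor:
    # noisy but unbiased continuous guesses, then take the average.
    guesses = [random.gauss(TRUE_WEIGHT, 150) for _ in range(N)]
    print("average guess:", sum(guesses) / N)               # lands close to 1198
    print("anyone exactly right?", TRUE_WEIGHT in guesses)  # essentially never

    # Condorcet flavor: a binary question, each juror independently right
    # with probability 0.6; count how often the majority verdict is correct.
    trials = 10_000
    majority_right = sum(
        sum(random.random() < 0.6 for _ in range(N)) > N / 2
        for _ in range(trials)
    )
    print("P(majority is right):", majority_right / trials)  # ~1 for a group this large

The first computation works because averaging shrinks the error of unbiased individual guesses; the second because independent better-than-chance votes pile up on the correct side. Related phenomena, but not the same theorem.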

Saturday, January 21, 2012

Tell us what to think

I've stumbled onto the striking graph below via FlowingData (the original source is ProPublica):


It's a good thing that this shift happened, but its magnitude and timing don't speak very well of the members of Congress. They suggest that legislators either have no principles at all or can't even be bothered to read the legislation they endorse or criticize.

Saturday, January 7, 2012

Things I wish would die: The phrase "statistical dead heat"

You hear this phrase a lot any time an important election is approaching. It refers to a situation in which the difference between the percentages of respondents declaring they'll vote for candidate A and candidate B is smaller than the margin of sampling error. For example, suppose there will soon be a Republican presidential primary in Florida where voters will be choosing between two candidates, Mitt Romney and Ann Coulter. A public opinion poll comes out showing that 51% of respondents say they'll vote for Romney and 49% say they'll vote for Coulter. The polling company says that the margin of sampling error in this poll is three percentage points. The media declare that Romney and Coulter are locked in a "statistical dead heat" or "statistical tie" because, given the margin of error, Romney's true vote share could be as low as 48%, and Coulter's could be as high as 52%.

But to represent this situation as a tie is highly misleading, for several reasons. I'll concentrate here on two of them, to show just how misleading it can be. In what follows, I'm making the simplifying assumption that sampling error is the only source of uncertainty in my example poll. This is of course unrealistic, but completely justified, since sampling error is the only type of uncertainty the media report.

First, the size of the margin of error depends on the confidence level chosen for the particular poll. Most polls report a 95% confidence interval. Suppose that's the case with our fictional Romney v. Coulter situation. What this means is that, if the poll were redone a large number of times with the same sample size, the observed vote share would fall within three points of Romney's true vote share about 95% of the time; the 48%-to-54% interval is just this particular poll's version of that statement. Put another way, the difference between Romney's and Coulter's vote shares is not statistically significant at the 5% level (but if we chose, say, a 68% confidence interval, the margin of error would be approximately 1.53 points, which is smaller than the spread). So the fact that the difference between Coulter and Romney is smaller than the margin of error doesn't mean Romney and Coulter are tied; it means that Romney is most likely ahead but, if we hold ourselves to a 5% significance level, we can't say exactly by how much. Even when your point estimate is not significant at the level you chose, it is still the best guess you have.
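Here, for the record, is where the 1.53 comes from; this is just back-of-the-envelope arithmetic on the reported 3-point margin, nothing specific to any real poll:

    from statistics import NormalDist

    reported_margin = 3.0              # the pollster's 95% margin, in percentage points
    z95 = NormalDist().inv_cdf(0.975)  # ~1.96
    se = reported_margin / z95         # implied standard error, ~1.53 points

    for conf in (0.95, 0.90, 0.68):
        z = NormalDist().inv_cdf(0.5 + conf / 2)
        print(f"{conf:.0%} margin of error: {z * se:.2f} points")

(The 68% line prints 1.52 rather than 1.53 because the exact 68% multiplier is a hair under 1; the 1.53 in the text uses the usual one-standard-error shorthand.)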

Second, when we look at polls, we don't really care about the spread; we care about who's more likely to win. The reason we pay attention to the spread at all is that we treat it as a proxy for the probability of winning. So let's think about this in those terms (again, assuming sampling error is the only source of uncertainty). At the 5% significance level, the margin of sampling error of a statistic is 1.96 times the standard error of that statistic (if you want to know how the standard error of a proportion is calculated, look below the fold). Thus, the sampling distribution of Romney's vote share is approximately normal with mean 51 and standard deviation 1.53 (= 3/1.96). Below is a plot of that distribution. The ratio of the area shaded in red to the total area under the curve is the probability that it's actually Coulter who's ahead (i.e., the probability that the true percentage of voters who intend to support Romney is less than 50). That probability is about 26%. The odds of Romney being ahead of Coulter are roughly 3 to 1; that doesn't sound like a dead heat to me.
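That 26% is easy to check. A minimal sketch, again taking the reported 3-point margin at face value and ignoring every source of error other than sampling:

    from statistics import NormalDist

    # Sampling distribution of Romney's vote share: normal, mean 51, SD 3/1.96.
    romney = NormalDist(mu=51, sigma=3 / 1.96)

    p_coulter_ahead = romney.cdf(50)  # probability the true share is below 50%
    print(f"P(Coulter actually ahead) = {p_coulter_ahead:.1%}")                         # ~25.7%
    print(f"Odds Romney is ahead: {(1 - p_coulter_ahead) / p_coulter_ahead:.1f} to 1")  # ~2.9 to 1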



Wednesday, January 4, 2012

True beauty is attained indirectly

Since lately I'm all about short quotes, here's another one:
And if function is hard enough, form is forced to follow it, because there is no effort to spare for error. Wild animals are beautiful because they have hard lives.
This one is from a wonderful essay by Paul Graham. As a programmer, Graham uses Lisp. Go figure.

Tuesday, January 3, 2012

Try to beat that Hirsch Index

This is one of those things where, on first reading, your immediate thought is "this can't be true." Here's John D. Cook quoting historian Clifford Truesdell:
… in a listing of all of the mathematics, physics, mechanics, astronomy, and navigation work produced in the 18th century, a full 25% would have been written by Leonhard Euler.