A few people have expressed an interest in understanding some of the math behind the analysis in the Presidential Election Predictor. I'm already regretting the title, but I'll give a reasonable shot at explaining the underlying math.

Various polls of voter preference are taken throughout the campaign. These are not direct predictors of the
outcome of the election, but they are snapshots in time. If a poll samples 1000 voters, and they've done
a good job of eliminating systematic errors, and the voters are both insightful and honest, then *on
average* they will give a good prediction of how the election would turn out if it were held when the
poll was conducted.

Nonetheless, the actual voters polled are only a sample of the total electorate. Just by chance, they may select more voters who prefer one candidate over the other. In a close race, this sampling error is very nearly one over the square root of the number of voters sampled. In this case, about 3%. That means that if the polling company conducted the same poll over and over again, their answers would vary by about three percentage points.
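As a quick check on that 3% figure, here is a minimal Python sketch of the rule of thumb. The function names are mine. For comparison, it also computes the conventional 95% margin of error for one candidate's share of the vote. Reading "sampling error" that way is my assumption, but the two numbers are nearly the same in a close race.

```python
from math import sqrt

def rule_of_thumb_error(n_voters):
    """Sampling error from the rule of thumb in the text: 1 / sqrt(n)."""
    return 1.0 / sqrt(n_voters)

def margin_of_error_95(n_voters, share=0.5):
    """Conventional 95% margin of error for a candidate's share of the vote.

    share=0.5 corresponds to a close race (an assumption for illustration).
    """
    return 1.96 * sqrt(share * (1.0 - share) / n_voters)

for n in (500, 1000, 2000):
    print(f"n={n}: rule of thumb {rule_of_thumb_error(n):.3f}, "
          f"95% margin {margin_of_error_95(n):.3f}")
```

Because the error shrinks only as the square root of the sample size, quadrupling the number of voters sampled cuts the error in half.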

Now, predicting the outcome of the election is not the same thing as determining the actual results if the election were held today. That could be done by averaging enough polls, thereby increasing the number of voters sampled and thereby decreasing the uncertainty from sampling error. There are two other categories of effects to consider. First, the election is not held today. (Well, if you're reading this on election day, you've caught me on a technicality.) Voter sentiment can change. At the very least, many undecided voters will decide, but some people will also change their minds. Second, the polling process will not be free of systematic errors. That is, the sample of voters who participate in the polls may not be representative of the voters who actually cast votes on election day.

Polling companies will try to identify and compensate for systematic errors. This is generally summed up in the catch-all phrase "likely voters." That is, they try to restrict their sample to voters who they think will actually cast ballots, and they may also try to identify and compensate for systematic biases in their sample. For example, they may ask about party preference, hoping to compensate if they have a larger fraction of Republicans than they should based on lists of registered voters and/or past voter turnout.

Systematic errors creep into polling in many insidious ways. Sometimes pollsters will try to sway the results by deliberately introducing systematic error. They may phrase questions in a way that seems to ask how the participant will vote on a candidate or issue, while nudging the participant toward a particular answer. ("If the election were held today, would you vote for the senator who tried to keep us out of war or the senator who got us in deeper?") More subtle techniques can also be used to influence the results: the order of the questions or the time of day when calls are made, for example.

By understanding and characterizing the random and systematic errors in the polling process, it is possible to "predict" the outcome of the election. By this, I mean to make a statement of probability: "The probability that John McCain will win the presidential election is xx%." A lot of analysis and a lot of guesswork go into this type of statement. The prediction will depend on the skill and assumptions of the analyst. A firm grasp of probability and a keen eye toward systematic errors are crucial. A keen eye toward the ways that the analyst can introduce systematic errors is part of this. There are guidelines for statistical analysis that can and should be used to prevent the analyst from biasing the prediction.

The whole concept of futures markets is based on the idea that a group of analysts with access to a wide variety of information can pool that information to make a better probabilistic assessment than any single analyst. This often seems to be the case, and it seems to work significantly better when the analysts put their money behind their results.

The way it works is this. If I feel that Barack Obama's chances of winning the election are about 65%, and I have the opportunity to purchase a future guarantee of one dollar if he actually wins, then how much would I be willing to pay for that future guarantee? If the payback is soon enough that I don't have to consider the time value of money, then I'd be willing to pay up to $0.65 for that future guarantee. The idea of futures trading is to give people the chance to make those personal assessments based on the information that they have available. The market should settle on a value where half the participants (weighted by their willingness to put money behind their convictions) believe that the outcome is more likely than the market value of the futures contract, and half of the participants believe that the outcome is less likely.
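One way to picture that settling point is as a stake-weighted median of the participants' beliefs. The sketch below is only an illustration of that idea, not a model of how a real exchange clears trades. The beliefs and stakes are made-up numbers, and the function name is hypothetical.

```python
def weighted_median_belief(beliefs, stakes):
    """Return the belief at which half of the total stake lies on each side.

    beliefs: each participant's probability estimate for the outcome.
    stakes:  how much money each participant is willing to put behind it.
    """
    if not beliefs or len(beliefs) != len(stakes):
        raise ValueError("beliefs and stakes must be non-empty and equal length")
    half = sum(stakes) / 2.0
    running = 0.0
    for belief, stake in sorted(zip(beliefs, stakes)):
        running += stake
        if running >= half:
            return belief
    return max(beliefs)  # only reached through floating-point round-off

# Hypothetical participants: five beliefs and the money behind each one.
beliefs = [0.55, 0.60, 0.65, 0.70, 0.80]
stakes = [10, 20, 40, 20, 10]
print(weighted_median_belief(beliefs, stakes))
```

With these made-up numbers, the price settles at the belief held by the participant in the middle of the money, not at the simple average of everyone's opinion.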

There are a variety of futures markets available for the outcome of the U.S. Presidential election. There are contracts for the overall popular vote, for the overall outcome of the electoral college, and also for the allocation of electoral college votes from each state individually. These futures markets are intrinsically predictors of the outcome of the election. They are subject to errors, biases, misinformation, etc., but they also have some ability to correct for these biases in ways that polls of voter preference do not.

I've written a Presidential Election Predictor that takes the values of futures on the state-by-state outcomes and combines these into a prediction of the overall national outcome. The prediction from the individual state outcomes closely matches the value of a future on the overall outcome of the Presidential election. The rest of this web page describes the techniques that I used to perform the meta-analysis of the individual state-by-state future values, in order to arrive at an overall prediction of the outcome.

The participants in the futures market do not know the outcome of the election in advance. Their assessment can be viewed as an analysis of the electorate, resulting in a probabilistic assessment of the outcome. There is a range of outcomes that correspond to the margin of victory (in any units: votes, percentage of votes, etc.). We can reasonably consider these outcomes to be normally distributed, and the probability that one candidate wins is the fraction of outcomes where they get more votes than their opponent. This probability corresponds to a z-value if the outcomes (as viewed by the participants in the futures market) are distributed with a normal probability function.

Let's think of the probability distribution as a kind of super-poll. It's not a very accurate way of looking at it, but it helps introduce and clarify the idea that the outcomes have a mean and standard deviation. Suppose for the moment that all voters have decided, none of them will change their minds, and they all know whether or not they will vote. Suppose also that the polling companies have no systematic biases in the way that they've conducted their polling. In this case, the polls are an accurate predictor of the electoral outcome. Here the term "accurate" means that the mean value of the poll results will be the same as the actual election returns. But, since each poll is only a sample of the voting population, there is some random variation in the poll results.

So, suppose that there are three polls, and all of the participants know the results of these polls and no one has any better information. Every investor would be able to analyze the results of those polls and determine the probability that each candidate will actually win the election. In this case, there is only one source for the variance in the distribution of possible outcomes; that is, the sample size of the polls.

Now, suppose that the same is true, but that some voters will change their minds based on last minute campaigning and get-out-the-vote activities. This spreads out the range of possible outcomes and increases the variance of the distribution. There are now two types of uncertainty in the election results: how well the polls represent the actual opinions of the voters, and how many voters will change their minds between now and the election.

The futures market (at least allegedly) takes all of this into account when making a probabilistic assessment of the outcome. If we make an assumption that the possible outcomes are normally distributed, then the future value tells us where the 50/50 threshold between victory and defeat is on that bell curve. It doesn't actually tell us what the mean value is or what the standard deviation is, but it definitely implies a particular ratio between the mean and the standard deviation. This ratio is called the z-value.
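Under that normality assumption, turning a future value into a z-value is just an inverse lookup on the standard normal curve. Here is a minimal sketch using Python's standard library. The function name is mine, and the prices are illustrative.

```python
from statistics import NormalDist

def z_from_price(price):
    """z-value (mean / standard deviation of the margin) implied by a future value.

    price is the market's probability of a win, strictly between 0 and 1.
    Assumes the possible outcomes are normally distributed.
    """
    return NormalDist().inv_cdf(price)

for price in (0.50, 0.65, 0.83, 0.95):
    print(f"future value {price:.2f} -> z = {z_from_price(price):+.3f}")
```

A price of 0.50 gives z = 0 (a dead heat), and the z-value grows quickly as the price approaches 1. The price alone fixes only the ratio of the mean to the standard deviation, not either one separately.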

We don't actually care about how the variance is divided between random uncertainty (such as the sample size of the polls) and the systematic uncertainty (systematic polling error, shifts in voter preference, voter turnout, etc.). Within each state, those effects have all been considered by the participants in the futures market. What we want to separate is the uncertainty in the outcome that varies on a state-by-state basis, compared to the uncertainty in the outcome that varies nationally.

Let's consider only two effects. Let's suppose that there are random polling errors due to sample size, and let's also suppose that one party does a better job of getting their voters to the polls. The sampling errors will be completely independent from state to state. If the poll averages for John McCain were a little low in Florida, there is no reason to think that they would also be low in Ohio. On the other hand, if the Obama campaign does a better job of getting their voters to the polls in Wisconsin, then they probably did a better job in Iowa, too. Since we don't really know how these effects play out until the election is actually held, these both contribute to the uncertainty in the outcome for each state. If there is no national trend in the uncertainty, then the probabilities for all of the states are independent. On the other hand, if there is a national trend (such as a late swing in voter preference), then it pushes all of the states in the same direction.

These two types of effects combine in the same way when considering the outcome in a single state:

σ^{2} = σ_{s}^{2} + σ_{n}^{2}

where σ^{2} is the variance in the overall outcome at the state level, σ_{s}^{2}
is the variance that applies only to the individual state, and σ_{n}^{2} is the variance
that is common to all states throughout the nation. It should be obvious that this is a great simplification.
There will be effects that vary throughout groups of states, but not throughout the nation as a whole. Two examples:
There could be some late-breaking news that is of regional interest—affecting voters in coastal states with
potential offshore drilling, for example. Or, there could be a last-minute ad campaign that is only aired in
battleground states. Still, this is a useful simplification and captures the main characteristics in the way that
state-by-state results combine to give a national result through the electoral college.
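A small simulation can make the simplification concrete. The sketch below draws one national effect shared by two states, plus an independent state-only effect for each. The 40% state share, the trial count, and the function name are illustrative assumptions, with the total variance normalized to 1.

```python
import random
from math import sqrt

def simulate_two_states(state_fraction, n_trials=50_000, seed=0):
    """Simulate outcome deviations for two states sharing one national effect.

    Total variance per state is normalized to 1, so the state-only variance
    is state_fraction and the national variance is 1 - state_fraction.
    Returns (variance of state A, covariance between states A and B).
    """
    rng = random.Random(seed)
    sigma_s = sqrt(state_fraction)
    sigma_n = sqrt(1.0 - state_fraction)
    a_vals, b_vals = [], []
    for _ in range(n_trials):
        national = rng.gauss(0.0, sigma_n)  # same draw applied to both states
        a_vals.append(national + rng.gauss(0.0, sigma_s))
        b_vals.append(national + rng.gauss(0.0, sigma_s))
    mean_a = sum(a_vals) / n_trials
    mean_b = sum(b_vals) / n_trials
    var_a = sum((a - mean_a) ** 2 for a in a_vals) / n_trials
    cov_ab = sum((a - mean_a) * (b - mean_b)
                 for a, b in zip(a_vals, b_vals)) / n_trials
    return var_a, cov_ab

var_a, cov_ab = simulate_two_states(state_fraction=0.4)
print(f"variance of one state:     {var_a:.3f}")
print(f"covariance between states: {cov_ab:.3f}")
```

The variance of a single state should come out near 1 (the two pieces add), while the covariance between the two states should come out near the national share alone. The shared piece is what makes the states move together.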

Now consider the way that these two types of errors affect the national outcome. The state-by-state variance describes how the results from each state vary independently from the expectation. In some states, Barack Obama will do better than expected and in some states, John McCain will do better than expected. This variation gives "equally" to each candidate. I quoted that word, because this state-specific variation could hand Colorado or Ohio to one of the candidates, and tip the entire election. On the other hand, a national effect is applied in the same direction to every state at the same time. This tips the entire scale, giving one candidate or the other better-than-expected results in Virginia, Ohio, Colorado, New Hampshire, etc.

We can look at the probabilistic assessments of futures markets for each state outcome. It is clear that there is some uncertainty in the national trends. This is indicated by the relatively large probability of an underdog win in states that are not considered "battleground states". Consider Oregon, for example. Based on current poll data, John McCain's chances of winning Oregon should be no more than 5%, based on sample size or even considering systematic errors in the polling methods. Still, McCain might just win Oregon if there were a shift of a few percent in the overall voter sentiment. This is why participants in the futures market are willing to consider the possibility that McCain might pull off a win in Oregon.

I've evaluated the probabilities of Obama and McCain victories in various states based solely on polling data, then compared those probabilities with the probabilities of Obama and McCain wins in those same states based on futures values. I assume that a constant percentage of the variance in each state is due to national trends, and attempt to estimate that percentage. If we look at Oregon as a case study, the 17% chance of a McCain victory corresponds to a z-value of 0.95, whereas the z-value from polling data is approximately 1.56. This implies that the variance from polling is approximately 37% of the overall variance. I've been using values between 25% and 40% for the state-to-state portion of the variance in my meta-analysis. I chose this range because it puts the right number of states into the in-play zone as suggested by polling data. There's some state-to-state variation, but the overall fit seems reasonable. Playing with this parameter indicates that the results of the meta-analysis are not that sensitive to the actual value.
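The Oregon arithmetic can be reproduced in a few lines. The sketch assumes that both z-values describe the same expected margin, so that z = margin / σ and the margin cancels when the two are divided. The function names are mine.

```python
from statistics import NormalDist

def z_for_underdog(p_underdog):
    """z-value of the favorite's lead implied by the underdog's win probability."""
    return -NormalDist().inv_cdf(p_underdog)

def polling_variance_fraction(z_market, z_polls):
    """Fraction of the overall variance that comes from polling uncertainty.

    Assumes the same expected margin behind both z-values, so the ratio of
    standard deviations is z_market / z_polls and the ratio of variances
    is its square.
    """
    return (z_market / z_polls) ** 2

z_market = z_for_underdog(0.17)  # futures: 17% chance of a McCain win
z_polls = 1.56                   # from polling data, as quoted in the text
fraction = polling_variance_fraction(z_market, z_polls)
print(f"z from futures: {z_market:.2f}")
print(f"polling share of the variance: {fraction:.0%}")
```

The larger the gap between the two z-values, the more of the market's uncertainty has to come from something other than the polls themselves.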

Once this single free parameter has been chosen, the stated assumptions give a single value for the probability of each electoral outcome (Obama or McCain). The math goes something like this:

First, calculate new z-values that use the state-by-state uncertainty instead of the combined uncertainty. If 40% of the variance is state-by-state, then the state-by-state standard deviation is 63.2% of the combined standard deviation. So, if we start with a future value of 65% for an Obama win, that corresponds to a z-value of 0.385 on the normal distribution with the combined variance. If we reduce the variance to 40% of the combined variance, then the z-value is 0.609 in the state-specific distribution.
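Here is that rescaling as a short Python sketch. The function name is mine, and the 65% price and 40% state share are the example values from the paragraph above.

```python
from math import sqrt
from statistics import NormalDist

def state_specific(price, state_fraction):
    """Rescale a future value to the state-specific distribution.

    price:          future value (probability of a win) with combined variance.
    state_fraction: share of the variance that is state-by-state (0 < f <= 1).
    Returns (z_combined, z_state, probability with only state-level variance).
    """
    std_normal = NormalDist()
    z_combined = std_normal.inv_cdf(price)
    z_state = z_combined / sqrt(state_fraction)  # smaller sigma, larger z
    return z_combined, z_state, std_normal.cdf(z_state)

z_combined, z_state, p_state = state_specific(price=0.65, state_fraction=0.40)
print(f"combined z: {z_combined:.3f}")
print(f"state-specific z: {z_state:.3f}")
print(f"state-specific probability: {p_state:.3f}")
```

Removing the national part of the variance always pushes the probability away from 50%: a favorite becomes a stronger favorite once the only remaining uncertainty is state-specific.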

It's relatively straightforward to calculate new state-by-state probabilities for each candidate. These represent the electoral outcomes in the event that the net national effect turns out to be zero. These probabilities can be used in a permutation analysis to sum up the probabilities for each candidate of getting 270 electoral votes (or only 269 for Obama, since the Democrats will control the House of Representatives).
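For readers who want to experiment, here is one way to do that summation in Python. Rather than enumerating every combination of state outcomes, this sketch builds up the distribution of electoral votes one state at a time (a convolution). For independent states this gives the same totals, but it is a substitute technique and not necessarily how the Predictor itself does it. The five "states" are made-up numbers, and winner-take-all allocation in every state is an assumption.

```python
def electoral_vote_distribution(states):
    """Distribution of a candidate's electoral votes, states independent.

    states: list of (electoral_votes, win_probability) pairs.
    Returns a list dist where dist[v] = P(candidate gets exactly v votes).
    """
    total = sum(ev for ev, _ in states)
    dist = [1.0] + [0.0] * total  # before any state: 0 votes with certainty
    for ev, p in states:
        new = [0.0] * (total + 1)
        for votes, prob in enumerate(dist):
            if prob:
                new[votes] += prob * (1.0 - p)  # candidate loses this state
                new[votes + ev] += prob * p     # candidate wins this state
        dist = new
    return dist

def win_probability(states, threshold):
    """Probability of reaching at least `threshold` electoral votes."""
    return sum(electoral_vote_distribution(states)[threshold:])

# Hypothetical five-state map with 54 electoral votes; 28 is a majority.
toy_states = [(10, 0.90), (20, 0.60), (15, 0.50), (5, 0.20), (4, 0.75)]
print(f"P(win) = {win_probability(toy_states, threshold=28):.3f}")
```

For the real map, the list would hold one entry per state with its state-specific probability, and the threshold would be 270 (or 269 for Obama, as noted above).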

Now I consider the effects of shifting all of the states together by an
amount implied by the remaining 60–75% of the variance. This is equivalent to adding the same random variable x to every
state-specific z-value. The standard deviation for the distribution of x is σ_{n}/σ_{s},
or 1.225 for the case where 60% of the variance is from national effects. Now I can calculate the state-by-state
probabilities for each value of the nationwide variation. This is a bias analysis.
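The bias analysis can be sketched as follows. For each value of the national shift x on a grid, the code shifts every state-specific z-value by x, recomputes the state probabilities, and sums up the electoral outcomes. As one simple way of collapsing the curve into a single number, it then weights each point by the normal density of x and adds them up. That numerical weighting is my own shortcut for illustration. The grid size, span, and function names are also assumptions.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def win_probability(states, threshold):
    """P(candidate reaches `threshold` electoral votes), states independent."""
    total = sum(ev for ev, _ in states)
    dist = [1.0] + [0.0] * total
    for ev, p in states:
        new = [0.0] * (total + 1)
        for votes, prob in enumerate(dist):
            if prob:
                new[votes] += prob * (1.0 - p)
                new[votes + ev] += prob * p
        dist = new
    return sum(dist[threshold:])

def bias_curve(state_z, threshold, national_fraction, n_points=81, span=4.0):
    """Return a list of (x, weight, P(win | x)) over national shifts x.

    state_z: list of (electoral_votes, state-specific z-value) pairs.
    x is in units of the state-specific standard deviation, so its own
    standard deviation is sigma_n / sigma_s. Needs 0 < national_fraction < 1.
    """
    sigma_x = sqrt(national_fraction / (1.0 - national_fraction))
    shift = NormalDist(0.0, sigma_x)
    step = 2.0 * span * sigma_x / (n_points - 1)
    curve = []
    for i in range(n_points):
        x = -span * sigma_x + i * step
        shifted = [(ev, STD_NORMAL.cdf(z + x)) for ev, z in state_z]
        curve.append((x, shift.pdf(x) * step,
                      win_probability(shifted, threshold)))
    return curve

def overall_probability(curve):
    """Density-weighted sum of the bias curve (a simple numerical integral)."""
    return sum(weight * p for _, weight, p in curve)

# Sanity check with a single one-vote "state" at the state-specific z of 0.609.
curve = bias_curve([(1, 0.609)], threshold=1, national_fraction=0.60)
print(f"overall probability: {overall_probability(curve):.3f}")
```

The single-state check is a useful sanity test: putting the national variance back in should bring the probability back close to the 65% future value that the state-specific z-value of 0.609 was derived from.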

The points in the bias analysis are very close to a cumulative normal distribution. Fitting these points to such a distribution gives a mean and variance. The variance from the fitted cumulative normal distribution is added to the variance from the bias curve. The combined variance and the mean are used to determine the overall probability of victory for each candidate, and the lead expressed in standard deviations.

Steve Schaefer
Last modified: 2008/09/16