Tuesday 23 October 2012

Bayes for President!

I couldn't resist getting sucked into the hype associated with the US election and debates, and so I thought I had a little fun of my own and played around a bit with the numbers. [OK: you may disagree with the definition of "fun" $-$ but then again, if you're reading this you probably don't...]

So, I looked on the internet to find reasonable data on the polls. Of course there are a lot of limitations to this strategy. First, I've not bothered doing some sort of proper evidence synthesis, taking into account the different polls and pooling them in a suitable way. There are two reasons why I didn't: the first one is that not all the data are publicly available (as far as I can tell), so you have to make do with what you can find; second, I did find some info here, which seems to have accounted for this issue anyway. In particular, this website contains some sort of pooled estimates for the proportion of people who are likely to vote for either candidate, by state, together with a "confidence" measure (more on this later). Because not all the states have data, I have also looked here and found some additional info. 

Leaving aside representativeness issues (which I'm assuming are not a problem, but may well be, if this were a real analysis!), the second limitation is of course that voting intentions may not directly translate into actual votes. I suppose there are some studies out there to quantify this, but again, I'm making life (too) easy and discount this effect.

The data on the polls that I have collected in a single spreadsheet look like this
ID State Dem Rep   State_Name Voters Confidence
1    AK  38  62       Alaska      3    99.9999
2    AL  36  54      Alabama      9    99.9999
3    AR  35  56     Arkansas      6    99.9999
4    AZ  44  52      Arizona     11    99.9999
5    CA  53  38   California     55    99.9999
...      ...       ...      ...       ...

The columns Dem and Rep represent the pooled estimation of the proportion of voters for the two main parties (of course, they may not sum to 100%, due to other possible minor candidates or undecided voters). The column labelled Voters gives the number of Electoral Votes (EVs) in each state (eg if you win at least 50% of the votes in Alaska, this is associated with 3 votes overall, etc). Finally, the column Confidence indicates the level of (lack of) uncertainty associated with the estimation. States with high confidence are "nearly certain" $-$ for example, the 62% estimated for Republicans in Alaska is associated with a very, very low uncertainty (according to the polls and expert opinions). In most states, the polls are (assumed to be) quite informative, but there are some where the situation is not so clear cut.

I've saved the data in a file, which can be imported in R using the command
polls <- read.csv("http://www.statistica.it/gianluca/Polls.csv")

At this point, I need to compute the 2-party share for each state (which I'm calling $m$) and fix the number of states at 51
m <- Dem/(Dem+Rep)
Nstates <- 51

Now, in truth this is not a "proper" Bayesian model, since I'm only assuming informative priors (which are supposed to reflect all the available knowledge on the proportion of voters, without any additional observed data). Thus, all I'm doing is a relatively easy analysis. The idea is to first define a suitable informative prior distribution based on the point estimation of the democratic share and with uncertainty defined in terms of the confidence level. Then I can use Monte Carlo simulations to produce a large number of "possible futures"; in each future and for each state, the Democrats will have an estimated share of the popular vote. If that is greater than 50%, Obama will have won that state and the associated EVs. I can then use the induced predictive distribution on the number of EVs to assess the uncertainty underlying an Obama win (given that at least 272 votes are necessary to become president).

In their book, Christensen et al show a simple way of deriving a Beta distribution based on an estimation of the mode, an upper limit and a confidence level that the variable is below that upper threshold. I've coded this in a function betaPar2, which I've made available from here (so you need to download it, if you want to replicate this exercise).

Using this bit of code, one can estimate the parameters of a Beta distribution centered on the point estimate and for which the probability of exceeding the threshold 0.5 is given by the level of confidence.
a <- b <- numeric()
for (s in 1:Nstates) {
if (m[s] < .5) {
bp <- betaPar2(m[s],.499999,Confidence[s]/100)
a[s] <- bp$res1
b[i] <- bp$res2
if (m[s] >=.5) {
bp <- betaPar2(1-m[s],.499999,Confidence[s]/100)
a[s] <- bp$res2
b[s] <- bp$res1

The function
betaPar2 has several outputs, but the main ones are res1 and res2, which store the values of the parameters $\alpha$ and $\beta$, which define the suitable Beta distribution. In fact, the way I'm modelling is to say that if the point estimate is below 0.5 (a state $s$ where Romney is more likely to win), then I want to derive a suitable pair $(\alpha_s,\beta_s)$ so that the resulting Beta distribution is centered on $m_s$ and for which the probability of not exceeding 0.5 is given by $c_s$ (which is defined as the level of confidence for state $s$, reproportioned in [0;1]). However, for states in which Obama is more likely to win ($m_s\geq 0.5$), I basically do it the other way around (ie working with 1$-m_s$). In these cases, the correct Beta distribution has the two parameters swapped (notice that I assign the element res2 to $\alpha_s$ and the element res1 to $\beta_s$).

For example, for Alaska (the first state), the result is an informative prior like this.

In line with the information from the polls, the estimated average proportion of Democratic votes is around 38% and effectively there's no chance of Obama getting a share that is greater than 50%.

Now, I can simulate the $n_{\rm{sims}}=10000$ "futures", based on the uncertainty underlying the estimations using the code
nsims <- 10000
prop.dem <- matrix(NA,nsims,Nstates)
for (i in 1:Nstates) {
prop.dem[,i] <- rbeta(nsims,a[i],b[i])
The matrix prop.dem has 10000 rows (possible futures) and 51 columns (one for each state).

I can use the package coefplot2 and produce a nice summary graph
means <- apply(prop.dem,2,mean)
sds <- apply(prop.dem,2,sd)
low <- means-2*sds
upp <- means+2*sds
reps <- which(upp<.5)         # definitely republican states
dems <- which(low>.5)         # definitely democratic states
m.reps <- which(means<.5 & upp>.5) # most likely republican states
m.dems <- which(means>.5 & low<.5) # most likely democratic states
cols <- character()
cols[reps] <- "red"
cols[dems] <- "blue"
cols[m.reps] <- "lightcoral"
cols[m.dems] <- "deepskyblue"
vn <- paste(as.character(State)," (",Voters,")",sep="")
coefplot2(means,sds,varnames=vn,col.pts=cols,main="Predicted probability of democratic votes")

This gives me the following graph showing the point estimate (the dots), 95% and 50% credible intervals (the light and dark lines, respectively). Those in dark blue and bright red are the "definites" (ie those that are estimated to be definitely Obama or Romney states, respectively). Light blues and reds are those undecided (ie for which the credible intervals cross 0.5).

Finally, for each simulation, I can check that the estimated proportion of votes for the Democrats exceeds 0.5 and if so allocate the EVs to Obama, to produce a distribution of possible futures for this variable.
obama <- numeric()
for (i in 1:nsims) {
obama[i] <- (prop.dem[i,]>=.5)%*%Voters
hist(obama,30,main="Predicted number of Electoral Votes for Obama",xlab="",ylab="")
abline(v=270,col="dark grey",lty=1,lwd=3)

So, based on this (veeeeery simple and probably not-too-realistic!!) model, Obama has a pretty good chance of being re-elected. In almost all the simulations his share of the votes guarantees he gets at least the required 272 EVs $-$ in fact, in many possible scenarios he actually gets many more than that.

Well, I can only hope I'm not jinxing it!


  1. You may be interested in this:


    "Dynamic Bayesian Forecasting of Presidential Elections in the States"
    forthcoming in the Journal of the American Statistical Association.


  2. Thanks $-$ I've just flipped through the paper, but it seems really cool! Of course, the model I described is just a toy but, incidentally, we do get kind of similar results: here Obama is estimated to get 332 EVs, whereas my model predicts 303 (with a 95% credible interval of [272; 337]) EVs. Not the same, but the message is the same (that Obama should win and that the uncertainty underlying this is not huge $-$ given the state of available data).

    I'm sure these kind of things benefit greatly from complex and realistic models, and particularly from prior knowledge. For instance, I'm no real expert in US politics (other than a general interest in the stats issues associated with these kind of data), but I suppose that there may be some underlying clustering going on; perhaps states tend to cluster in the way they vote (because of social, political and economic reasons), so that could be included in a full hierarchical model to make better precisions?

    I think it's interesting that people doing serious work on these things tend to be conservative (and rightly so, as of course you don't want to get these numbers spectacularly wrong). But may be (just may be!) if the polls are anything to go by (within reason), the race is not as uncertain as it is depicted in the media...

  3. Hi Gianluca,
    a very interesting piece of work. I am sure lots of statisticians are interested in seeing how this is done,
    seeing as the media predictions were so spectacularly wrong. I would like to have a further play with it myself but coefplot2 seems to be unavailable at r-forge as the package
    "Failed to build:. . . It is not made available since it does not meet the policies."