Tuesday, 31 March 2015

House of stats

[This is a rather long joint post with Roberto Cerina and complements our paper in the April 2015 issue of Significance]

1. Prelude (kind-of unrelated to what follows). 
Last week, Marta and I finished watching the last series of House of Cards, the Netflix adaptation of the original BBC series (which I may have liked even more... I've not decided yet). The show revolves around US politics and the fictional President Underwood.

Although political science is of course not my main research area (or even one of my research areas at all, to be more precise!), I'm in general interested in politics and in (some of) the applications of statistical methods to this area. Thus, last year I decided to offer an undergraduate project which would be vaguely related to statistical modelling of political data (eg on polls, elections, etc).

The project caught Roberto's attention and he decided to take it on $-$ in fact, he even started asking for things to read before he was supposed to begin the project! Originally, we intended to work on Italian data (which would have been nice, given that we're both Italian and that the proliferation of parties would have made for more complex modelling); but it was much easier to find data about last year's US Senate elections and so we decided to use those.


2. Background 
Modelling US Senate elections is more difficult than modelling the presidential elections, for two main reasons. Firstly, there are far fewer polls per state than there are at the national level for the main "event". This can also mean a high impact for the “house effect” $-$ where polls favour one party or the other, depending on the polling house conducting the study.

The second problem has to do with correctly accounting for the effect of the economy on voting behaviour. In Senate elections, this is thought to be weaker due to the lack of precise “blame” to be directed at the incumbent. Presidents are seen to be responsible for the economy, and are heavily rewarded or penalised for it. However, the same cannot be said (at least to the same extent) for a senator. Nevertheless, we want to include some long-term factors in our model, to complement the short-term shocks produced by the ongoing polls; to this aim, we collected state-specific data on macroeconomic variables and included them in the model.


3. The model
The basic idea is to extend an interesting model by Drew Linzer, which was developed for the 2012 US presidential elections. In a nutshell, the model aims at combining data from the polls, which start being conducted weeks before the elections and with increasing intensity, with data on some historical trends. In particular, the objective is to estimate the results of the elections in a dynamic way, so that the most recent polls tend to weigh more than older ones.

Our main data are the polls; in particular the number $y_{itk}$ of respondents who declared they would vote for the Republican candidate in the $k-$th poll in the $i-$th state (we consider the 33 in which elections were taking place) at the $t-$th week of the election campaign (we consider a total of $T=22$ weeks) out of the sample size $n_{itk}$.

Then we aggregate the data over weeks as $N_{it}=\sum_{k} n_{itk}$ and $Y_{it} = \sum_k y_{itk}$ and model $Y_{it} \sim\mbox{Binomial}(p_{it},N_{it})$. The parameter $p_{it}$ is the object of our inference: it represents the probability that a random elector votes for the Republican candidate in state $i$ at week $t$, and we model it as
$$ \mbox{logit}(p_{it}) = \alpha_{it} + \beta_t $$
where $\alpha_{it}$ is a state-specific effect on a given week of the campaign and $\beta_t$ is the common trend amongst Republican candidates at the national level.
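As a rough illustration of the aggregation and of the likelihood just described, here is a minimal Python sketch. The function names and the poll figures are made up for this example (it is not the code we used for the paper); it only shows how polls within a state-week cell are pooled and how $p_{it}$ follows from $\alpha_{it}+\beta_t$.

```python
import math
from collections import defaultdict


def aggregate_polls(polls):
    """Pool polls within each (state, week) cell.

    `polls` is an iterable of (state, week, y, n) tuples, where y is the
    number of respondents backing the Republican candidate and n is the
    sample size. Returns {(state, week): (Y_it, N_it)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for state, week, y, n in polls:
        totals[(state, week)][0] += y  # Y_it = sum_k y_itk
        totals[(state, week)][1] += n  # N_it = sum_k n_itk
    return {cell: (Y, N) for cell, (Y, N) in totals.items()}


def inv_logit(x):
    """Map a value on the logit scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def binomial_log_lik(Y, N, alpha, beta):
    """Log-likelihood of Y ~ Binomial(p, N) with logit(p) = alpha + beta."""
    p = inv_logit(alpha + beta)
    log_choose = math.lgamma(N + 1) - math.lgamma(Y + 1) - math.lgamma(N - Y + 1)
    return log_choose + Y * math.log(p) + (N - Y) * math.log(1.0 - p)


# Hypothetical polls: two in week 1 and one in week 2 for a single state.
example_polls = [("KS", 1, 210, 500), ("KS", 1, 190, 400), ("KS", 2, 260, 520)]
cells = aggregate_polls(example_polls)
Y, N = cells[("KS", 1)]
print(Y, N, binomial_log_lik(Y, N, alpha=-0.2, beta=0.0))
```

In the actual model $\alpha_{it}$ and $\beta_t$ are of course unknown and estimated jointly; the fixed values above are only there to make the sketch runnable.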

Following the original model of Linzer, we then assume a reverse random walk structure on the $\beta$-s:
$$\beta_{t}| \beta_{t+1} \sim \mbox{Normal}(\beta_{t+1}, \sigma^2_\beta ),$$
with $\beta_T := 0$. This encodes the assumption that individual preferences at week $t+1$ will be affected by the preferences at week $t$. In addition, because of the anchoring at 0 for election week, we imply that, as election day gets closer, the national campaign effects fade, so that the Republican vote share is not affected by them on election week.
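To get a feel for what the reverse random walk prior implies, the following Python sketch simulates one path of the $\beta$-s backwards from election week. The function name and the value of $\sigma_\beta$ are assumptions chosen purely for illustration; in the model, $\sigma_\beta$ is estimated rather than fixed.

```python
import random


def simulate_reverse_random_walk(T, sigma, anchor=0.0, seed=None):
    """Draw beta_1, ..., beta_T from the prior, starting at beta_T = anchor.

    Each beta_t is drawn from Normal(beta_{t+1}, sigma^2), moving backwards
    in time from election week. Index 0 of the result is week 1.
    """
    rng = random.Random(seed)
    beta = [0.0] * T
    beta[T - 1] = anchor  # anchoring at election week
    for t in range(T - 2, -1, -1):
        beta[t] = rng.gauss(beta[t + 1], sigma)
    return beta


# One prior draw over the 22 weeks of the campaign (sigma is made up).
path = simulate_reverse_random_walk(T=22, sigma=0.05, seed=2014)
print([round(b, 3) for b in path])
```

Because the increments are independent, the prior variance of $\beta_t$ is $(T-t)\sigma^2_\beta$: the national effect is most uncertain at the start of the campaign and shrinks to exactly 0 at election week.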

As for the parameter $\alpha_{it}$, we assume a prior specification $\alpha_{it}\mid \alpha_{i,t+1}\sim\mbox{Normal}(\alpha_{i,t+1},\sigma^2_\alpha)$. In this case, the anchoring at time $T$ is in terms of $\alpha_{iT} \sim \mbox{Normal}\left(\mbox{logit}(h_i),s^2_h\right)$, where $h_i$ is the historical forecast of the incumbent party vote share and is modelled using a full Bayesian specification as a regression with state-specific macroeconomic factors as well as nationwide structural indicators. Specifically, we regress the incumbent candidate’s vote share on factors such as a state-level dummy variable representing the incumbency of a candidate, the incumbent president's approval rating and a dummy variable representing affinity of the incumbent party, and then convert these long-run predictions for the incumbent candidates to Republican party predictions. The full Bayesian specification as well as the inclusion of state-specific variables is what differentiates our model from Linzer's.
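The conversion from an incumbent-party forecast to a Republican one, and its use as the centre of the election-week prior, can be sketched as follows. This is only an illustration under the two-party assumption (a Democratic incumbent share $h$ corresponds to a Republican share $1-h$); the function names and the numbers are hypothetical and the regression that produces $h_i$ is not shown.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def republican_prior_mean(h_incumbent, incumbent_is_republican):
    """Centre of the Normal prior on alpha_iT, on the logit scale.

    Assumes two-party vote shares, so that a forecast h for a Democratic
    incumbent translates into 1 - h for the Republican candidate.
    """
    h_rep = h_incumbent if incumbent_is_republican else 1.0 - h_incumbent
    return logit(h_rep)


# Hypothetical structural forecast: the incumbent party is expected to get 55%.
print(republican_prior_mean(0.55, incumbent_is_republican=True))
print(republican_prior_mean(0.55, incumbent_is_republican=False))
```

The two printed values are mirror images around 0, which on the logit scale corresponds to a 50% vote share.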

In a nutshell, the model can be represented graphically as in the following graph.


4. Results
The first output of our model is the estimation, at election week, of the outcome of each election. 


We compared the predicted two-party vote share distribution with the actual results. The prediction interval is reported as 2 standard deviations (sd) around the predicted mean Republican vote share. A safe Republican seat is coloured red and is defined as the mean being at least 2 sd greater than the 0.5 cut-off; a likely Republican seat is coloured light red and is defined as the mean being larger than the 0.5 cut-off, but with the lower tail touching the line. Democratic seats are defined in the same way, with a blue colour scale. The left axis contains the State names, in the incumbent party’s colours. The green squares show the predictions of our Bayesian adaptation of the model for the historical trends.

The predicted probability of a Republican Senate takeover, according to the model, was 94% by the end of election week. The most probable outcome under our model is a Republican net gain of 7 seats, 1 more than they need to take over the Senate. Republicans had an overwhelming advantage on election day. The model assigns Harry Reid, the Democratic Senate majority leader, only a 6% chance of keeping his job.
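The colour coding described above boils down to a simple rule on the predicted mean and sd, which can be sketched in Python as follows. The function name is made up, and treating a mean of exactly 0.5 as a "toss-up" is an assumption added here for completeness (the definition above does not cover that case).

```python
def classify_seat(mean, sd, cutoff=0.5, width=2.0):
    """Label a seat from the predicted Republican two-party vote share.

    'safe' means the whole mean +/- width*sd interval lies on one side of
    the cutoff; 'likely' means the mean is on that side but the interval
    crosses the cutoff.
    """
    lower, upper = mean - width * sd, mean + width * sd
    if mean == cutoff:
        return "toss-up"  # assumption: not covered by the definition in the text
    if mean > cutoff:
        return "safe Republican" if lower >= cutoff else "likely Republican"
    return "safe Democratic" if upper <= cutoff else "likely Democratic"


# Hypothetical predictions (not the estimates from our model).
for mean, sd in [(0.56, 0.02), (0.52, 0.02), (0.49, 0.02), (0.44, 0.02)]:
    print(mean, sd, classify_seat(mean, sd))
```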

We can also look at the dynamic forecasts for specific states, which show how a stakeholder in a specific race (e.g. the Democratic National Committee) could update their predictions as the weeks go by and use this model to allocate resources amongst the races. These are also useful for a “post-mortem” analysis of the vote, and we can see how actual campaign events match up to changes in the weekly prediction intervals.

For example, the following graph shows the situation for Kansas. This is a good example of what happens when the structural forecast does not give us much information on the re-election chances of the incumbent, and the polls end up being skewed. Here the race was extremely uncertain at the beginning of the monitoring, as one can see from the width of the prediction interval at week 1 (21 weeks to go). 



Incumbent Republican Pat Roberts is shown to be quite unpopular, and is consistently low in the polls up to week 11 (11 weeks to go). This is consistent with the political news at the time, with Roberts barely making it out alive from his Tea Party primary challenge, with less than 50% of the vote. After a marginal gain in the following couple of weeks, due to the indecision in the Democratic party as to whether it was worth competing or endorsing the Independent challenger, Roberts drops due to the Republican campaign effect, coinciding with the low poll numbers for Congress.

Greg Orman, the Independent opponent (here modelled as a Democrat for computational purposes), had to make up for the lack of party structure behind his campaign with personal finances, giving over $1 million to his own campaign. To help Orman get a chance at defeating Roberts, the official Democratic candidate Chad Taylor quit the race at the beginning of September. This seems to have contributed to stalling Roberts' rise for a couple of weeks (16 and 17), but doesn't seem to have had the overall desired effect of tipping the balance in favour of the Independent. The steady rise of Roberts from then on suggests that this was not a race decided by particular events; rather, it was a case of a consistently better campaign on the part of the incumbent. Orman was also outspent by Super PACs, with outside groups supporting Roberts outspending the Independent's by a 2:1 margin. Our model judges the race a toss-up, giving a tiny advantage to the Republican candidate. However, Roberts won by over 5 points! This suggests that the pollsters misjudged the race.

At the other end of the spectrum is the situation for North Carolina $-$ kind of our Achilles' heel. Analysing what happened here, one can immediately see that the structural model for North Carolina is solid: it reduces our uncertainty of the entire race to less than 10 percentage points, with the Republican challenger Thom Tillis at a slight disadvantage, all the way up to week 7 (15 weeks to go).


An oddball in this race was the Libertarian candidate Sean Haugh, who polled remarkably high for a third-party candidate all the way up to election day. His presence breaks the assumption that allowed us to model the 2-party vote share: that it is OK to consider only Democrats and Republicans, as long as independents, third-party candidates and non-respondents in polls break evenly for both parties.

This wasn't the case in this race: especially in the months of June and July, Haugh polled as high as 11%, with a mid-June to mid-August average of around 8%. However, in mid-August his vote share suddenly dropped to about 5%, coinciding with Tillis gaining momentum and bringing the race to a toss-up. The reasons for this drop are not certain; however, it is not far-fetched to think that disgruntled Republican voters wanted to send a message to the Republican establishment, represented by Tillis, and considered voting for the Libertarian candidate as a protest. In accordance with the enlightened preferences theory, as the Republican voters learnt that Tillis was their only chance of preventing the re-election of the incumbent Democrat Kay Hagan, the protest voters gradually came back to the Republican base.

It is worth saying that they were heavily pushed by the two campaigns and by outside groups, who ended up pouring close to $90 million into the campaign, which was dubbed “the most expensive Senate election in history”. This process kept going all the way up to election day, when the Libertarian candidate ended up with only 3.7% of the vote.
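The role of the even-break assumption discussed above can be seen with a little arithmetic, sketched here in Python. The poll figures are made up (they are not North Carolina data) and the function names are hypothetical; the point is only that an even split of third-party and undecided voters leaves the Republican-Democratic margin untouched, while an uneven split, such as protest voters returning mostly to one party, moves it.

```python
def two_party_share(rep, dem):
    """Republican share of the two-party vote, ignoring everyone else."""
    return rep / (rep + dem)


def final_margin(rep, dem, other, frac_to_rep):
    """Republican minus Democratic share once `other` voters are reallocated.

    `frac_to_rep` is the fraction of third-party/undecided voters who end
    up voting Republican; the rest are assumed to vote Democratic.
    """
    rep_final = rep + frac_to_rep * other
    dem_final = dem + (1.0 - frac_to_rep) * other
    return (rep_final - dem_final) / (rep + dem + other)


# Hypothetical poll: 40% Republican, 44% Democratic, 16% other/undecided.
rep, dem, other = 40.0, 44.0, 16.0
print(two_party_share(rep, dem))          # Republican two-party share
print(final_margin(rep, dem, other, 0.50))  # even break: margin unchanged
print(final_margin(rep, dem, other, 0.75))  # break towards the Republican
```

With an even break the Republican stays 4 points behind, exactly as in the raw poll; if three quarters of the remaining voters go Republican, the same poll turns into a 4-point Republican lead.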

Support for Kay Hagan seems to be consistent throughout, and polls don't seem to incorporate much variability, giving the incumbent a small lead. Tillis, the Republican opponent, suffered from the usual Republican drop in week 15, but managed to put on a good show in the last TV debates. The last debate in particular (around 2 weeks before election day) was a big hit for Tillis, as Hagan was a no-show amid criticism over her husband reaping personal benefits from the Obama stimulus package. This propelled him closer to Hagan, but he never quite caught up in the polls. On election day, he won with a 1.5% lead.
