Today I received an email from a prospective PhD candidate, who says that he (or she) would like to do his (or her) PhD under my supervision on a topic of mutual interest. Except his (or her) interests are in Physics or Applied Mathematics.

I know I'm being a bit mean here and that there's no reason why physicists or mathematicians can't work on Bayesian statistics. But this person actually goes on to state that their expertise is in Rheology and Dynamics of Complex Fluids, e.g. particle suspension, droplet and polymer suspension and various other multiphase flows. *That* I find a bit more complex to qualify as a set of mutual interests (with mine!)...
The other day, my colleague Gareth pointed out a very interesting piece of news. The new version of Stata is just out. Now, I'm not a super-Stata user (although I think it's a good package), but the interesting news is that they have now developed a specific module for Bayesian analysis.

This was probably coming $-$ I've recently reviewed a new book exploring the integration of Stata and BUGS, pretty much similar to the R2OpenBUGS package for R (there are in fact a few similar packages to interface MCMC software such as JAGS or Stan or WinBUGS to R). The new version of Stata can now skip this step (by and large) as many "more or less" Bayesian models have been hard-coded and can be fitted using standard Stata commands.

I'll check this out!
**Disclaimer**: I'm fully aware of the obvious conflict of interest here, but also I think that this looks really good, so I'll write about it anyway.

This post is to highlight that Marta's and Michela's book on Spatial and Spatio-temporal Bayesian Models with R - INLA is finally out (I think it can be pre-ordered although it will be officially available early in May). I think the book is really good as it describes the underlying theory of INLA and makes the effort of presenting a unified framework, including examples and R code.

The only downside to this, now that the book is sort-of out of Marta's mind, is that she'll go back to architect-mode and start again to suggest new ways in which we can move the furniture (or even worse, the rooms) around. I should find another topic for her to write a book about, soon...
A couple of weeks ago we decided to create a more formal website for our research group within the department of Statistical Science at UCL.

The group includes the PhD students involved in health economics-related topics (basically all under my supervision, although for all of them the help of my colleagues who act as second supervisors is invaluable!) and some of my colleagues in the department.

The website collects all the relevant information, including a description of the current research projects (mainly at PhD level, but also for the MSc projects and, potentially, for Undergraduate projects), as well as details of our seminar series. We'll also link to some (forthcoming) publications and news.

I've fought (and just about won) to resist the temptation of using the acronym *Sheeva,* as in **S**tatistics and **H**ealth **E**conomic **Eva**luation, for the group $-$ I guess it would have really been too geeky...
This is probably just me being a bit grumpy, but I guess this happens to many people. I have just received an email (and it's not the first time) from a random scientific journal (this time it's a medical journal) inviting me to publish my research.

Except that my research has nothing to do whatsoever with the topics that this journal is interested in. Also, I love how invariably these emails say something like: "*If possible, I would appreciate receiving your submission by*" tomorrow (or a date very very close in the future) and usually also say something like: "*Please respond to this mail by*" today.

At least this one wished me "a nice and healthy day ahead!!"...
*[This is a rather long joint post with Roberto Cerina and complements our paper in the April 2015 issue of Significance]*

**1. Prelude (kind-of unrelated to what follows).**
Last week, Marta and I finished watching the last series of House of Cards, the Netflix adaptation of the original BBC series (which I may have liked even more... I've not decided yet). The show is based around US politics and the fictional President Underwood.

Although political science is of course not my main research area (or even one of my research areas at all, to be more precise!), I'm in general interested in politics and I am interested in (some of) the application of statistical methods to this area. Thus, last year I decided to offer an undergraduate project which would be vaguely related to statistical modelling of political data (eg on polls, elections, etc).

The project caught Roberto's attention and he decided to take it on $-$ in fact, he even started asking for things to read before he was supposed to begin the project! Originally, we intended to work on Italian data (which would have been nice, given that we're both Italian and that the proliferation of parties would have made for more complex modelling); but it was much easier to find data about last year's US Senate elections and so we decided to use those.

**2. Background**
Modelling US Senate elections is more difficult than the presidential elections for two main reasons: firstly, there are far fewer polls per state than there are at the national level for the main "event". This can also mean a high impact for the “house effect” $–$ where polls favour one party or the other, depending on the polling house conducting the study.

The second problem has to do with correctly accounting for the effect of the economy on voting behaviour. In Senate elections, this is thought to be weaker due to the lack of precise “blame” to be directed at the incumbent. Presidents are seen to be responsible for the economy, and are heavily rewarded or penalised for it. However, the same cannot be said (at least to the same extent) for a senator. Nevertheless, we want to include some *long-term* factors in our model, to complement the *short-term* shocks produced by the ongoing polls; to this aim, we collected state-specific data on macroeconomic variables and included them in the model.

**3. The model**
The basic idea is to extend an interesting model by Drew Linzer, which was developed for the 2012 US presidential elections. In a nutshell, the model aims at combining data from the polls that start being conducted weeks before the elections with increasing intensity and data on some historical trends. In particular, the objective is to perform the estimation of the results of the elections in a dynamic way, so that most recent polls tend to weigh more than older ones.

Our main data are the polls; in particular the number $y_{itk}$ of respondents who declared they would vote for the Republican candidate in the $k-$th poll in the $i-$th state (we consider the 33 states in which elections were taking place) at the $t-$th week of the election campaign (we consider a total of $T=22$ weeks), out of the sample size $n_{itk}$.

Then we aggregate the data within each week as $N_{it}=\sum_{k} n_{itk}$ and $Y_{it} = \sum_k y_{itk}$ and model $Y_{it} \sim\mbox{Binomial}(p_{it},N_{it})$. The parameter $p_{it}$ is the object of our inference and represents the probability that a random elector votes for the Republican candidate in state $i$ at week $t$; we model it as
$$ \mbox{logit}(p_{it}) = \alpha_{it} + \beta_t $$
where $\alpha_{it}$ is a state-specific effect on a given week of the campaign and $\beta_t$ is the common trend amongst Republican candidates at the national level.
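
To make the data step concrete, here is a minimal Python sketch of the aggregation over polls and of the Binomial log-likelihood for a single state-week cell. It is only an illustration under stated assumptions: the function names (`aggregate_polls`, `inv_logit`, `binomial_loglik`) and the poll numbers are made up for the example, and this is not the code used for the actual analysis.

```python
import math
from collections import defaultdict


def aggregate_polls(polls):
    """Sum respondents over the polls k within each (state, week) cell.

    `polls` is an iterable of (state, week, y, n) tuples, where y is the
    number of respondents backing the Republican candidate and n is the
    sample size. Returns {(state, week): (Y_it, N_it)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for state, week, y, n in polls:
        totals[(state, week)][0] += y
        totals[(state, week)][1] += n
    return {cell: (Y, N) for cell, (Y, N) in totals.items()}


def inv_logit(x):
    """Map a value on the logit scale back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def binomial_loglik(Y, N, alpha_it, beta_t):
    """Log-likelihood of Y successes out of N when logit(p) = alpha_it + beta_t."""
    p = inv_logit(alpha_it + beta_t)
    log_choose = math.lgamma(N + 1) - math.lgamma(Y + 1) - math.lgamma(N - Y + 1)
    return log_choose + Y * math.log(p) + (N - Y) * math.log(1.0 - p)


# Made-up polls, purely for illustration: (state, week, y, n)
polls = [("KS", 1, 210, 500), ("KS", 1, 330, 700), ("NC", 1, 480, 1000)]
cells = aggregate_polls(polls)
Y_ks, N_ks = cells[("KS", 1)]
loglik_ks = binomial_loglik(Y_ks, N_ks, alpha_it=-0.2, beta_t=0.0)
```

In a full Bayesian fit, this likelihood would be combined with the priors on $\alpha_{it}$ and $\beta_t$ rather than evaluated at fixed values as done here.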

Following the original model of Linzer, we then assume a *reverse random walk* structure on the $\beta$-s:
$$\beta_{t}| \beta_{t+1} \sim \mbox{Normal}(\beta_{t+1}, \sigma^2_\beta ),$$
with $\beta_T := 0$. This encodes the assumption that individual preferences at week $t+1$ will be affected by the preferences at week $t$. In addition, because of the anchoring at 0 for election week, we imply that, as election day gets closer, the Republican vote share will not be affected by national campaign effects on election week.

As for the parameter $\alpha_{it}$, we assume a prior specification $\alpha_{it}\mid \alpha_{i,t+1}\sim\mbox{Normal}(\alpha_{i,t+1},\sigma^2_\alpha)$. In this case, the anchoring at time $T$ is in terms of $\alpha_{iT} \sim \mbox{Normal}\left(\mbox{logit}(h_i),s^2_h\right)$, where $h_i$ is the historical forecast of the incumbent party vote share and is modelled using a full Bayesian specification as a regression with state-specific macroeconomic factors as well as nationwide structural indicators. Specifically, we regress the incumbent candidate’s vote share on factors such as a state-level dummy variable representing the incumbency of a candidate, the incumbent president's approval rating and a dummy variable representing affinity of the incumbent party, and then convert these long-run predictions for the incumbent candidates to Republican party predictions. The full Bayesian specification as well as the inclusion of state-specific variables is what differentiates our model from Linzer's.
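
The reverse random walk structure may be easier to see in a small simulation. The sketch below draws a single path from the *prior* only (no polls are involved), filling in the weeks backwards from the election-week anchor. All numerical values (`sigma_beta`, `sigma_alpha`, `h_i`, `s_h`) are assumptions picked for illustration, with `h_i` taken to be already on the Republican scale, and the function names are hypothetical; none of this comes from the fitted model.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_reverse_walk(anchor, T, sigma, rng):
    """Draw x_1, ..., x_T with x_T = anchor and x_t | x_{t+1} ~ Normal(x_{t+1}, sigma^2).

    The path is filled in backwards in time, starting from election week T.
    """
    path = [0.0] * T
    path[T - 1] = anchor
    for t in range(T - 2, -1, -1):
        path[t] = rng.gauss(path[t + 1], sigma)
    return path


rng = random.Random(2014)  # fixed seed so the sketch is reproducible
T = 22                     # weeks of campaign, as in the text

# Assumed values, for illustration only
sigma_beta, sigma_alpha = 0.05, 0.05
h_i, s_h = 0.52, 0.10

# National trend, anchored at beta_T = 0
beta = simulate_reverse_walk(0.0, T, sigma_beta, rng)

# State effect, anchored at alpha_iT ~ Normal(logit(h_i), s_h^2)
alpha_T = rng.gauss(logit(h_i), s_h)
alpha = simulate_reverse_walk(alpha_T, T, sigma_alpha, rng)

# Implied prior path of the Republican vote probability in state i
p = [inv_logit(a + b) for a, b in zip(alpha, beta)]
```

Because both walks are anchored at week $T$, the prior uncertainty about $p_{it}$ is largest at the start of the campaign and shrinks towards the structural forecast as election week approaches.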

In a nutshell, the model can be represented graphically as in the following graph.

**4. Results**
The first output of our model is the estimation, at election week, of the outcome of each election.

We compared the predicted two-party vote share distribution with the actual results. The prediction interval is reported as 2 standard deviations (sd) around the predicted mean Republican vote share. A safe Republican seat is coloured red and is defined as the mean being at least 2 sd greater than the 0.5 cut-off; a likely Republican seat is coloured light red and is defined when the mean is larger than the 0.5 cut-off, but the lower tail touches the line. Democratic seats are defined in the same way, with a blue colour scale. The left axis contains the State names, in the Incumbent party’s colours. The green squares belong to the predictions of our Bayesian adaptation of the model for the historical trends. The predicted probability of a Republican Senate takeover, according to the model, was 94% by the end of election week. The most probable outcome under our model is predicted to be a Republican net gain of 7 seats, 1 more than they need to take over the Senate. Republicans had an overwhelming advantage on election day. The model assigns Harry Reid, the Democratic Senate majority leader, only a 6% chance of keeping his job.
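
The colour coding described above amounts to a simple rule on the predicted mean and sd of the Republican two-party vote share. A minimal sketch follows; the function name `classify_seat` is hypothetical, and the "toss-up" label for a mean exactly at the cut-off is an assumption, since the text does not define that case.

```python
def classify_seat(mean, sd, cutoff=0.5):
    """Classify a seat from the predicted mean and sd of the Republican share.

    The prediction interval is taken as mean +/- 2 sd, as in the text.
    """
    lower, upper = mean - 2 * sd, mean + 2 * sd
    if lower >= cutoff:
        return "safe Republican"      # whole interval above the cut-off
    if upper <= cutoff:
        return "safe Democratic"      # whole interval below the cut-off
    if mean > cutoff:
        return "likely Republican"    # mean above, lower tail crosses the line
    if mean < cutoff:
        return "likely Democratic"    # mean below, upper tail crosses the line
    return "toss-up"                  # assumption: case not defined in the text


# Made-up predictions, purely for illustration
examples = {
    "State A": classify_seat(0.56, 0.02),
    "State B": classify_seat(0.51, 0.02),
    "State C": classify_seat(0.44, 0.02),
}
```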
We can also look at the dynamic forecasts for specific states, which show how a stakeholder in a specific race (e.g. the Democratic National Committee) updates its predictions as the weeks go by, and can use this model to allocate resources amongst the races. These are also useful for a “post-mortem” analysis of the vote, and we can see how actual campaign events match up to changes in the weekly prediction intervals.

For example, the following graph shows the situation for Kansas. This is a good example of what happens when the structural forecast does not give us much information on the re-election chances of the incumbent, and the polls end up being skewed. Here the race was extremely uncertain at the beginning of the monitoring, as one can see from the width of the prediction interval at week 1 (21 weeks to go).

Incumbent Republican Pat Roberts is shown to be quite unpopular, and is consistently low in the polls up to week 11 (11 weeks to go). This is consistent with the political news at the time, with Roberts barely making it out alive from his Tea Party primary challenge, with less than 50% of the vote. After a marginal gain in the following couple of weeks, due to the indecision in the Democratic party as to whether it was worth competing or endorsing the Independent challenger, Roberts drops due to the Republican campaign effect, coinciding with the low poll numbers for Congress.

Greg Orman, the Independent opponent (here modelled as a Democrat for computational purposes), had to make up for the lack of party structure behind his campaign with personal finances, giving over $1 million to his own campaign. To help Orman get a chance at defeating Roberts, the official Democratic candidate Chad Taylor quit the race at the beginning of September. This seems to have contributed to stalling Roberts' rise for a couple of weeks (16 and 17), but doesn't seem to have had the overall desired effect of tipping the balance in favour of the Independent. The steady rise of Roberts from then on suggests that this was not a race decided by particular events; rather, it was a case of a consistently better campaign on the part of the incumbent. Orman was also outspent by Super PACs, with outside groups supporting Roberts outspending the Independent's by a 2:1 margin. Our model judges the race a toss-up, giving a tiny advantage to the Republican candidate. However, Roberts won by over 5 points! This suggests that the pollsters misjudged the race.

At the other end of the spectrum is the situation for North Carolina $-$ kind of our Achilles' heel. Analysing what happened here, one can immediately see that the structural model for North Carolina is solid: it reduces our uncertainty of the entire race to less than 10 percentage points, with the Republican challenger Thom Tillis at a slight disadvantage, all the way up to week 7 (15 weeks to go).

An oddball in this race was the Libertarian candidate Sean Haugh, who polled vertiginously high for a third-party candidate all the way up to election day. His presence breaks the assumption that allowed us to model the 2-party vote share, which is that it is OK to only consider Democrats and Republicans, as long as independents, third-party candidates and non-respondents in polls break evenly for both parties.

This wasn't the case in this race: especially in the months of June and July, Haugh polled as high as 11%, with a mid-June to mid-August average of around 8%. However, in mid-August his vote share suddenly drops to about 5%, coinciding with Tillis gaining momentum and bringing the race to a toss-up. The reasons for this drop are not certain; however, it is not far-fetched to think that disgruntled Republican voters wanted to send a message to the Republican establishment, represented by Tillis, and considered voting for the Libertarian candidate as a protest. In accordance with the enlightened preferences theory, as the Republican voters learnt that Tillis was their only chance of preventing the re-election of the incumbent Democrat Kay Hagan, the protest voters gradually came back to the Republican base.

It is worth saying that they were heavily pushed by the two campaigns and by outside groups, who ended up pouring close to $90 million into the campaign, dubbing it "the most expensive Senate election in history". This process kept going all the way up to election day, when the Libertarian candidate ended up leaking all but 3.7% of the vote.

Support for Kay Hagan seems to be consistent throughout, and polls don't seem to incorporate much variability, giving the incumbent a small lead. Tillis, the Republican opponent, suffered from the usual Republican drop in week 15, but managed to put on a good show in the last TV debates. The last debate in particular (around 2 weeks before election day) was a big hit for Tillis, as Hagan was a no-show amid criticism over her husband reaping personal benefits from the Obama stimulus package. This propelled him closer to Hagan, but he never quite caught up in the polls. On election day, he won with a 1.5% lead.

Because I'm involved in many collaborative projects, some of which luckily involve LaTeX, and because I'm trying (sort-of succeeding) to spend as much time as possible outside the office (mostly failing) to work on the books, in the past few weeks I've found myself wanting some track-changes utility for the work I was sharing with my LaTeX-savvy colleagues. *[Could this be a candidate for the longest opening sentence of a post, ever?]*

I had a quick look online and found this very nice package $-$ it's probably well established, but I'd not encountered it before, so I was very pleased to discover it.

It works quite smoothly and lets you annotate the original .tex file with changes, additions and notes. And what's even nicer is that the compiled document has some mark-up (eg different colour for new text), but it's not very cluttered, so that you can read fairly easily the current version with notes.

Speaking of LaTeX, I also found a couple of other useful programmes: the first one is a Perl script that creates the BibTeX code of a given reference $-$ basically you can copy and paste the full reference of a text of interest and the script will return the BibTeX entry to paste into a .bib file. The second one searches PubMed and retrieves the BibTeX code for the hits that match the search string.

Again, both are probably quite old and well established. But it was quite serendipitous.
This is not really news any more, but I still think it's an interesting story.

Last week the journal *Basic and Applied Social Psychology* published an editorial setting out their views (or rather prescriptions) for how statistical analyses should be conducted in papers that seek publication with them.

The editorial starts by effectively banning the use of p-values and null hypothesis significance testing, which "*is invalid, and thus authors would be not required to perform it*". Then it goes on to say that "*Bayesian procedures are more interesting*", but also suffer from issues with "*Laplacian assumption*" (non-informative priors) and therefore they "*reserve the right to make case-by-case judgments, and thus Bayesian procedures are neither required nor banned from BASP*".

The conclusion of the editorial is then that basically psychologists do not need to bother with *any* inferential procedure, "*because the state of the art remains uncertain. However, BASP will require strong descriptive statistics, including effect sizes. We also encourage the presentation of frequency or distributional data when this is feasible*".

This has caused quite a stir among many statisticians (and I think psychologists should join the protest!). Here's a series of responses by important statisticians. I personally think that some of the problem at least is the view that statistics is some sort of recipe-book: if you have such and such data collection, then do a t-test; if you have such and such a design, then do an ANOVA; or perhaps if you have this other data, then use meta-analysis and throw in some priors-kind of thing $-$ I'm no real expert here, but I think that psychology as a field suffers particularly from this problem (perhaps for historical reasons?).

Most importantly, this reminds me of my first ISBA conference, back in 2006 (I think that's the last time it was held in the Valencia area). The final night of the conference, some attendees prepare some entertainment and that year, together with a few (back then) young friends, we prepared a news broadcast $-$ we spent most of the last day of the conference doing this, rather than attending the talks, I'm half proud, half ashamed to confess.

Anyway, among the "serious" news we were reporting was a riot that had happened outside the conference hotel, where frequentists had come en masse to protest, waving placards reading "We value p-value!" (worryingly, we also reported that Alan Gelfand, then-President of ISBA, had to be transferred to a secure location).