Gianluca Baio's blog: July 2014

Friday, 25 July 2014

Pat pat

This is probably akin to an exercise in self-pleasing, but I'll indulge in this anyway to celebrate the fact that our paper on the bias in the Eurovision song contest voting (the last in a relatively long series of posts on this is here) now has over 4,000 "article views".

The Journal of Applied Statistics website defines these as: "Article usage statistics combine cumulative total PDF downloads and full-text HTML views from publication date [23 Apr 2014, in our case] to 23 Jul 2014. Article views are only counted from this site."

In case you're wondering, neither Marta nor I have actually downloaded the paper to boost the numbers!

Monday, 7 July 2014

The Oracle (8) - let's go all the way!

This is (may be) the final post in the series dedicated to the prediction of the World Cup results $-$ I'll try and actually write another to wrap things up and summarise a few comments, but this will probably be a bit later on. Finally, we've decided to use our model, which so far has been applied incrementally, ie stage-by-stage, to predict the result of both the semifinals and the finals.

The first part is relatively straightforward; the quarter finals have been played and we do know the results that have occurred. Thus, we can re-iterate the procedure (which we described here) and i) update the data with the observed results; ii) update the "current form" variable and the offset; iii) re-run the model to estimate each team's propensity to score; iv) predict the result of the unobserved games $-$ in this case the two semifinals (Brazil-Germany and Argentina-Netherlands).

However, to give the model a nice twist, I thought we should include some piece of extra information that is available right now, ie the fact that Brazil will, for certain, play their semifinal without their suspended captain Thiago Silva and their injured "star player" Neymar (who will also miss the final, due to the gravity of his injury). Thus, we ran the model by modifying the offset variable (see a more detailed description here) for Brazil, to slightly decrease their "short-term" quality. [NB: if this were a "serious" model, we would probably try to embed these changes in a more formal way, rather than as "ad hoc" modifications to the general set up. Nevertheless, I believe that the possibility of dealing with additional information, possibly in the form of subjective/expert knowledge, is actually a strength of the modelling framework. Of course, you could say that the selection of the offset distribution is arbitrary and that other choices were possible $-$ that's of course true and a "serious" model would certainly require more extensive sensitivity analysis at this stage!]

Using this formulation of the model, we get the following results, in terms of the overall probability of going through to the final (ie accounting for potential draws in the 90 minutes and then extra times and possibly penalties, as discussed here):

Brazil	Germany	0.605	0.395
Argentina	Netherlands	0.510	0.490

So, the second semifinal is predicted to be much tighter (nearly 50:50), while Brazil are still favourites to reach the final, according to the model prediction.

As I said earlier, however, this time we've gone beyond the simple one-step prediction and have used these results to also re-run the model before the actual results of the semifinals are known and thus predict the overall outcome, ie who's winning the World Cup.

Overall, our estimation gives the following probabilities of winning the championship (these may not sum to 1 because of rounding):

Brazil: 0.372
Germany: 0.174
Argentina: 0.245
Netherlands: 0.206
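
One simple way to see how semifinal and final predictions combine is the law of total probability: a team's chance of winning the cup is its chance of reaching the final, times its chance of beating each possible opponent, weighted by how likely that opponent is to get there. The sketch below uses the semifinal probabilities from the table above, but the head-to-head probabilities for the final (`final_win`) are hypothetical 50:50 placeholders, not output from our model, so the numbers it prints will not match the list above. The function name `championship_probs` is also just an illustration.

```python
def championship_probs(semi1, semi2, final_win):
    """Combine semifinal and final probabilities.

    semi1, semi2: dict mapping team -> probability of reaching the final,
                  one dict per semifinal.
    final_win:    dict mapping (team, opponent) -> probability that `team`
                  beats `opponent` in the final.
    """
    probs = {}
    for side, other in ((semi1, semi2), (semi2, semi1)):
        for team, p_reach in side.items():
            # Average over the possible opponents coming from the other semifinal.
            p_final = sum(p_opp * final_win[(team, opp)]
                          for opp, p_opp in other.items())
            probs[team] = p_reach * p_final
    return probs


# Semifinal probabilities as reported in the post.
semi1 = {"Brazil": 0.605, "Germany": 0.395}
semi2 = {"Argentina": 0.510, "Netherlands": 0.490}

# ASSUMPTION: placeholder head-to-head probabilities for the final (all 50:50).
final_win = {}
for a in semi1:
    for b in semi2:
        final_win[(a, b)] = 0.5
        final_win[(b, a)] = 0.5

probs = championship_probs(semi1, semi2, final_win)
for team, p in probs.items():
    print(f"{team}: {p:.3f}")
```

Whatever head-to-head values are plugged in, the four resulting probabilities sum to 1 as long as each pair in `final_win` is complementary.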

Of course, these probabilities encode extra uncertainty, because we're going one extra step forward in the future $-$ we don't know which of the potential futures will occur for the semifinals. Leaving the model aside, I think I would probably like the Netherlands to win $-$ if only for the fact that in that way, Italy would still be the 2nd most frequent World Cup winners, only one title behind Brazil, and one and two above Germany and Argentina, respectively.

Thursday, 3 July 2014

The Oracle (7)

We're now down to 8 teams left in the World Cup. Interestingly, despite a pretty disappointing display by some of the (more or less rightly so) highly rated teams, such as Spain, Italy, Portugal or England, European sides are exactly 50% of the lot. Given the quarter final game between France and Germany, at least one European team is certain to reach the semifinals. Also, it is worth noticing that the 8 remaining teams are the group winners $-$ which kind of confirms Michael Wallace's point.

We've now re-updated the data, the "form" and the "offset" variables (as briefly explained here) using the results of the round of 16. The model had predicted (as shown in the graphs here) wide uncertainty for the potential outcomes of the games (also, we had not included the added complication of extra times & penalties $-$ more on this later). I believe this has been confirmed by the actual games. In many cases (in fact, probably all but the Colombia-Uruguay game, which was kind-of-dominated by the former), the games have been substantially close. As a result, we've observed a slightly higher than usual proportion of games ending up at extra times.

So, we've also complicated (further!) our model to estimate the result by including extra times and penalties. In a nutshell, when the game is predicted to be a draw (ie the predicted number of goals scored by the two teams is the same), then we've additionally simulated the outcome of extra times.

In doing this, we've used the same basic structure as for the regular time, but we've added a decremental factor to the linear predictor (describing the "propensity" of team A to score when playing against team B). This makes sense, since the duration of extra time is 1/3 of the normal game. Also, there is added pressure and teams normally tend to be more conservative. Thus, in this prediction, we've increased the chance of observing 0 goals and accounted for the shorter time played. If the prediction is still for a draw, then we've determined the winner by assuming that penalty shoot outs are essentially a randomising device $-$ each team has a 50% chance of winning them.
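
To make the logic concrete, here is a minimal sketch of how the three stages (90 minutes, extra time, penalties) combine into a single probability of going through. It is a deliberate simplification: it assumes fixed scoring rates and independent Poisson goals, and computes the probabilities exactly instead of simulating from the posterior predictive distribution as the actual model does. The function names, the example rates and the default `log_decrement` (log 3 for the shorter duration plus an arbitrary 0.25 for "caution") are all assumptions made up for illustration.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def outcome_probs(rate_a, rate_b, max_goals=25):
    """Return (P(A outscores B), P(draw), P(B outscores A)).

    Goals are assumed independent Poisson; the sums are truncated
    at max_goals, which is negligible for realistic scoring rates.
    """
    pa = [poisson_pmf(k, rate_a) for k in range(max_goals + 1)]
    pb = [poisson_pmf(k, rate_b) for k in range(max_goals + 1)]
    win_a = sum(pa[i] * pb[j] for i in range(max_goals + 1) for j in range(i))
    win_b = sum(pb[i] * pa[j] for i in range(max_goals + 1) for j in range(i))
    draw = sum(pa[i] * pb[i] for i in range(max_goals + 1))
    return win_a, draw, win_b


def prob_progress(rate_a, rate_b, log_decrement=math.log(3) + 0.25,
                  penalty_prob=0.5):
    """Probability that team A goes through a knockout game.

    log_decrement is subtracted from the log scoring rate for extra time
    (hypothetical default value). penalty_prob is A's chance of winning
    a shoot out; 0.5 treats penalties as a coin toss.
    """
    win_90, draw_90, _ = outcome_probs(rate_a, rate_b)
    # Subtracting on the log scale is the same as scaling the rate down.
    shrink = math.exp(-log_decrement)
    win_et, draw_et, _ = outcome_probs(rate_a * shrink, rate_b * shrink)
    return win_90 + draw_90 * (win_et + draw_et * penalty_prob)


# Made-up scoring rates, purely for illustration.
print(round(prob_progress(1.5, 1.2), 3))
```

With equal scoring rates the function returns 0.5, and the probabilities for the two teams in a game always add up to 1, which is a useful sanity check on the structure.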

These are the contour plots for the posterior predictive distribution of the goals scored in the quarter finals, based on our revised model.

Basically all games are again quite tight $-$ perhaps with the (reasonable?) exception of Netherlands-Costa Rica in which the Dutch are favourites and predicted to have a higher chance of scoring more goals (and therefore winning the game).

As shown in the above graph, draws are quite likely in almost all the games; the European derby is probably the closest game (and this seems to make sense given both the short- and long-term standing of the two teams). Brazil and Argentina both face tough opponents (based on the model $-$ but again, in line with what we've seen so far).

Using the result of the model in terms of prediction of the results at extra time & penalties, we estimate the overall probability of winning the game (ie either within 90 minutes or beyond) as

Brazil	Colombia	0.657	0.343
Netherlands	Costa Rica	0.776	0.224
France	Germany	0.497	0.503
Argentina	Belgium	0.607	0.393

(in the above table, the third and fourth columns give the predicted chance that the team in the first and second column, respectively, wins the game and progresses to the semifinals).

One final remark, which I think is generally interesting, is that by the time we've reached the quarter finals, the value of the "current form" variable for Brazil (who started as hot favourites based on the evidence synthesis of the published odds that we've used to define it at the beginning of the tournament) is lower than that of their opponent. But again, Colombia have sort of breezed through all of their games so far, while Brazil have kind of stuttered and have not won games that they probably should have (taking at face value their "strength"). This doesn't seem enough to make Colombia favourites in their game against the host $-$ but beware of surprises! After all, the distribution of the possible results is not so clear cut...

Wednesday, 2 July 2014

Short course: Bayesian methods in health economics

Chris, Richard and I tested this last March in Canada (see also here) and things seem to have gone quite well. So we have decided to replicate the experiment (so that we can get a bigger sample size!) and do the short course this coming November (3-5th), at UCL.

Full details (including links for registration) are available here. As we formally say in an advert we've circulated on a couple of relevant mailing lists:

"This course is intended to provide an introduction to Bayesian analysis and MCMC methods using R and MCMC sampling software (such as OpenBUGS and JAGS), as applied to cost-effectiveness analysis and typical models used in health economic evaluations.

The course is intended for health economists, statisticians, and decision modellers interested in the practice of Bayesian modelling and will be based on a mixture of lectures and computer practicals, although the emphasis will be on examples of applied analysis: software and code to carry out the analyses will be provided. Participants are encouraged to bring their own laptops for the practicals.

We shall assume a basic knowledge of standard methods in health economics and some familiarity with a range of probability distributions, regression analysis, Markov models and random-effects meta-analysis. However, statistical concepts are reviewed in the context of applied health economic evaluations in the lectures."

The timetable and additional info are here.
