Gianluca Baio's blog: June 2014

Saturday 28 June 2014

Break!

Just to break the mono-thematic nature of the recent posts, I thought I'd just link to this article, which has appeared on the Significance website.

That's an interesting analysis conducted by researchers at the LSE, debunking the myth that migrants in the UK have unfair advantages in accessing social housing.

I believe these kinds of findings should be made as widely accessible as possible, because of the incredible way in which these myths are used to build up conspiracy theories or political arguments that are passed off as being based on "facts" but are, in fact, just a partial (if not outright biased) view of a phenomenon.

Our paper on the Eurovision song contest (ESC) (here, here and here) is of course far less important to society than housing; but it struck me that a few of the (nasty) comments we got in some of the media that reported the press release of our findings were effectively along these lines: I think Europe hate Britain and the fact that we don't win the ESC is evidence of that; you say otherwise based on numbers; that doesn't move my position by an inch.

At the time, I felt a bit petty because this bothered me a bit. But, a couple of days later, I came across a couple of articles (here and here) on one politician who used the same kind of reasoning to make their point (eg Britain should get out of the EU $-$ the ESC is evidence that they hate us). Which bothered me even more...

Friday 27 June 2014

The Oracle (6)

Quick update, now that the group stage is finished. We needed a few tweaks to the simulation process (described in some more detail here), which we spent some time debating and implementing.

First off, the data on the last World Cups show that during the knockout stage, there are substantially fewer goals scored. This makes sense: from tomorrow it's make or break. This wasn't too difficult to deal with, though $-$ we just needed to modify the distribution for the zero component of the number of goals ($\pi$, as described here). In this case, we've used a distribution centered around 12% with most of the mass concentrated between 8% and 15%.
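
As a rough illustration of what such a distribution for $\pi$ could look like, here is a minimal Python sketch. The Beta family, the standard deviation of 0.02 and the helper name `beta_from_mean_sd` are assumptions made purely for illustration; the text above only says that the distribution is centered around 12% with most of the mass between 8% and 15%.

```python
import random

def beta_from_mean_sd(mean, sd):
    """Moment-match a Beta(a, b) distribution to a given mean and standard deviation."""
    common = mean * (1 - mean) / sd**2 - 1
    return mean * common, (1 - mean) * common

# Assumed values: mean 0.12 as in the text, sd 0.02 chosen for illustration only
a, b = beta_from_mean_sd(0.12, 0.02)

rng = random.Random(2014)
draws = [rng.betavariate(a, b) for _ in range(20000)]

mean_pi = sum(draws) / len(draws)
mass_in_range = sum(0.08 <= d <= 0.15 for d in draws) / len(draws)
print(f"a = {a:.2f}, b = {b:.2f}")
print(f"simulated mean = {mean_pi:.3f}, mass in [0.08, 0.15] = {mass_in_range:.2f}")
```

With these assumed values, the bulk of the simulated draws falls in the 8% to 15% range; a different standard deviation would of course change how much.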

These are the predictions for the 8 games. Brazil, Germany, France and (only marginally) Argentina have a probability of winning exceeding 50%. The other games look closer.

Technically, there is a second issue, which is of course that in the knockout stage draws can't really happen $-$ eventually the game ends either after extra time or on penalties. For now, we'll just use this prediction, but I'm trying to think of a reasonable way to include the extra complication in the model; the main difficulty is that in extra time the propensity to score drops even further $-$ about 30% of the games that go to extra time end up at penalties. I'll try and update this (if not for this round, possibly for the next one).

Monday 23 June 2014

The Oracle (5. Or: Calibration, calibration, calibration...)

First off, a necessary disclaimer: I haven't been able to write this post before a few of the games of the final round of the group stage have been played, but I have not watched the games so far and have run the model to predict round 3 as if none of the games had been played.

Second, we've been thinking about the model and whether we could improve it in light of its predictive ability proven so far. Basically, as I mentioned here, the "current form" variable may not be very well calibrated to start with $-$ recall it's based on an evidence synthesis of the odds against each of the teams, which is then updated given the observed results and weighting them by the predicted likelihood of each outcome occurring.

Now: the reasoning is that, often, to do well at competitions such as the World Cup, you don't necessarily need to be the best team (although this certainly helps!) $-$ you just need to be the best in that month or so. This goes, it seems to me, over and above the level and impact of "current form".

To make a clearer example, consider Costa Rica (arguably the dark horses of the tournament, so far): the observed results (two wins against relatively highly rated Uruguay and Italy) have improved their "strength", by nearly doubling it in comparison to the initial value (based on the evidence synthesis of a set of published odds). However, before any game was played, they were considered the weakest team among the 32 participating nations. Thus, even after two big wins (and we've accounted for the fact that these have been big wins!), their "current form/strength" score is still only 0.08 (on a scale from 0 to 1).

Consequently, by just re-running the model including all the results from round 2 and the updated values for the "current form" variable, the prediction for their final game is a relatively easy win for their opponent, England, who on the other hand have had a disappointing campaign and have already been eliminated. The plots below show the prediction of the outcomes of the remaining 16 games, according to our "baseline" model.

So we thought of a way to potentially correct for this idiosyncrasy of the model [Now: if this were serious work (well $-$ it is serious, but it is also a bit of fun!) I wouldn't necessarily do this in all circumstances (although I believe what I'm about to say makes some sense)].

Basically, the idea is that because it is based on the full dataset (which sort of accounts for "long-term" effects), the "current form" variable describes the increase (decrease) in the overall propensity to score of team A when playing team B. But at the World Cup, there are also some additional "short-term" effects (eg the combination of luck, confidence, good form, etc) that the teams experience just in that month.

We've included these in the form of an offset, which we first compute (I'll describe how in a second) and then add to the estimated linear predictor. This, in turn, should make the simulation of the scores for the next games more in line with the observed results $-$ thus making for better predictions (and avoiding not-too-realistic ones).

The computation of the offset is like so: first, we compute the difference between the expected number of points accrued so far by each of the teams and the observed value. Then, we've labelled each team as doing "Much better", "Better", "As expected", "Worse" or "Much worse" than expected, according to the magnitude of the difference between observed and expected.

Since each team has played 2 games so far, we've applied this rule:

Teams with a difference of more than 4 points between observed and expected are considered to do "much better" (MB) than expected;
Teams with a difference of 3 or 2 points between observed and expected are considered to do "better" (B) than expected;
Teams with a difference between -1 and 1 point between observed and expected are considered to do just "as expected" (AE);
Teams with a difference of -2 or -3 points between observed and expected are considered to do "worse" (W) than expected;
Teams with a difference of less than -4 points between observed and expected are considered to do "much worse" (MW) than expected.
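
The rule above can be sketched as a small Python function. This is only an illustration: the function name `classify` is made up, the difference is rounded to the nearest whole point, and a gap of exactly 4 (or -4) points is assumed to fall in the extreme category, since the list leaves that boundary case open.

```python
def classify(observed_points, expected_points):
    """Label a team's performance from the rounded gap between observed and expected points."""
    # Note: Python's round() sends exact halves to the nearest even integer
    diff = round(observed_points - expected_points)
    if diff >= 4:       # assumption: a gap of exactly 4 counts as "much better"
        return "MB"
    if diff >= 2:
        return "B"
    if diff >= -1:
        return "AE"
    if diff >= -3:
        return "W"
    return "MW"         # -4 or below (again assuming -4 itself is included)

# Hypothetical expected points after two games, not values from the model
examples = [(6, 1.2), (3, 2.8), (1, 3.4), (0, 4.3)]
for observed, expected in examples:
    print(observed, expected, classify(observed, expected))
```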

Roughly speaking this means that if you're exceeding expectation by more than 66% then we consider this to be outstanding, while if your difference with the expectation is within $\pm$ 20%, then you're effectively doing as expected. Of course, this is an arbitrary categorisation $-$ but I think it is sort of reasonable.

Then, the offset is computed using some informative distributions. We used Normal distributions based on average inflations (deflations) of 1.5, 1.2, 1, 0.8 and 0.5, respectively for MB, B, AE, W and MW performances. We chose the standard deviations for these distributions so that the chance of an offset greater than 1 on the natural scale (meaning an increase in the performance predicted by the "baseline" model) would be approximately 1 (for MB), .9 (for B), .5 (for AE), .1 (for W) and 0 (for MW). The following picture shows this graphically.
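
To see what standard deviations this kind of calibration implies, here is a minimal Python sketch that solves for the standard deviation of a Normal distribution given its mean and the target probability of exceeding 1. The helper name `sd_for_target` is hypothetical, and 0.999 and 0.001 are assumed stand-ins for "approximately 1" and "approximately 0". For the AE category the mean is exactly 1, so any standard deviation gives a probability of 0.5 and nothing needs solving.

```python
from statistics import NormalDist

def sd_for_target(mean, prob_above_one):
    """Standard deviation such that P(X > 1) = prob_above_one for X ~ Normal(mean, sd)."""
    # P(X > 1) = Phi((mean - 1) / sd), so sd = (mean - 1) / Phi^{-1}(prob_above_one)
    return (mean - 1) / NormalDist().inv_cdf(prob_above_one)

# Means and target probabilities from the text; 0.999 and 0.001 are assumptions
targets = {"MB": (1.5, 0.999), "B": (1.2, 0.9), "W": (0.8, 0.1), "MW": (0.5, 0.001)}

sds = {label: sd_for_target(mean, p) for label, (mean, p) in targets.items()}
for label, sd in sds.items():
    print(label, round(sd, 3))
```

If the offset enters the linear predictor on the log scale, then on the natural scale an offset of, say, 1.2 simply multiplies the baseline expected number of goals by 1.2.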

Including the offsets computed in this way produces the results below.

The Costa Rica-England game is now much tighter $-$ England are still predicted to have a higher chance of winning it, but the joint posterior predictive distribution of the goals scored looks quite symmetrical, indicating how close the game is predicted to be.

So, based on the results of the model including the offset, these are our predictions for the round of 16.

Thursday 19 June 2014

The Oracle (4)

As promised, some consideration of our model performance, so far. I've produced the graph below, which for each of the first 16 games (ie the games it took for all the 32 teams to be involved once) shows the predictive distribution of the results. The blue line (and the red label) indicate the result that was actually observed.

I think that the first thing to mention is that, in comparison to simple predictions that are usually given as "the most likely result", we should acknowledge quite a large uncertainty in the predictive distributions. Of course, we know well that our model is not perfect and part of it may be due to its inability to fully capture the stochastic nature of the phenomenon. But I think much of this uncertainty is real $-$ after all, often games really are decided by episodes and luck.

For example, we would have done much better in terms of predicting the result for Switzerland-Ecuador if the final score had been 1-1, which it was until the very last minute of added time. Even an Ecuador win by 2-1 would have been slightly more likely according to our model. But seconds before Switzerland's winner, Ecuador had a massive chance to score themselves $-$ had they taken it, our model would have looked a bit better...

On the plus side, also, I'd say that in most cases the actual result was not "extreme" in the predictive distribution. In other words, the model often predicted a result that was reasonably in line with the observed truth.

Of course, in some cases the model didn't really have a clue, considering the actual result. For example, we couldn't predict Spain's thrashing at the hands of the Dutch (the observed 1-5 was way on the right tail of the predictive distribution). To a lesser extent, the model didn't really see Costa Rica's victory against Uruguay coming, or the size of Germany's win against Portugal. But I believe these were (more or less) very surprising results $-$ I don't think many people would have bet good money on them!

Maybe, as far as our model is concerned, the "current form" variable (we talked about this here) is sort of non-perfectly calibrated. Perhaps we start with too high a value for some of the teams (eg Spain or Uruguay, or even Brazil, to some degree), when in reality they were not so strong (or at least not as much stronger than the rest). For Spain especially, this is probably not too surprising: after all, in the last four World Cups the reigning champions have been eliminated straight after the first round on three occasions (France in 2002, Italy in 2010 and now Spain $-$ Brazil didn't go past the quarter finals in 2006, which isn't great either).

If one were less demanding and were to measure the model performance on how well it did in predicting wins/losses (see the last graph here) instead of the actual score, then I think the model would have done quite OK. The outcome of 9 out of 16 games was correctly predicted; in 4 cases, the outcome was the opposite of what was predicted (ie we said A win and actually B won $-$ but one of these is the infamous Switzerland-Ecuador, one is the unbelievable Spain defeat to the Netherlands and another was a relatively close game such as Ghana-USA). 2 were really close games (England-Italy and Iran-Nigeria) that could have really gone either way, but our uncertainty was clearly reflected in nearly Uniform distributions. In one case (Russia-South Korea), the model had predicted a relatively clear win, while a draw was observed.

Wednesday 18 June 2014

The Oracle (3)

Yesterday was the end of round 1 for the group stage $-$ this means that all 32 teams have played at least once (in fact, Brazil and Mexico have now played twice). So, it was time for us to update our model including a new measure of "current form" for each of the teams.

We based this on the previous value (which as explained here was estimated using a synthesis of published odds against each of the teams, suitably rescaled to produce a score in [0;1], with values close to 0 indicating a weak/not in form team). These were modified accounting for the result of the first game played in the World Cup.

In particular, we computed the increase (decrease) in the original score weighting by the estimated probability of actually getting the observed result $-$ I think this makes sense: a strong team beating a weak opponent will probably increase in their current form (eg because this will naturally boost their confidence and get them closer to qualifying for the next round). But this increase will probably be marginal, in comparison to a weak team surprisingly beating one of the favourites.

By weighting the updated scores we should get a better quantification of the current strength of each team. Also, by using some form of average including the starting value, we are not overly moved by just one result. So, for example, Spain do decrease their "form score" from 0.93 to 0.68 by virtue of their not very likely (and yet observed!) thorough defeat against the Netherlands; but without accounting for the very high original score, the updated one would have been 0.42 $-$ quite a bit lower.
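
The exact averaging scheme is not spelled out here, but a small worked example helps to see the orders of magnitude. The sketch below assumes a plain weighted average with equal weights (an assumption, as is the function name `blend`); with Spain's two numbers it gives about 0.675, close to the 0.68 quoted above.

```python
def blend(previous_score, result_based_score, weight_previous=0.5):
    """Weighted average of the pre-tournament form score and the result-based one."""
    return weight_previous * previous_score + (1 - weight_previous) * result_based_score

# Spain's figures from the text: 0.93 before the tournament, 0.42 from the result alone
spain_updated = blend(0.93, 0.42)
print(f"{spain_updated:.3f}")
```

A larger `weight_previous` would make the score even less sensitive to a single surprising result.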

So, we can re-run the model with these two extra features:

the observed results for games 1-17 (including Brazil-Mexico);
the updated form score for each of the teams

and use the new estimated values to predict the results of the next round of games (18-32). The graphs below show contours of the estimated joint predictive distributions of the scored goals. I've also included the 45 degrees line $-$ that would be where the draws occur, so if the distribution is effectively bisected by the line, a draw is the most likely outcome. If, on the other hand, the line does not divide the distribution in two kind-of-equal halves, then one of the two teams has a higher chance of winning (as they are more likely to score more goals $-$ see for example the first plot for Netherlands-Australia).

These analyses translate into the estimation of the (model-based) chance of winning/drawing/losing the games for the two opponents. The situation seems a bit more clear-cut than it was for some of the games in round 1 $-$ England seems to be involved again in one of the closest games, although this time they seem to have an edge on Uruguay.
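
To make the link between a joint distribution of goals and the win/draw/loss chances concrete, here is a minimal Python sketch. It assumes that, given their scoring rates, the two goal counts are independent Poisson variables, which is a simplification of the model; the rates and the function names are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    """P(K = k) for K ~ Poisson(rate)."""
    return exp(-rate) * rate**k / factorial(k)

def outcome_probabilities(rate_a, rate_b, max_goals=15):
    """P(A wins), P(draw), P(B wins) for independent Poisson goal counts, truncated at max_goals."""
    win = draw = loss = 0.0
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = poisson_pmf(a, rate_a) * poisson_pmf(b, rate_b)
            if a > b:
                win += p
            elif a == b:
                draw += p   # the diagonal, ie the 45 degrees line in the contour plots
            else:
                loss += p
    return win, draw, loss

# Hypothetical scoring rates, not the model's estimates
win, draw, loss = outcome_probabilities(1.8, 0.9)
print(f"win {win:.2f}, draw {draw:.2f}, loss {loss:.2f}")
```

When the two rates are equal, the mass splits evenly on either side of the diagonal, which is the "bisected" situation described above.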

Finally, it probably now starts to make a bit more sense to think in terms of estimated chance of progressing through to the next stage of the competition. Assuming that the current form won't change from round 2 to round 3 (which again is not necessarily the most appropriate assumption), these are the estimated probabilities of progressing to the knockout stage.

As is obvious, these are quite different from the ones displayed here which were calculated based on "current form" before the World Cup was even played. More importantly, teams have now won/lost, which is of course accounted for in this analysis.

In Group B, the Netherlands are now big favourites to go through, with Chile slightly better off than Spain (but of course, things may change later today when these two meet). Italy are now hot favourites in Group D, where formerly most-likely-to-go-through Uruguay are left with a small chance of qualifying, after losing to low-rated Costa Rica. Portugal have also taken a big hit in terms of their chances of going through (that's Group G), after being heavily defeated by Germany, who are now clear favourites.

We'll keep updating the model and reporting (perhaps in more detail) on how well we've done in predicting the results for round 1 (I haven't had much time to do this, yet, but I will do in the next few days).

Friday 13 June 2014

The Oracle (2)

The World Cup is now under way, after an arguably fairly lacklustre performance by the host against a tough (if possibly a bit naive) Croatian team, still resulting in a 3-1 win for Brazil. I'll try and comment on our predictions for the first few games as they go along and the observed result is actually revealed.

So, our model predicted a very likely win for Brazil $-$ nearly 80% chance of this outcome and only around 5% chance of a Croatian win. In fact, if we look at the entire joint predictive distribution of the results (below), 1-0 and 2-0 wins were by far the most likely outcomes for the game $-$ incidentally, the latter coincided with the median values of the marginal distributions for the goals scored (see here).

The observed 3-1 was a bit further down the plot, meaning it was a less likely outcome $-$ but I think it's worth noting that the game was quite tight at 2-1 (quite a likely outcome) until the Croatian keeper's mistake in the very last minutes of the game. Interestingly, 1-1 (the result at the time Brazil got awarded a soft penalty) would have been even more likely.

Here're the analyses for tonight's games. The graphs show the contour plot of the bivariate joint distribution of goals scored by the two opponents (Mexico-Cameroon; Spain-Netherlands; Chile-Australia) on the left-hand side and the histogram (or technically a bar plot) showing how likely each of the possible results is (according to the predicted outcomes).

While probably more informative than the graphs we showed in our previous post, these are effectively in line with them. More comments later!

Tuesday 10 June 2014

The Oracle (1)

This is the first follow up to our previous (slightly technical and detailed) post on the World Cup prediction. First off, a quick and simplified recap on the model and then off with some prediction!

So: the idea is to use a collection of data to estimate a measure of "propensity" of a team $t$ to score when they play an opponent $s$. The estimation is done by considering a set of data on several types of games: in particular, we use the data on the last 6 World Cups and the last 4 years (including friendlies, qualifiers, continental finals and last year's Confederations Cup). We can use this "propensity" to predict the number of goals scored by any two teams in the next game they play against each other.

A crucial point is that the observed number of goals scored by $t$ and $s$ respectively will tend to be correlated (of course, this correlation may not be extremely large, but on average we can expect better teams to score more and concede fewer goals than lesser teams). We account for this correlation in our model by using structured ("random") team-specific effects.

Another important point is that how well a team will perform will reasonably depend (although not in a deterministic way!) on a measure of "current form" and, even more importantly, on the difference in this domain between the two teams. Because the World Cup is a relatively short and volatile event, in our prediction we start by estimating the 32 participants' strength using published bookmakers' odds data.

So, based on the long-run trend (estimated using our model and the available data, covering a relatively long time horizon) and the current level of form/strength (estimated from a combination of data from the bookies), we could predict the number of goals scored in all the games. In fact, here're our predicted probabilities of progressing to the knockout stage (the first stage is played in 8 groups, each made up of 4 teams, playing each other once; the top two teams go through).
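
For readers who want to see how simulated scores turn into probabilities of progressing, here is a minimal Python sketch of a four-team round robin. Everything in it is an assumption made for illustration: the scoring rates are invented, goals are drawn from independent Poisson distributions, and ties on points are broken at random rather than on goal difference.

```python
import random
from math import exp

def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method (fine for small rates)."""
    threshold, k, product = exp(-rate), 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def simulate_group(rates, n_sims=5000, seed=1):
    """Estimate each team's chance of finishing in the top two of a round robin.

    rates[(a, b)] is the expected number of goals scored by team a against team b.
    """
    rng = random.Random(seed)
    teams = sorted({team for pair in rates for team in pair})
    progressed = dict.fromkeys(teams, 0)
    for _ in range(n_sims):
        points = dict.fromkeys(teams, 0)
        for i, a in enumerate(teams):
            for b in teams[i + 1:]:
                goals_a = sample_poisson(rates[(a, b)], rng)
                goals_b = sample_poisson(rates[(b, a)], rng)
                if goals_a > goals_b:
                    points[a] += 3
                elif goals_a < goals_b:
                    points[b] += 3
                else:
                    points[a] += 1
                    points[b] += 1
        # Assumption: ties on points are broken at random, not on goal difference
        ranking = sorted(teams, key=lambda t: (points[t], rng.random()), reverse=True)
        for team in ranking[:2]:
            progressed[team] += 1
    return {team: count / n_sims for team, count in progressed.items()}

# Hypothetical attacking strengths (expected goals per game), not model estimates
strength = {"Team A": 2.0, "Team B": 1.4, "Team C": 1.2, "Team D": 0.7}
rates = {(a, b): strength[a] for a in strength for b in strength if a != b}
probs = simulate_group(rates)
for team, p in probs.items():
    print(team, round(p, 2))
```

Since exactly two teams go through in every simulated group, the four probabilities add up to 2.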

In many of the groups, there is a relatively clear-cut situation. Brazil (Group A), Spain (Group B), Argentina (Group C) and Germany (Group G) are clear favourites to progress (and indeed to win their own groups). France seem to have the edge in Group E, where Ecuador and Switzerland should battle for the second available spot; Portugal shouldn't have too many problems qualifying in Group G; the Netherlands should overcome Chile and grab the second available spot in Group B. Group D is the most uncertain, with Uruguay, England and Italy all being estimated to have a very similar chance of progressing through. I think these results are in line with other models (which we also have mentioned here).

But this prediction doesn't make an awful lot of sense, I think. In fact, the main assumption underlying this model is that the strength/form of each team remains constant throughout the three games of the first stage. I don't really believe this assumption $-$ it seems to me that how teams approach the next game will strongly depend on how they've done in their previous one(s).

So what we'll do is to make predictions a bit at a time. The first stage can be effectively divided into three rounds, each made up of 16 games; at the end of the first 16 games, all 32 teams will have played once; at the end of the second 16 games, all will have played twice; and of course by game 48 all will have played their three games. Our strategy is to:

1. Predict the results of the first 16 games only, based on the long-run trend (which I'll tediously repeat, is estimated using our model and the available data, covering a relatively long time horizon) and the current level of form/strength (estimated from a combination of data from the bookies).
2. As the games are actually played and the realised results observed, we will be able to assess the performance of the model. More importantly, we will be able to update the form/strength variables from the current value (before the games are played) to a new value, which accounts for how well the first game went. In doing this, we'll account for how likely each result is predicted to be; for example (see below), Brazil are predicted to beat Croatia quite easily in the opening game. So if indeed they win, this should increase Brazil's form/strength, but not by a massive amount, since this is just what we expected them to do. But if, on the other hand, Croatia won that game, then this should impact much more on both teams' updated level of form (which in turn will impact on how they approach the next game).
3. Once the form variable has been updated, we will re-run our model adding the 16 observed results and use them to predict the next 16 (round 2 of the first stage).
4. Repeat steps 2-3 for the third round of the first stage.

Here are the predictions for the first batch of games.

The graph shows the distribution of the predicted number of goals scored by each team in each of the games. The dots indicate the means, while the lines are the 95% interval estimations. In the graph, we also report the median number of goals (eg the mean number of goals that Brazil are expected to score against Croatia is slightly over 2, as represented by the black dot on the bottom part of the graph, but the median is exactly 2 goals, as suggested by the text on the bottom-right corner).

The further apart the two strings of text reporting the label for the teams and the predicted median number of goals, the wider the difference between them (and so the more clear-cut the game, in terms of the prediction). I think the most interesting game is England-Italy (you probably think I would say that regardless, but I mean: look at the graph!); while Italy seem to have a tiny advantage in terms of the mean, the two distributions are effectively identical $-$ a very, very close game, on paper. Ecuador-Switzerland is pretty close too, but the South Americans seem to have a slightly higher mean.

In a sense, more of the same, if you look at the probability distributions for the three outcomes (team 1 win, team 2 win or they draw); again England-Italy is basically a Uniform distribution, while more or less, most of the other games show more unbalanced situations, with one of the two teams having a substantially higher probability of winning than their opponent.

Of course, we won't take responsibility for any money you will lose by betting on the games, based on our results!

Monday 9 June 2014

At the Copa

This is the first post of a fairly regular series (at least I'll try to keep it this way!), dedicated to the impending FIFA World Cup (you may think I've gone all Barry Manilow, like Peter & co $-$ but I can reassure you I haven't).

Marta, Virgilio and I discussed this at the end of a barbecue at our place a few weeks back and have then done some work (in fact, Virgilio had prepared a different version of the model $-$ I'll mainly describe what Marta and I have worked on, but I'll also try and discuss Vir's analysis, at some point).

As I briefly mentioned in an earlier post, the main idea is to extend the model we developed to predict football results, for international games. Basically, we have considered a dataset of nearly 4,000 games, including the last six World Cups (from Italia 90 onwards) and the last 4 years (including friendlies, qualifiers, continental finals and Confederations Cup). The rationale for including these games is that:

We wanted to have some direct evidence on World Cup games; we restricted to the last 6 WCs which all have a very similar format (with a first round of games and then a knockout stage, starting with a round of 16).
We also wanted to have evidence on the most recent performances of the teams affiliated with FIFA.

I think it's worth noticing that many people (eg Goldman Sachs or 538) who have had a go at a similar exercise have decided to discard friendlies. But it seems to me that these may be indicative of some general trend and so we decided to keep them in our dataset.

The main outcome of our model is the number of goals scored by each team in each of the games. In particular, we want to account for the fact that the two counts are correlated. We model

$$y_i \sim \mbox{Poisson}(\theta_i)$$

where $y_i$ is the number of goals scored in game $i$. Note that we replicate the same game twice, looking at it from each of the two teams' perspective, respectively. So, for example, the first game in the dataset is Argentina-Cameroon, the opening game of Italia 90 and the first two rows in our dataset describe the game from the perspective of Argentina and Cameroon, respectively, like so:

Game	Team	Opponent	Goal	...
1	Argentina	Cameroon	0	...
1	Cameroon	Argentina	1	...
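
The "two rows per game" layout can be sketched in a few lines of Python. The function name `to_rows` is hypothetical and the column names simply mirror the table above.

```python
def to_rows(game_id, team_a, team_b, goals_a, goals_b):
    """Represent one game as two rows, one from each team's perspective."""
    return [
        {"Game": game_id, "Team": team_a, "Opponent": team_b, "Goal": goals_a},
        {"Game": game_id, "Team": team_b, "Opponent": team_a, "Goal": goals_b},
    ]

# Argentina-Cameroon, the opening game of Italia 90 (0-1)
rows = to_rows(1, "Argentina", "Cameroon", 0, 1)
for row in rows:
    print(row)
```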

The "propensity" of the team in row $i$ to score when they play against the team in row $i+1$, which we indicate as $\theta_i$, is modelled as

$$\log(\theta_i) = \mu + \beta_{\mbox{home}} \mbox{Home}_i + \beta_{\mbox{away}} \mbox{Away}_i + \beta_{\mbox{type}}\mbox{Type}_i + \beta_{\mbox{form}}\mbox{Form}_i + \gamma_{\mbox{team}_i} + \delta_{\mbox{oppt}_i}.$$

The linear predictor is made up of:

The overall intercept $\mu$;
A set of unstructured effects $\beta$, accounting for the effect of i) playing at home, away or on neutral ground (Home and Away are dummies, so that when they are both 0, then the game is played on neutral ground); ii) the type of the game, which could be "finals" (World Cup or Continental tournament, or the Confederations Cup), or "other" (including friendlies and qualifiers); iii) the difference in recent form between the two opponents $-$ this is computed by taking the mean number of points obtained in the last two games played by each team. These are rescaled in the interval [0;1] so that the difference is a continuous variable defined in [-1;1] (a value of 1 indicates that a team is much more "in form" $-$ and thus potentially stronger $-$ than their opponent for that game);
A set of structured effects $\gamma_t$ and $\delta_t$ $-$ these are team-specific random effects, modelled as exchangeable. Effectively, they can be interpreted as "attack" (for the team) and "defence" (for the opponent) strength.
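
As a concrete (and entirely made-up) numerical example of how the pieces of the linear predictor combine, consider the Python sketch below. The coefficient values, the coding of the game type as a 0/1 dummy and the rescaling of the form score by dividing the mean points by 3 are all assumptions for illustration; none of them are fitted values from the model.

```python
from math import exp

def form_score(points_last_two):
    """Mean points over the last two games, rescaled to [0, 1] by dividing by 3 (assumed rescaling)."""
    return sum(points_last_two) / len(points_last_two) / 3

def expected_goals(coef, home, away, finals, form_diff, attack, defence):
    """theta = exp(linear predictor) for one team in one game."""
    eta = (coef["mu"]
           + coef["home"] * home
           + coef["away"] * away
           + coef["type"] * finals      # assumed coding: 1 for "finals", 0 for "other"
           + coef["form"] * form_diff
           + attack                     # gamma: team-specific "attack" effect
           + defence)                   # delta: opponent-specific "defence" effect
    return exp(eta)

# All numbers below are invented for illustration; they are not fitted values
coef = {"mu": 0.1, "home": 0.3, "away": -0.2, "type": -0.1, "form": 0.4}
form_diff = form_score([3, 3]) - form_score([1, 0])   # team in better form than opponent
theta = expected_goals(coef, home=0, away=0, finals=1, form_diff=form_diff,
                       attack=0.25, defence=-0.05)
print(f"form difference = {form_diff:.3f}, expected goals = {theta:.2f}")
```

Because the model is on the log scale, each term acts multiplicatively on the expected number of goals.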

We fit a Bayesian model (using INLA to speed the computation up $-$ it runs quite smoothly and quickly; about 30 seconds on a medium range computer).

Then, for each of the future games (ie those to be played starting this week) we can simulate from the posterior distributions of the linear predictors $\theta_i$. When suitably rescaled, these can be used to simulate the number of goals scored in the next games by the two teams playing.

In particular, given evidence that at the World Cups usually fewer goals are scored, we have inflated the chance of seeing a 0, so that the prediction is made as
$$y^{\mbox{new}}_i \sim \pi\,\delta_0 + (1-\pi)\,\mbox{Poisson}(\theta^{\mbox{new}}_i)$$
where $\delta_0$ is a point mass at 0 and $\pi$ is given an informative distribution based on the assumption that the chance of excess 0s in the observed number of goals at the World Cup is centered around 0.035 with a standard deviation of 0.02 $-$ in actual fact, we've tried a few alternatives, but this assumption does not affect the estimation/prediction massively.
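
To get a feel for how much a small $\pi$ matters, the Python sketch below reads the zero inflation as a mixture of a point mass at 0 (with weight $\pi$) and a Poisson distribution, and compares it with the plain Poisson. The scoring rate of 1.5 is hypothetical and $\pi$ is fixed at 0.035 instead of being given a distribution, so this is only an illustration.

```python
from math import exp, factorial

def zip_pmf(k, rate, pi):
    """P(y = k) under a zero-inflated Poisson: extra mass pi at 0, Poisson(rate) otherwise."""
    poisson = exp(-rate) * rate**k / factorial(k)
    return pi * (k == 0) + (1 - pi) * poisson

rate, pi = 1.5, 0.035   # rate is hypothetical; 0.035 is the centre quoted in the text
p_zero_poisson = exp(-rate)
p_zero_zip = zip_pmf(0, rate, pi)
mean_zip = sum(k * zip_pmf(k, rate, pi) for k in range(40))
print(f"P(0 goals): Poisson {p_zero_poisson:.3f}, zero-inflated {p_zero_zip:.3f}")
print(f"expected goals: Poisson {rate:.3f}, zero-inflated {mean_zip:.3f}")
```

With values this small, the chance of a 0 goes up by only a few percentage points and the expected number of goals shrinks by the factor $(1-\pi)$, which fits with the remark that the assumption does not affect the predictions massively.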

In addition, for the yet-to-be-played games, the value of the "recent form" variable (which sort of determines the difference in strength between the two teams playing in a game $-$ we indicate this as $\omega_t$ and $\omega_s$ for teams $t$ and $s$, respectively) is based on an evidence synthesis of the available odds for each of the teams, rather than on the actual past 2 games. It is easy to find data on lots of bookies offering odds for each of the 32 teams involved in the WC $-$ we've based our evidence synthesis on the 20 values found here.

I think there is a good reason for doing this: the last two games observed in the dataset are friendly games played in preparation for the finals. In those games, teams tend to really be training and do not give 100%. On the other hand, the valuations of the bookmakers should be a more reliable indication of the actual relative strength of the teams involved at the moment.

Thus, we decided to use those values (starting from the odds, we built a simple log-Normal model and then rescaled the team-specific effects to the scale [0;1] to indicate the "recent form" of the teams). Under this sub-model, Brazil has a score of recent form of 0.97, closely followed by Argentina with a score of 0.96. The weakest team are Costa Rica, with a score of 0.044. If you're into graphical representations for models, here's one for you.

Rather than predicting all the way to the final based on the model and data available right now, we decided to take it step by step. I think there is a very good argument to do so. In fact, it is quite likely that recent form will be modified by the games that will be played in the course of the first round.

For example, group D (including Uruguay, Costa Rica and my personal derby of England and Italy) is arguably the closest with 3 relatively strong teams (538 seem to think so too). So, if Italy beat England on Saturday this surely will swing the odds in their favour, thus probably modifying quite massively the behaviour of the other teams in the next games (eg if England lost on Saturday, then they'll have to win their next game against Uruguay, while Italy may be happy to get a draw against Uruguay in their final game of the round, etc...).

So, in order to make more sense of the model, we'll only predict batches of games; before any game is played and based on the current model and data, we'll predict the first 16 games to be played $-$ that's when all the finalists will have played their first game. Then we will update the recent form variable based on the observed results and re-run the model to predict the next batch of 16 games. And then we'll repeat this step again and predict the final batch of 16 games for the group stage.

We'll take it from there and see how to carry the model forward into the knockout stage. I'll post some predictions, results, graphs in the next couple of days.

Sunday 8 June 2014

Enjoy the silence

I've been quite silent on the blog in the past few weeks $-$ a combination of exam-marking, conference-organisation and a few other (some more, some less interesting) things...

As for Bayes Pharma, we're nearly there $-$ the conference is this week Wednesday to Friday. I've nearly got everything ready $-$ at least all I can think of, that is. We've arranged the social event (I'll post pictures, especially if the weather stays good). The finalised programme with talks & titles is here. I'll post on how the conference progresses in the next few days.

On a totally different note, Marta and I have spent some time working on extending our football model to do some prediction for the impending World Cup. Lots of people have had a go (including some people at Goldman Sachs and others on the new Significance website $-$ eg here and here).

Unlike many, we won't try to predict the overall winner straight away $-$ I think that there are very good arguments for not doing that: the group stage may differ significantly from the second stage; especially in a competition such as the WC, which is played over one month, the impact of current (or very, very recent) form can be dramatic.

So, what we'll do is:

Use past data (on the last 4 years of international games + the last 6 World Cups, in total about 3500 games) to fit an extended version of our model (which accounts for correlated "team" and "opponent" structured effects);
Use data on "current form" (based on the official bookmakers' odds) to predict the first round of games. The prediction can be assessed against the observed results which will become available in the first few games of the competition;
Update the variable of "current form" based on the observed results (so if a team unexpectedly win their first game their "form" is bumped up and they should be predicted to do better than they would have been with the previous data only). These new data can be used to predict the second batch of games (and so on for the third batch of data).

Once the group stage is over, we'll carry the predictions forward.

I'll post more details on the model and the results of the prediction exercise (including how well we're doing) in the next few days.
