Monday 7 July 2014

The Oracle (8) - let's go all the way!

This is (or may be) the final post in the series dedicated to predicting the World Cup results $-$ I'll try to write another one to wrap things up and summarise a few comments, but that will probably come a bit later on. Finally, we've decided to use our model, which so far has been applied incrementally, ie stage by stage, to predict the results of both the semifinals and the final.

The first part is relatively straightforward; the quarter finals have been played and we know the results. Thus, we can repeat the procedure (which we described here) and i) update the data with the observed results; ii) update the "current form" variable and the offset; iii) re-run the model to estimate each team's propensity to score; iv) predict the result of the unobserved games $-$ in this case the two semifinals (Brazil-Germany and Argentina-Netherlands).

However, to give the model a nice twist, I thought we should include an extra piece of information that is available right now, ie the fact that Brazil will, for certain, play their semifinal without their suspended captain Thiago Silva and their injured "star player" Neymar (who will also miss the final, given the seriousness of his injury). Thus, we ran the model after modifying the offset variable (see a more detailed description here) for Brazil, to slightly decrease their "short-term" quality. [NB: if this were a "serious" model, we would probably try to embed these changes in a more formal way, rather than as "ad hoc" modifications to the general set up. Nevertheless, I believe that the possibility of dealing with additional information, possibly in the form of subjective/expert knowledge, is actually a strength of the modelling framework. Of course, you could say that the selection of the offset distribution is arbitrary and that other choices were possible $-$ that's true, and a "serious" model would certainly require more extensive sensitivity analysis at this stage!]

Using this formulation of the model, we get the following results, in terms of the overall probability of going through to the final (ie accounting for potential draws in the 90 minutes and then extra times and possibly penalties, as discussed here):

Team 1     Team 2       P(team 1 through)  P(team 2 through)
Brazil     Germany      0.605              0.395
Argentina  Netherlands  0.510              0.490

So, the second semifinal is predicted to be much tighter (nearly 50:50), while Brazil are still favourites to reach the final, according to the model prediction.
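
For readers who want to see how an "overall probability of going through" can be put together, here is a minimal sketch in Python. It is not the model used in this series: it assumes that the two teams' goals are independent Poisson counts, that extra time behaves like a third of a regular match, and that a penalty shoot-out is a coin flip. The function names, the scoring rates and the parameters `extra_time_fraction` and `penalty_win_prob` are all hypothetical and chosen only for illustration.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals for a Poisson scoring rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def outcome_probs(rate_a, rate_b, max_goals=25):
    """Return (P(A wins), P(draw), P(B wins)) for independent Poisson goals.

    The sums are truncated at max_goals, which is negligible for
    football-like scoring rates.
    """
    win = draw = loss = 0.0
    for i in range(max_goals + 1):
        p_i = poisson_pmf(i, rate_a)
        for j in range(max_goals + 1):
            p = p_i * poisson_pmf(j, rate_b)
            if i > j:
                win += p
            elif i == j:
                draw += p
            else:
                loss += p
    return win, draw, loss


def prob_advance(rate_a, rate_b, extra_time_fraction=1 / 3, penalty_win_prob=0.5):
    """Probability that team A goes through in a knock-out game.

    Assumptions (illustrative only): extra time uses the 90-minute rates
    scaled by extra_time_fraction, and penalties are won by A with
    probability penalty_win_prob.
    """
    win_90, draw_90, _ = outcome_probs(rate_a, rate_b)
    win_et, draw_et, _ = outcome_probs(
        rate_a * extra_time_fraction, rate_b * extra_time_fraction
    )
    return win_90 + draw_90 * (win_et + draw_et * penalty_win_prob)


# Hypothetical scoring rates, not estimates from the model in this post.
print(round(prob_advance(1.6, 1.3), 3))
```

With equal scoring rates the function returns 0.5 by symmetry, and lowering one team's rate (which is roughly what a smaller offset does in a log-linear model) pushes its probability of going through below that of its opponent.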

As I said earlier, however, this time we've gone beyond the simple one-step prediction: we have used these results to re-run the model before the actual results of the semifinals are known, and thus to predict the overall outcome, ie who will win the World Cup.

Overall, our estimation gives the following probabilities of winning the championship (these may not sum to 1 because of rounding):

Brazil: 0.372
Germany: 0.174
Argentina: 0.245
Netherlands: 0.206
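
As a back-of-the-envelope reading of these figures (not an output of the model), note that a team can only win the cup by first reaching the final, so P(win the cup) = P(reach the final) x P(win the final | reach the final). Dividing the championship probabilities by the semifinal probabilities reported above gives the implied chance of winning the final once there. The short Python sketch below does just that; the inputs are rounded to three decimals, so the results are approximate, and the variable names are mine.

```python
# Probabilities of reaching the final and of winning the cup, as reported in this post.
reach_final = {"Brazil": 0.605, "Germany": 0.395, "Argentina": 0.510, "Netherlands": 0.490}
win_cup = {"Brazil": 0.372, "Germany": 0.174, "Argentina": 0.245, "Netherlands": 0.206}

# Implied P(win the final | reach the final) = P(win the cup) / P(reach the final).
implied_final_win = {team: win_cup[team] / reach_final[team] for team in reach_final}

for team, p in implied_final_win.items():
    print(f"{team}: {p:.2f}")
# Brazil: 0.61
# Germany: 0.44
# Argentina: 0.48
# Netherlands: 0.42

# The four championship probabilities sum to just under 1 because of rounding.
print(f"Total: {sum(win_cup.values()):.3f}")
# Total: 0.997
```

On these numbers, Brazil would still be favourites in the final against either possible opponent, even with the reduced offset.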

Of course, these probabilities encode extra uncertainty, because we're going one extra step forward into the future $-$ we don't know which of the potential futures will occur for the semifinals. Leaving the model aside, I think I would probably like the Netherlands to win $-$ if only for the fact that in that way, Italy would still be the 2nd most frequent World Cup winners, only one title behind Brazil, and one and two above Germany and Argentina, respectively.
