As promised, some consideration of our model's performance so far. I've produced the graph below, which for each of the first 16 games (ie the games it took for all 32 teams to play once) shows the predictive distribution of the results. The blue line (and the red label) indicate the result that was actually observed.
I think that the first thing to mention is that, in comparison to simple predictions that are usually given as "the most likely result", we should acknowledge quite a large uncertainty in the predictive distributions. Of course, we know well that our model is not perfect and part of it may be due to its inability to fully capture the stochastic nature of the phenomenon. But I think much of this uncertainty is real $-$ after all, often games really are decided by episodes and luck.
For example, we would have done much better in terms of predicting the result for Switzerland-Ecuador if the final score had been 1-1, which it was until the very last minute of added time. Even an Ecuador win by 2-1 would have been slightly more likely according to our model. But seconds before Switzerland's winner, Ecuador had a massive chance to score themselves $-$ had they taken it, our model would have looked a bit better...
On the plus side, I'd also say that in most cases the actual result was not "extreme" in the predictive distribution. In other words, the model often predicted a result that was reasonably in line with the observed truth.
Of course, in some cases the model didn't really have a clue, considering the actual result. For example, we couldn't predict Spain's thrashing at the hands of the Dutch (the observed 1-5 was way out on the right tail of the predictive distribution). To a lesser extent, the model didn't really see Costa Rica's victory against Uruguay coming, or the size of Germany's win against Portugal. But I believe these were (more or less) very surprising results $-$ I don't think many people would have bet good money on them!
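One way to make "extreme" a bit more concrete is to ask what share of the simulated results were at least as lopsided as the observed one. The snippet below is only a sketch of that idea, not the code behind the graph: it assumes that the predictive distribution is available as a list of simulated `(goals_a, goals_b)` pairs and that we summarise each result by the goal difference (second team minus first team). The function name `tail_shares` and the simulated scores are made up for illustration.

```python
def tail_shares(simulated_scores, observed_score):
    """Return (lower, upper): the shares of simulated goal differences that are
    at most / at least the observed one. A small value on either side suggests
    that the observed result sits in a tail of the predictive distribution.

    Assumption: scores are (goals_a, goals_b) pairs and the goal difference is
    goals_b - goals_a, so a big win for the second team lands on the right.
    """
    if not simulated_scores:
        raise ValueError("need at least one simulated score")
    observed_diff = observed_score[1] - observed_score[0]
    diffs = [goals_b - goals_a for goals_a, goals_b in simulated_scores]
    n = len(diffs)
    lower = sum(d <= observed_diff for d in diffs) / n
    upper = sum(d >= observed_diff for d in diffs) / n
    return lower, upper


# Made-up simulations, NOT output from the actual model
simulated = [(2, 0), (1, 0), (1, 1), (2, 1), (0, 1),
             (1, 0), (3, 1), (1, 2), (0, 0), (1, 5)]
lower, upper = tail_shares(simulated, observed_score=(1, 5))
print(lower, upper)
```

With real simulations one would of course use many more draws; the point is simply that a very small `upper` (or `lower`) share flags a result the model considered unlikely.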
Maybe, as far as our model is concerned, the "current form" variable (we talked about this here) is not perfectly calibrated. Perhaps we start with too high a value for some of the teams (eg Spain or Uruguay, or even Brazil, to some degree), when in reality they were not so strong (or at least not as much stronger than the rest). For Spain especially, this is probably not too surprising: after all, in the last four World Cups the reigning champions have been eliminated straight after the first round on three occasions (France in 2002, Italy in 2010 and now Spain $-$ Brazil didn't go past the quarter finals in 2006, which isn't great either).
If one were less demanding and measured the model performance on how well it did in predicting wins/losses (see the last graph here) instead of the actual score, then I think the model would have done quite OK. The outcome of 9 out of 16 games was correctly predicted; in 4 cases, the outcome was the opposite of what we predicted (ie we said A would win and actually B won $-$ but one of these is the infamous Switzerland-Ecuador, one is the unbelievable Spain defeat to the Netherlands and another was a relatively close game, Ghana-USA). Two were really close games (England-Italy and Iran-Nigeria) that could have really gone either way, and our uncertainty was clearly reflected in nearly Uniform distributions. In one case (Russia-South Korea), the model had predicted a relatively clear win, while a draw was observed.
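For what it's worth, going from a predictive distribution over scores to a win/draw/loss prediction is just a matter of adding up the simulations that fall in each category. Here is a rough sketch of that step, again assuming the predictive distribution comes as a list of simulated `(goals_a, goals_b)` pairs; the function names and the simulated scores are hypothetical and are not taken from the model's actual code.

```python
from collections import Counter


def outcome_probabilities(simulated_scores):
    """Collapse simulated (goals_a, goals_b) pairs into the shares of
    simulations in which A wins, the game is drawn, or B wins."""
    if not simulated_scores:
        raise ValueError("need at least one simulated score")
    counts = Counter()
    for goals_a, goals_b in simulated_scores:
        if goals_a > goals_b:
            counts["A wins"] += 1
        elif goals_a < goals_b:
            counts["B wins"] += 1
        else:
            counts["draw"] += 1
    total = len(simulated_scores)
    return {k: counts[k] / total for k in ("A wins", "draw", "B wins")}


def most_likely_outcome(simulated_scores):
    """Return the outcome with the largest simulated share."""
    probs = outcome_probabilities(simulated_scores)
    return max(probs, key=probs.get)


# Made-up simulations, NOT output from the actual model
simulated = [(1, 0), (2, 1), (1, 1), (0, 1), (2, 0), (0, 0), (1, 0), (1, 2)]
probs = outcome_probabilities(simulated)
print(probs, most_likely_outcome(simulated))
```

When the three shares come out nearly equal (as in the England-Italy and Iran-Nigeria cases), picking the "most likely" outcome says very little, so it makes sense to look at the shares themselves rather than just the top one.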