Thinking ahead to the last three rounds of World Cup play about to begin on Friday, I was reminded of a former colleague who used to say that “Without data, you’re just another guy with an opinion.” In fact, he loved the quote so much that he put it in his email signature. Talk about faith in data!
You’d think that a website devoted to understanding the beautiful game with numbers would embrace the sentiment. You’d be wrong - at least when it comes to saying something new or useful about who will advance in the tournament beyond the group stage and eventually win this thing.
Here’s why: it is difficult to use numbers (statistics) to make meaningful predictions at this late stage of the tournament. Part of it has to do with the nature of the data themselves. By the time you get to the quarterfinals in a tournament of 32 teams that emerged from years of qualifying, there is very little variation left among the teams still competing. They're all great teams compared to the almost 200 that didn't make it: they are obviously in very good form, usually have World Cup experience, and tend to come from big and rich countries. So if we apply any kind of prediction model to these teams, they just aren't different enough from one another for us to say anything with certainty about who will win.
Take the Soccernomics model, for example, which tells us that wealth, population, experience and home advantage matter. At this point in the tournament, we’re talking about countries with a very reduced range on these major factors (when compared to all countries or even all countries in this year's World Cup). And from a statistical vantage point, you can explain outcomes that vary among teams (wins, points, goals, etc.) only with things that vary themselves. Or, to use statistics-talk, you can’t explain a variable with a constant (by definition something that doesn't vary). Practically speaking, this means that using the kinds of variables that are typically used to make predictions about the quarterfinals or who will eventually win produces the obvious set of countries like Brazil, Germany, or the Netherlands.
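To make the "you can't explain a variable with a constant" point concrete, here is a minimal sketch in Python. The numbers are made up purely for illustration (they are not real team data), and the function name ols_slope is my own. It computes an ordinary least-squares slope and shows that the calculation has nothing to work with once the predictor stops varying.

```python
from statistics import fmean


def ols_slope(x, y):
    """Least-squares slope of y on x, or None when x does not vary."""
    x_bar, y_bar = fmean(x), fmean(y)
    # Sum of squared deviations of the predictor: the slope's denominator.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        # A constant predictor cannot explain any variation in y.
        return None
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical, made-up numbers for illustration only.
points = [1, 3, 5, 7]                # outcome that varies across teams
varying_strength = [1, 2, 3, 4]      # predictor that varies across teams
constant_strength = [4, 4, 4, 4]     # predictor that is the same for everyone

slope_varying = ols_slope(varying_strength, points)
slope_constant = ols_slope(constant_strength, points)

print(slope_varying)   # a usable slope
print(slope_constant)  # None: nothing to estimate
```

The quarterfinalists are not literally identical, of course, but the closer their wealth, population, and experience get to one another, the closer we are to the second case.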
But what about Paraguay or Ghana, you may ask. Good question. They obviously are smaller and poorer than the other major soccer powers that typically make it to the quarterfinals and beyond, which is exactly why they are long shots – something we already know. Aside from this, they are not all that different in the scheme of things. For example, Ghana's and Uruguay's FIFA rankings of 30 and 32 before the tournament put them in roughly the top 15% of all countries ranked on the FIFA index (there are 202 countries in all). They are hardly Papua New Guinea (with apologies to Papua New Guinea).
So when we’re faced with very little variation (in the grand scheme of things) on the factors we think matter to international soccer success, we are left with making obvious predictions based on things like rankings. And these predictions will vary ever so slightly, depending on the assumptions we make going into the statistical analysis. Now, I wouldn’t label well-founded assumptions mere “opinions”, but it is important to remember that what comes out of the model is affected by what you put in.
Let me give you a couple of examples. In an earlier post, I made predictions about who will make it to the quarterfinals, based on FIFA ranking and past World Cup experience. We can use the same model to make predictions about who will make it to the semifinals. Using these semifinals predictions to make a forecast about the quarterfinal matches to be played on Friday and Saturday gives the following results:
According to these odds, Uruguay, Brazil, and Spain go through, and the Germany-Argentina match is anyone’s guess.
But what if we based the prediction model only on teams that made it to the second round in 1998, 2002, and 2006? This gives the following predictions:
The substantive predictions are pretty much the same, with the exception that Germany is now tipped to beat Argentina. But the odds have changed quite a bit. Now, Spain's are about three times Paraguay's, whereas before the difference was much smaller. And overall, the differences in odds among all the teams have increased slightly, so we seem to be more certain of our predictions of winners and losers.
Now, what if we simply use the pre-tournament FIFA rankings to predict semifinalists? This will give you the following odds, by team and match:
Turns out, this is pretty similar to the other predictions, with the Germany-Argentina match moving to the toss-up category again. So, depending on which model you trust, Germany will win, or we're not sure.
Guess what? These sets of predictions are similar to what forecasting guru Nate Silver of 538.com predicts and, quite frankly, to what most people would predict, judging by fan discussions. So at this point in the tournament, the numbers game becomes a contest of prediction models, like the models used by political scientists and economists to forecast presidential (and other kinds of) elections.
More importantly, these results are what most people without numbers would predict. It’s good to have data to support your opinion, I suppose, but at this point, your guess is as good as mine (and they’re likely to be the same).
So how can statistics be helpful aside from producing predictions the guy at the corner pub could make? It turns out that statistics do offer a clue as to why the predictions game is so much fun this late in the tournament. First, if you think about it, the odds we are talking about aren’t all that high in absolute terms: 35% (in the case of Brazil, Germany, or Argentina, depending on the model), or even Nate Silver’s 60-some % for Brazil and Spain. These predictions are a long way from 100%. So keep in mind that, at odds like 35%, you’re actually more likely to be wrong than right betting on these horses.
Second, the predictions tell us the average outcome we would expect, based on all kinds of assumptions about the data distribution (they are so-called point predictions), not what will happen in any specific match. And these forecasts come with a so-called confidence interval: a range constructed so that, were we to repeat the analysis on new data many times, it would contain the true number most of the time (usually, we use 95%).
For World Cup predictions, these confidence intervals are large – and therefore, our predictions are uncertain. For example, the prediction that Brazil will reach the semis (based on FIFA rankings alone) is .304. But the 95% confidence interval for this particular prediction ranges from as low as .096 to as high as .511. Loosely speaking, this means that we are 95% certain Brazil’s odds of reaching the semis fall somewhere in this range. But is this really saying much? (The same is true for virtually any predictions I have been able to generate.)
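To get a feel for just how wide that interval is, here is a rough back-of-the-envelope sketch in Python using only the three numbers quoted above. One assumption to flag loudly: I treat the interval as a symmetric, normal-approximation interval (estimate plus or minus about 1.96 standard errors), which fits the quoted bounds well but is not necessarily how the interval was produced. The variable names are illustrative.

```python
from statistics import NormalDist

# Numbers quoted in the text for Brazil (FIFA-rankings-only model).
point, lower, upper = 0.304, 0.096, 0.511

# Distance from the middle of the interval to either end.
half_width = (upper - lower) / 2

# ASSUMPTION: a symmetric normal-approximation interval,
# i.e. estimate +/- z * standard error with z for 95% coverage.
z = NormalDist().inv_cdf(0.975)
implied_se = half_width / z

# How wide the whole interval is compared with the estimate itself.
relative_width = (upper - lower) / point

print(f"half-width:     {half_width:.4f}")
print(f"implied SE:     {implied_se:.4f}")
print(f"relative width: {relative_width:.2f}")
```

Under that assumption the full interval is wider than the point prediction itself, which is another way of saying the estimate does not pin Brazil's chances down very much.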
So, keep in mind what these numbers really can and cannot do for you. (A great read on this, especially as it relates to soccer, is Zach Slaton’s “Statistics Are Just Numbers” on his A Beautiful Numbers Game blog.) For the purposes of World Cup predictions this late in the tournament, I’d paraphrase my colleague’s take on data: “Sometimes even with data, you’re just another guy with an opinion.” Or: there’s nothing like watching teams beat the odds. And my money is still on Brazil, and my heart is rooting for Germany.













