Tuesday, November 6, 2012

Back To Basics: Normal Football and Skewed Distributions

On October 31, 2002, Madagascan side AS Adema cruised to an easy win against archrivals SOE (Stade Olympique l'Emyrne) of Antananarivo. When all was said and done at Adema's home ground in Toamasina that day, Adema had won 149-0. No, that's not a typo. Adema managed one goal every 36 seconds of the match, despite having close to zero percent possession of the ball. As the BBC reported,
“… it was not their outstanding skill that led to the outlandish scoreline. It was because Olympique deliberately scored one own goal after another in protest over a refereeing decision. Radio Madagascar reported that Olympique began banging the ball into their own net after their coach Ratsimandresy Ratsarazaka lost his temper with the referee. Fans told the station that after the row between coach and official at Adema’s home ground in the port of Toamasina, the visitors directed each kick-off directly towards their own goal. Adema’s players simply stood around looking bemused as their opponents self-destructed.”
Even though there were 22 players on the pitch that day, along with a ball and a referee, the Adema-Antananarivo match had little to do with football as we know it. By repeatedly and intentionally scoring as many own goals as they could manage, Antananarivo had transformed the game from a football match – a contest against the other side – into a contest of their own skill against the clock (giving the referee the one-fingered salute in the process). Without opponents, teams would score as much as 90 minutes would allow; as the Adema-Antananarivo match revealed, we can assume that to be about 150 times per match, give or take, at a relatively even clip (quite a feat, actually, given the time it takes to fish the ball out of the net and return it to the center spot for a kick-off).

The fact that the football played in the Madagascan match was barely recognizable as such makes it interesting for football analytics. It tells us things about football and football numbers that are worth keeping in mind as (or before) we dig more deeply into more complex data.

For right now, let's talk about two: sample size and distributions.

Sample size

The Madagascan match was but one match. It's the most extreme outlier we found in the ocean of football data – the most unusual match and score line – and by some distance. So it may seem obvious that generalizing about the game from too small a sample – in this case one match – is extremely hazardous. But it's a good lesson to heed and a lesson too easy to forget.

Data are plural. I know we all know this, but it matters. The power of statistics is the power of (large) numbers. That is, more data are better than fewer data, and fewer data require special care and attention.

After all, if we tried to draw conclusions about football from this one match, we would assume that teams win by having the other side do all the scoring, and that they best the other side by about 150 goals per game. And yet, pundits rarely seem to understand what mathematicians call the "law of large numbers"; instead, they often like to draw grand conclusions about what is "normal" (for a team or a player or a referee) by extrapolating from a single match or just a handful of occasions. From what we can tell, coaches and analysts aren't completely free from the temptation to over-interpret a small number of cases either.

The good news is that even the dimmest observer would have known that the Adema-Antananarivo match was unusual.

This leads to the next logical question, though: how do we know what is and isn't "unusual"? The answer is fairly simple: to know what qualifies as "normal football" and "outlier football", we need to know how common different values of football variables are in nature, out on the pitch. That is, we need to understand the distribution of our data, and for that we need more data, not fewer.


Once we look at lots of matches rather than a single one played in Madagascar or Manchester, we notice pretty quickly that there are some general patterns to what happens in football. And curiously, some of the stuff that matters most in football is also the least "normal", statistically speaking. To appreciate this, let's think of goal production. First, teams need to maneuver the ball into the vicinity of the opponents' goal; second, they need to shoot; third, they need to find the target; and finally, they need to convert accurate shots into goals.

Of course, the real world on the pitch is more complex than this (for example, the quality, velocity, distance, and angles of shots matter a great deal, too, as does the other side). But to see what a truly tall order it is to score and how much football data morph from being normally distributed to being noticeably skewed, take a look at the distribution of overall shots, accurate shots, and goals created in Premier League matches by teams over the course of three seasons.  The graphs below are histograms - they show how common (in percentage terms; the y-axis) different numbers of shots and goals were for individual teams (the x-axis). To help you compare, we have kept both axes on the same scale.

Football is clearly a kind of organized profligacy: the odds of successfully generating goals shrink noticeably at each stage of the production process. At the top of the graph, the distribution of shots looks like a bell-shaped distribution: in most matches, teams take somewhere between 10 and 15 shots, so the average is reasonably high; but there are some long tails, with quite a few matches when teams took fewer than 10 or more than 20 shots overall. The distribution looks wide and relatively flat. Now compare this to the distribution of shots on target: the average number drops noticeably, and the distribution of accurate shots is much more condensed, with most of its mass piled up toward the left. The distribution is bounded at 0; most teams managed around 5 shots on target, though to the right we see a few that produced more than 10. As we move through the goal production process from shot to goal, the distribution of positive outcomes becomes ever more skewed and the odds move noticeably toward zero.
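Histograms like these are straightforward to compute from raw per-match counts. Here is a minimal sketch in Python – the shot counts are made up for illustration, not the actual data behind the graphs:

```python
from collections import Counter

# Hypothetical per-team, per-match shot counts (illustrative only --
# not the real Premier League sample discussed in the text).
shots_per_match = [12, 9, 15, 11, 13, 8, 20, 14, 10, 12, 16, 11]

counts = Counter(shots_per_match)
n = len(shots_per_match)

# Express each bin as a percentage of all observations -- this is
# the y-axis of the histograms described above.
histogram = {k: 100 * v / n for k, v in sorted(counts.items())}
for shots, pct in histogram.items():
    print(f"{shots:2d} shots: {pct:5.1f}%")
```

The same few lines work for shots on target or goals; only the input list changes, which is what makes comparing the three distributions on a common scale so easy.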

Finally, as we take a look at the distribution of goals in matches, the two most common categories are 1 and 0, followed by 2. This distribution looks nothing like the neat bell shape – what statisticians call a "normal" or Gaussian distribution – that we saw for shots. Instead, it looks very much like the so-called Poisson distribution. Goals are rare and certainly not "normal" (in the statistical sense).
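You can check the Poisson resemblance yourself with a few lines of Python. Taking the 2,954 goals quoted below and spreading them over three seasons of 380 Premier League matches with two teams each (the 380-match season length is our assumption about the sample) gives a league average of roughly 1.3 goals per team per match:

```python
from math import exp, factorial

# Assumed sample: 2,954 goals over three 380-match seasons, two teams
# per match -- roughly 1.3 goals per team per match on average.
lam = 2954 / (3 * 380 * 2)

def poisson_pmf(k, lam):
    """Probability of exactly k goals under a Poisson(lam) model."""
    return exp(-lam) * lam ** k / factorial(k)

for k in range(5):
    print(f"P({k} goals) = {poisson_pmf(k, lam):.3f}")
```

With a mean of about 1.3, the Poisson model makes 1 the single most likely goal count, followed by 0 and then 2 – exactly the ordering of the most common categories in the goals histogram.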

When we take these individual match numbers of shots, accurate shots, and goals – of which there were 32,789, 10,396, and 2,954 across the three seasons we collected data for – and put them in relation to each other, it turns out that the odds of any one shot actually being on target were 32%, while the odds of an accurate shot finding the back of the net were similarly around 30% (28% to be exact). Plenty of teams shoot enough to score, but very few of them consistently score.
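The arithmetic behind those percentages is worth making explicit, because chaining the two stages together also gives the overall odds that any single shot ends up as a goal – a figure implied by the totals above:

```python
shots, on_target, goals = 32789, 10396, 2954

accuracy = on_target / shots    # share of all shots that hit the target
conversion = goals / on_target  # share of accurate shots that go in
shot_to_goal = goals / shots    # overall odds a shot becomes a goal

print(f"on target: {accuracy:.0%}")      # -> 32%
print(f"converted: {conversion:.0%}")    # -> 28%
print(f"shot to goal: {shot_to_goal:.0%}")  # -> 9%
```

Multiplying the two stage probabilities (0.32 × 0.28) recovers the overall rate: only about one shot in eleven ever becomes a goal.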

Clearly, "normal" football isn't always normally distributed. As a general rule, the more common an event on the pitch is, the more its distribution looks like a bell-shaped curve (graph the frequency of passes per match and you'll see what I mean). This has important implications: using some of the most common statistical techniques to deal with these data may be problematic, since standard (canned) versions of techniques like correlations and linear regression assume normally distributed data. The stuff we care about the most - goals - is the least "normal" of all the events above. But just as importantly, think about what the picture above tells us: there is enormous slippage from one stage of the goal production process to the next. Understanding why and how this slippage occurs is an important question for any budding analyst.
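One concrete way to see the normality assumption fail for goals: fit a normal curve to goal counts and look at what it implies. Treating goals as roughly Poisson with a mean of about 1.3 per team per match (an assumed figure for illustration, so the variance is also about 1.3), a normal distribution with that mean and variance assigns double-digit probability to a negative number of goals – an impossible outcome for a count variable:

```python
from math import erf, sqrt

# Assumed Poisson-like goal process: mean and variance both ~1.3.
mu = 1.3
sigma = sqrt(1.3)

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution with the given parameters."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Mass the fitted normal places below zero -- impossible for goals.
p_negative = normal_cdf(0, mu, sigma)
print(f"P(goals < 0) under a fitted normal: {p_negative:.1%}")
```

Roughly an eighth of the fitted curve sits below zero, which is one simple reason canned normal-theory tools can mislead when pointed at goal counts.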

But the most fundamental lesson is often the hardest: before you go looking for fancy ways of analyzing your complex dataset, understand the nature of your variables; having a feel for their shape and nature in a large enough sample of observations is critical to extracting useful information.

So before you go running off with high-end econometrics to torture your data till they confess, try to walk through them with care and see if they'll reveal themselves willingly.