How Do Goals Change the Flow of a Soccer Game?

The World Cup is around the corner, and I know everyone can hardly wait. Between the extortionate train tickets, pride matches featuring hardline Middle Eastern countries, and peace prizes for people who are just about to start a war, it’s impossible to deny that this has been the most controversial and chaotic World Cup since four years ago.

In honor of the upcoming circus, I’m diving face-first into soccer analytics. Given the world’s obsession with the sport, attempting to do something novel in the field (pun intended) is a bit of a fool’s errand, but fortunately I am not above being a fool.

I often ponder how teams change their behavior depending on the situation of the match. Pundits sometimes call this “game management,” which is a polite way of saying “don’t screw up your lead.” The conventional wisdom is that most matches start as a tense standoff, then open up when the first goal goes in. After that, tactical patterns depend on the scoreboard and the clock – if you’re down one with seventy minutes left, no need to panic yet, but with ten minutes left, time to start going for broke. The golden question at the end of this rainbow is when the leading team should pull the plug on the entertainment and “park the bus.” We want to map out these phases, but there’s a fundamental puzzle with this entire line of analysis.

In statistics, we call something a “confounding variable” when it affects both a cause and an effect you’re trying to study. It’s a third, underlying factor that makes it tricky to sort out exactly which is which. Take the USTA’s claim that tennis is “the world’s healthiest sport.” Exercise is obviously better than nothing, but is it tennis specifically that extends your life? Or is it just that most people who play tennis happen to already be healthy, active, and – crucially – rich enough to afford the best medical care? In this case, socioeconomic status is a confounding variable. This tension is where dozens of silly “correlation is not causation” stories (and Malcolm Gladwell chapters) spring from. Ice cream sales and shark attacks are highly correlated, but that doesn't mean mint chocolate chip is chum; it just means it’s hot outside.

What does this have to do with soccer? Our core question is how teams behave differently when they’re winning or losing — for instance, does a team that’s up by one goal generate more chances than a team that’s down by two? But there’s a big confounding variable here, which you’re probably starting to notice: being better than the other team. Better teams are more likely to be in the lead, because they are better. They’re also more likely to create more quality chances, also because they’re better. If the other team was so much better, they’d be the ones in the lead! So we can’t just look at the average production of teams who are leading versus teams who are trailing. To be boring about it, if we want to properly identify the effect of game situation on playing style, we have to control for the confounding variable of team strength.
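To see how badly a naive comparison can mislead, here is a toy simulation. Everything in it is made up for illustration (the strength gap, the slopes, the noise level), and the scoreline is given no causal effect on chance creation at all. The leading teams still look more productive until we compare like with like.

```python
import numpy as np

rng = np.random.default_rng(0)
n_matches = 20_000

# Hypothetical setup: each match has a "strength gap" between the two sides.
strength_gap = rng.normal(0.0, 0.5, size=n_matches)

# Stronger teams are more likely to be in the lead...
p_leading = 1.0 / (1.0 + np.exp(-2.0 * strength_gap))
leading = rng.random(n_matches) < p_leading

# ...and they also create more xG. Note that `leading` never enters this line:
# by construction, the scoreline has zero causal effect here.
xg_rate = np.exp(0.3 * strength_gap) + rng.normal(0.0, 0.1, size=n_matches)

# Naive comparison: leading teams vs. trailing teams.
naive_gap = xg_rate[leading].mean() - xg_rate[~leading].mean()

# Crude control: repeat the comparison among evenly matched teams only.
even = np.abs(strength_gap) < 0.05
controlled_gap = xg_rate[leading & even].mean() - xg_rate[~leading & even].mean()

print(f"naive gap: {naive_gap:.3f}, controlled gap: {controlled_gap:.3f}")
```

The naive gap is clearly positive, while the controlled gap hovers around zero. Restricting to evenly matched teams is the blunt version of controlling for strength; a model that estimates strength directly gets to use all of the data.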

We’ll start with a brief discussion of event data and expected goals, focusing on the power of shot-by-shot analysis. Ultimately, we’ll build a Bayesian hierarchical model, the industrial-strength way to untangle chance creation and quality from the underlying fact that some teams are just better than others. The results reveal some curious insights about how game flow shifts after each goal – and how, perhaps unsurprisingly, the best way to score a second goal is to have already scored the first one.

The data

Our data comes from the Statsbomb Open Data project, which contains minute-by-minute event data (passes, shots, fouls, and so on) for a handful of selected matches. In soccer, large summary datasets (match results, etc.) are easy to find, but raw event data is mostly stuck behind paywalls – just like every other form of good entertainment in this world – with a few providers charging massive licensing fees to people with money, such as national teams and top clubs and not me.

The generous folks at Statsbomb gave us a peek into the archives; the most complete set of matches they included is all of the big five leagues (England, Italy, Spain, Germany, and France) for the 2015-16 season. Given that the data’s ten years old, none of the below should be taken as team-specific tactical advice; every team has gone through a full squad rotation by now, not to mention seven different coaches. But the modeling approach still applies.

At this point, expected goals (xG) are sufficiently widespread that they no longer qualify as “advanced analytics”. A team gets 0.1 xG for a shot that had a 10% chance of going in (based on its type and location), 0.26 xG for a 26% chance, exactly 0.7835 xG for a penalty kick, and so on. Compared to the fluky nature of so-called “actual goals”, expected goals are a far more accurate reflection of who deserved to win. There’s an excellent book by Rory Smith (with that exact title!) on the history of data in soccer, tracing how xG went from a fringe innovation, to a popular tool, to something that TV pundits can discuss with the reasonable expectation that fans will understand.

To explore how match situation affects team tactics, a good starting point is xG distribution – and specifically, how easily it lies. Suppose Arsenal scores on Tottenham in the 10th minute, then takes their foot off the gas, and Tottenham spends the next 80 minutes taking increasingly hopeless shots from range. You might end up with a 1-0 win for Arsenal, but an xG margin of 1.1 to 0.4 in favor of Tottenham. Looking at those numbers in a vacuum, you’d say that Tottenham deserved to win but got unlucky. In reality, Arsenal’s xG was only so low because they already had the goal they needed. Recent research has tried to address this directly, adjusting xG totals with respect to the scoreline as the game progresses.

But adjusting the totals still doesn't capture differences in game flow. For that, you have to look at both the quantity and the quality of attacks. In other words, if all you know is that a team had 0.6 xG in the second half, you can’t say whether they desperately fired off 20 wild shots for 0.03 xG each, or methodically built three high-quality chances for 0.20 xG each. We need to separate the rate at which a team shoots from the value of each individual attempt.
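The arithmetic behind that example, as a small sketch. The function names are mine, and the second function assumes that shots are independent, which real shots are not.

```python
def total_xg(shots: int, xg_per_shot: float) -> float:
    """Total expected goals from a number of shots of equal value."""
    return shots * xg_per_shot


def p_at_least_one_goal(shots: int, xg_per_shot: float) -> float:
    """Chance of scoring at least once, ASSUMING independent shots."""
    return 1.0 - (1.0 - xg_per_shot) ** shots


# Two very different halves with the same xG total.
desperate = (20, 0.03)   # 20 wild shots
methodical = (3, 0.20)   # 3 high-quality chances

for label, (shots, value) in [("desperate", desperate), ("methodical", methodical)]:
    print(
        f"{label}: total xG = {total_xg(shots, value):.2f}, "
        f"P(at least one goal) = {p_at_least_one_goal(shots, value):.3f}"
    )
```

Both profiles add up to 0.6 xG, yet under the independence assumption they do not even give the same chance of scoring at least once. The total alone hides the shape of the attack.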

To explore how these numbers change, the first step is to establish some baselines. When we look at the distribution of shot quantities in our dataset, based on the number of non-penalty shots taken per team per game, the counts are well described by a negative binomial distribution with an average of 12.4 shots per 90 minutes. (For more discussion on the key statistical distributions at play in soccer analytics, check out the appendix.)

And when we look at the quality of those shots? We get a distribution that’s surprisingly close to lognormal – mostly low-value fluff, with a long, thin tail of high-quality chances. In our data, this distribution has a mean of 0.05 xG.

Now, we know what we’re looking for. If a team that’s trailing by one goal starts to shoot at a rate faster than 12.4 shots per 90 minutes, then they’re picking up the pace, but if those chances are worse than 0.05 xG on average, then that means they’re not necessarily more effective, just more desperate.
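For a feel of what these two baselines look like, here is a sketch that draws from both distributions. The two means come from the text; the dispersion and the log-scale standard deviation are placeholder values I picked for illustration, not the fitted ones.

```python
import numpy as np

rng = np.random.default_rng(42)

MEAN_SHOTS = 12.4         # from the text: shots per team per 90 minutes
MEAN_XG_PER_SHOT = 0.05   # from the text
DISPERSION = 10.0         # assumed, for illustration only
LOG_SD = 1.0              # assumed, for illustration only


def sample_shot_counts(mean, dispersion, size):
    """Negative binomial draws with a given mean and dispersion."""
    # NumPy parameterizes by (n, p), with mean n * (1 - p) / p.
    p = dispersion / (dispersion + mean)
    return rng.negative_binomial(dispersion, p, size=size)


def sample_shot_quality(mean, log_sd, size):
    """Lognormal draws with a given mean on the natural (xG) scale."""
    # A lognormal has mean exp(mu + sigma^2 / 2), so solve for mu.
    log_mean = np.log(mean) - 0.5 * log_sd**2
    return rng.lognormal(log_mean, log_sd, size=size)


shot_counts = sample_shot_counts(MEAN_SHOTS, DISPERSION, 100_000)
shot_quality = sample_shot_quality(MEAN_XG_PER_SHOT, LOG_SD, 100_000)

print(f"mean shots per 90: {shot_counts.mean():.2f}")
print(f"mean xG per shot: {shot_quality.mean():.4f}")
print(f"median xG per shot: {np.median(shot_quality):.4f}")
```

The median shot is worth noticeably less than the mean one, which is the long tail at work: a handful of big chances carry the average. A raw lognormal can also, very rarely, produce values above 1, which a real xG cannot; this sketch does not handle that.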

The model

We know that our observed data is driven by hidden underlying parameters, like game situation and team strengths. To map this out, we’ll apply a Bayesian hierarchical approach. At a high level, this involves defining the relationships between all the variables we care about, indicating the unknown parameters, and then running a massive guess-and-check against reality. If my underlying parameter is X, how likely is it that I’d see this exact data? What if it's Y? What if it's Z? Over thousands of samples, we end up with a distribution of the most likely actual values for the hidden states.
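Here is the guess-and-check idea in its smallest possible form: one hidden parameter, a grid of candidate values, and a handful of made-up shot counts. This is a grid approximation with a flat prior and a plain Poisson likelihood, chosen for brevity; it is not how Stan samples and it is not the model in this post.

```python
import numpy as np

# Made-up shot counts from eight matches (illustration only).
shots = np.array([9, 14, 11, 16, 12, 10, 13, 15])

# Candidate values for the hidden shot rate: "what if it's X? what if it's Y?"
candidates = np.linspace(5.0, 20.0, 301)

# Poisson log-likelihood of the data under each candidate
# (dropping the factorial term, which is the same for every candidate).
log_lik = shots.sum() * np.log(candidates) - len(shots) * candidates

# Flat prior: the posterior is just the normalized likelihood.
posterior = np.exp(log_lik - log_lik.max())
posterior /= posterior.sum()

post_mean = float((candidates * posterior).sum())
cdf = np.cumsum(posterior)
lower = candidates[np.searchsorted(cdf, 0.05)]
upper = candidates[np.searchsorted(cdf, 0.95)]

print(f"posterior mean rate: {post_mean:.2f}")
print(f"90% interval: {lower:.2f} to {upper:.2f}")
```

A grid works for one parameter and falls apart at a few dozen, which is where a proper sampler earns its keep.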

You could spend days (or in the case of my PhD, years) arguing the finer points of Bayesian inference. The short version is that while it's harder to build and fine-tune than a simpler model, the payoff is massive. We can flexibly define how variables interact, isolate the effect of each specific input, and incorporate reasonable prior information to guide the guessing. For the heavy lifting, we use Stan, a probabilistic programming language that interfaces directly with Python or R. You write out the full mathematical skeleton in a Stan file, hand it the data, and let it start guessing.

For this model, we are tracking the following parameters. Our analysis hinges on the final two: after controlling for overall baselines and team strength, how does the scoreboard change a team's behavior? Do trailing teams shoot more often, and if so, is the quality of those shots actually any good? We cap the situations at two goals in either direction, lumping every larger deficit or lead into “down 2 or more” and “up 2 or more,” as the sample sizes get too small to trust beyond that. (Full technical details on the model structure, priors, etc. can be found in the appendix.)

Baseline Values: The average shot quantity (negative binomial) and shot quality (lognormal) per game phase. This is the anchor point against which all other game situations are measured. Mathematically, it represents a tied game between mid-table teams.
Team Attacking Strength: How much each team boosts their own shot quantity and quality compared to the baseline, independently of game situation.
Team Defending Strength: How much each team suppresses their opponent’s quantity and quality, again independently of game situation.
Situation Effect on Quantity: Multipliers for four scorelines (down 2 or more, down 1, up 1, up 2 or more) to see how they affect the tied baseline rate.
Situation Effect on Quality: Multipliers for those same scorelines to see how they affect the tied baseline value.
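One way to picture how these pieces fit together is a log-linear sketch, where every effect is added on the log scale and therefore acts as a multiplier on the baseline. To be clear, this is my shorthand for the list above and an assumption about the functional form; the function name and all the effect values below are hypothetical.

```python
import math

BASELINE_SHOTS_PER_90 = 12.4   # from the text
BASELINE_XG_PER_SHOT = 0.05    # from the text


def expected_value(baseline, attack, opponent_defense, situation_effect):
    """Baseline scaled by team strength and scoreline (all on the log scale).

    attack:            attacking team's boost (0 = average team)
    opponent_defense:  how much the opponent suppresses chances (0 = average)
    situation_effect:  log-multiplier for the current scoreline (0 = tied)
    """
    return baseline * math.exp(attack - opponent_defense + situation_effect)


# Hypothetical log-scale effects for shot quantity, for illustration only.
situation_quantity = {
    "down_2plus": 0.10,
    "down_1": 0.10,
    "tied": 0.0,
    "up_1": 0.10,
    "up_2plus": 0.20,
}

# An average team against an average defense in a tied game: the baseline.
print(expected_value(BASELINE_SHOTS_PER_90, 0.0, 0.0, situation_quantity["tied"]))

# A strong attack (hypothetical +0.2) against a stingy defense (+0.1), up by one.
print(expected_value(BASELINE_SHOTS_PER_90, 0.2, 0.1, situation_quantity["up_1"]))
```

The same function covers shot quality by swapping in BASELINE_XG_PER_SHOT and a separate set of situation effects. The point of the structure is that team strength and scoreline get their own terms, so the scoreline effect is measured with strength held fixed.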

The results

The plot below shows the tactical shifts for each scoreline, relative to the reference case of a tied game, with point estimates and 90% credible intervals.

If you look exclusively at quantity (in blue), a trailing team looks just as dangerous as a team in the lead. Compared to a tied game, teams that are behind produce 12-13% more chances, the same as teams up by one. Teams that are up by two or more produce 24% more chances. From this alone, you might think desperation can be effective.
But when we look at quality (in orange), we see the hidden difference in mechanics. When teams fall behind, the quality of their chances is statistically indistinguishable from the baseline — i.e. they shoot more, but the shots aren’t better. Meanwhile, teams up by one produce chances that are 16% better on average, and for teams up by two or more, that spikes even further to 23%.
When we combine those two dials into the final panel at the bottom (total expected goals), the difference is dramatic. Compared to the tied baseline, teams that are trailing produce 15% more xG. But teams that are up one produce 31% more xG, due to the higher quality of their chances, and teams that are up two or more produce a staggering 52% more xG.
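As a rough sanity check, the total xG multiplier should land close to the quantity multiplier times the quality multiplier. The numbers below are the rounded figures quoted above (with 1.125 as my midpoint for "12-13%"); the model's own totals come from the full posterior, so an exact match is not expected.

```python
# Rounded multipliers relative to a tied game, taken from the text.
effects = {
    "up 1": {"quantity": 1.125, "quality": 1.16},
    "up 2 or more": {"quantity": 1.24, "quality": 1.23},
}

# Total xG scales with (how many shots) x (how good each shot is).
combined = {
    situation: values["quantity"] * values["quality"]
    for situation, values in effects.items()
}

for situation, multiplier in combined.items():
    print(f"{situation}: about {100 * (multiplier - 1):.0f}% more xG than a tied game")
```

The products come out within a point or so of the 31% and 52% reported in the bottom panel.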

As part of the model, we also had to fit team-specific offensive and defensive strengths. One of the cool bonuses of a joint Bayesian model is that your control variables are real results in their own right, so we can plot them too, instead of just throwing them out. (Recall that these numbers are from 2015-16, so please don’t use them to place bets this weekend.) The usual suspects top the rankings, while the wooden spoon across all five leagues goes to Frosinone, who have been yo-yoing between Serie A and Serie B ever since.

What does it all mean?

The quantity results back up one piece of conventional wisdom: a tied game is as cagey as it gets. The idea that a goal "opens up the game" isn’t a myth. Tied games have the slowest rate of chance creation by far; when the scoreline is unbalanced, both teams immediately start taking shots at a higher clip.

But the quality results reveal that those chances are not created equal. Teams that are trailing maintain the same average distribution of xG per shot as teams that are tied. Remember, these are elite European professionals — when they fall behind, they don’t completely lose their discipline and start blasting prayers from midfield. They still build up chances to a reasonable baseline level before going for goal. But because the trailing team is forced to push forward and take more risks, the leading team gets to sit back, absorb the volume, and exploit the empty space. When you’re up one, your opponent has to press, opening the door to disproportionately high-quality counterattacks. When you’re up two, that effect multiplies. Offense isn't necessarily the best defense, but defending a lead certainly creates the best offense.

My high school coach used to say that the hardest thing to do in soccer is to score a goal. But the math tells a slightly different story: the hardest thing to do in soccer is to score the first goal. Because once you’re winning, the second one is a lot easier.

Appendix

Code and data are available on Github. I'm always happy to discuss collaboration ideas; my contact information can be found under the CV tab on my website.

First, a brief review of the key distributions involved in soccer forecasting. The fundamental theorem of soccer (and life?) is that everything is a Poisson distribution. If you’re trying to forecast match odds, you could do a lot worse – and honestly not a lot better – than a “double Poisson” model using each team’s average goals. The canonical approach to soccer forecasting is the Dixon-Coles model, a classic two-namer like the Black-Scholes equations or the Duckworth-Lewis method. It amounts to a double Poisson with a few tweaks for accuracy: inflating the likelihood of a low-scoring draw (a not-so-crazy insight for two English researchers in the 1990s) and upweighting more recent results when determining team strengths.
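For the curious, a plain double Poisson fits in a few lines. This sketch treats the two teams' goal counts as independent Poissons and leaves out both Dixon-Coles tweaks; the goal averages in the example are made up.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals given an average of lam."""
    return lam**k * exp(-lam) / factorial(k)


def match_odds(home_lambda: float, away_lambda: float, max_goals: int = 10):
    """Home win / draw / away win probabilities under a double Poisson."""
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_lambda) * poisson_pmf(a, away_lambda)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


# Made-up averages: the home side expects 1.5 goals, the away side 1.1.
home, tie, away = match_odds(1.5, 1.1)
print(f"home {home:.3f}, draw {tie:.3f}, away {away:.3f}")
```

Scorelines above max_goals are ignored, so the three probabilities sum to a hair under 1; at typical goal averages the missing mass is negligible.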

Modern research often applies fancy machine learning to refine the estimate…of the lambda parameters, which are then plugged into a double Poisson model. The Royal Statistical Society’s Euro 2020 prediction contest was won by a cleverly refined double Poisson. Sometimes, you just can’t beat the classics.

So why do we use a negative binomial instead of a Poisson for our shot counts? The issue is overdispersion — shots don't truly arise independently, but often in clusters of two or three as the ball pings around the box. The Poisson distribution forces an equal mean and variance, but the negative binomial adds a dispersion parameter to allow for greater real-world variance. Specifically, by using the standard NB2 parameterization, we apply our situation multipliers directly to the expected mean while leaving the global dispersion constant. This allows the variance to scale quadratically as the volume increases. In plain English: it captures the reality that desperate, high-volume games get disproportionately more chaotic than slow, low-volume games.
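Under the NB2 parameterization, the variance is the mean plus the mean squared divided by the dispersion. A short sketch of what that implies when a situation multiplier scales the mean; the dispersion value of 10 is a placeholder, not the fitted one.

```python
def nb2_variance(mean: float, dispersion: float) -> float:
    """NB2 variance: mean + mean^2 / dispersion (a Poisson would stop at mean)."""
    return mean + mean**2 / dispersion


BASELINE_MEAN = 12.4   # from the text: shots per team per 90 minutes
DISPERSION = 10.0      # assumed, for illustration only
MULTIPLIER = 1.24      # from the text: teams up by two or more

base_var = nb2_variance(BASELINE_MEAN, DISPERSION)
scaled_var = nb2_variance(BASELINE_MEAN * MULTIPLIER, DISPERSION)

print(f"mean grows by {MULTIPLIER:.2f}x")
print(f"variance grows by {scaled_var / base_var:.2f}x")
```

The variance grows faster than the mean, which is the "disproportionately more chaotic" part in numbers.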

Second, some more details on the Bayesian model. The complete Stan file can be found at the Github link above. We use uninformative priors and light parameter bounds, which we can get away with since the dataset isn't too large, and fit a standard four chains with 500 burn-in and 1000 sampling iterations, which takes about three minutes. For simplicity, we assume that a few key functional relationships are universal across all matches. This includes the scaling factor between each team's attacking quantity and quality; the dispersion parameter of the negative binomial (as mentioned above); and the standard deviation of the log-normal quality distribution.

A quick note on omitted variables. The double-edged sword of Bayesian modeling is that you can always cram in more parameters, but every variable you add makes the whole contraption harder to precisely fit, or in technical terms, less identifiable. It’s like trying to level a table with twelve adjustable legs: tweak one, and you invariably throw three others out of balance. Slicing the data into finer buckets to find more nuanced differences necessarily means that each bucket has a smaller sample size.

However, every variable you exclude can also be interpreted as a baked-in assumption. For this model, two obvious missing pieces are home-field advantage and league-specific weights. By leaving them out, we are implicitly assuming that a 1-0 home lead in the Premier League triggers the same tactical shift as a 1-0 away lead in Serie A, which is almost certainly false. If you were building a general predictive model, like the one in this excellent paper, you’d need to add those legs back to the table, along with other variables for penalties, red cards, and so on. But this analysis is focused on match situation, and our core results appear robust enough that they would almost definitely hold up even after accounting for these omissions.
