With the Super Bowl around the corner, football is gripping the minds of millions. Should the Patriots collect another Lombardi trophy, it will stimulate further discussion about where the team ranks among the great teams in history. The internet is chock-full of “greatest teams of all time” lists, most of them based on little more than someone’s subjective opinion.
But one popular list from Nate Silver’s 538 project is based on objective math rather than subjective takes. This may tempt many readers to conclude that it must be a superior method of ranking teams. It is not. Although 538’s Elo system is a handy, intuitive tool, it produces ratings that are too flawed to compare historically great teams to one another with meaningful precision.
538 describes its Elo metric as evaluating “a team’s skill level at any given moment.” It fails to do that, for reasons I will explain below.
A disclaimer first: none of the following should be misinterpreted as an attack on 538 or Nate Silver. My attitude towards Silver and 538 lies somewhere between substantial respect and outright fandom. Silver has performed a great public service by popularizing the appreciation of statistics and by bringing statistical rigor to our understanding of American politics. He warrants particular respect for sticking to his guns in projecting Donald Trump’s 2016 election as unlikely yet still possible, when too many other forecasters were treating Hillary Clinton’s expected victory as a foregone conclusion.
First, some background. 538’s Elo system is loosely based on a similar rating system used in chess. If two chess players have the same rating, the system expects them to beat each other equally often, setting aside any games in which they draw. As long as they perform at an equal level, the equality of their ratings shouldn’t change. Thus, if either player beats the other, the winner gains the same number of points, and the loser drops by that same number. That way their ratings won’t ultimately change as long as they continue to play equally well and win with equal frequency.
When a higher-rated player plays a lower-rated player, however, the formula makes an adjustment. If lower-rated players win, that’s a bigger surprise, so their ratings will rise (and the higher-rated players’ will drop) by more points. If higher-rated players win, that’s more in line with expectations, so their ratings rise by fewer points. Consequently, the higher-rated player needs to beat the lower-rated player most of the time to maintain the ratings advantage. The bigger the gap between two players’ ratings, the more often the system expects the higher-rated player to win.
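The mechanics above can be sketched in a few lines of code. This is the standard chess-style Elo formula (the K-factor of 20 is an illustrative choice, not a claim about 538's exact parameters):

```python
# Minimal sketch of the classic Elo update rule.
# K controls how far ratings move after each game; 20 is an
# illustrative value, not necessarily what any given system uses.

def expected_score(rating_a, rating_b):
    """Probability that player A beats player B, per the Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, score_a, k=20):
    """Return new ratings; score_a is 1 (A wins), 0.5 (draw), 0 (A loses)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b

# Equal ratings: a win moves each player's rating by the same k/2 = 10 points.
print(update(1500, 1500, 1))  # -> (1510.0, 1490.0)
# An upset by a much lower-rated player moves both ratings by far more.
print(update(1400, 1600, 1))
```

Note that the update is zero-sum: whatever the winner gains, the loser surrenders, which is why equally matched players who trade wins hold steady ratings.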
538’s Elo system operates on a similar principle. Two teams of different ratings enter a game; Elo makes an implicit prediction of who will win and by how much, depending in part on who has home field advantage. If a team’s performance beats that expectation, its rating goes up. If it beats that expectation by a lot, its rating goes up by more.
It’s a quick-and-dirty system. It’s based essentially on point differentials and home-field adjustments. It doesn’t know about personnel changes, key injuries, or the extent to which a final game score might be deceptive. I’m not against quick-and-dirty systems; I like them. I would rather consult a system like this, relying on concepts people intuitively understand, than one that crunches unseen data in opaque ways. Years ago, I developed a comparably simple system for team evaluation in an effort to perform better in football pools (disclosure: it didn’t work very well). Unfortunately, the Elo system just isn’t that useful for doing the very thing it seeks to do – i.e., measure a team’s “skill level at any given moment.”
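To make the "point differentials and home-field adjustments" concrete, here is a sketch of a 538-style NFL Elo update. The parameters (K = 20, a roughly 65-point home-field bonus, and a log-scaled margin-of-victory multiplier) follow 538's published description of its method, but treat this as an approximation, not their exact code:

```python
import math

# Illustrative sketch of a 538-style NFL Elo update, using parameters
# 538 has described publicly (K = 20, ~65-point home-field bonus,
# log-scaled margin-of-victory multiplier). An approximation only.

K = 20
HOME_ADVANTAGE = 65  # Elo points credited to the home team pregame

def nfl_elo_update(home_elo, away_elo, home_points, away_points):
    """Return updated (home_elo, away_elo) after one game."""
    # Home-field advantage enters only the pregame win expectation.
    elo_diff = (home_elo + HOME_ADVANTAGE) - away_elo
    exp_home = 1.0 / (1.0 + 10 ** (-elo_diff / 400.0))

    margin = abs(home_points - away_points)
    if home_points > away_points:
        result_home = 1.0
    elif home_points < away_points:
        result_home = 0.0
    else:
        result_home = 0.5

    # Margin-of-victory multiplier: blowouts count for more, with
    # diminishing returns, and slightly less when the favorite wins big.
    winner_diff = elo_diff if home_points > away_points else -elo_diff
    mov = math.log(margin + 1) * 2.2 / (winner_diff * 0.001 + 2.2)

    shift = K * mov * (result_home - exp_home)
    return home_elo + shift, away_elo - shift
```

Notice what the function never sees: rosters, injuries, or whether a close final score flattered the loser. Everything reduces to who won, by how much, and where.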
To understand why, imagine a hypothetical great team that goes 16-0 over the course of a season, winning every game in dominant fashion. Imagine further that it performs equally well every game, and that the team hadn’t been nearly as good in previous seasons. The Elo system would start the year with a guess as to the team’s quality (based on last year’s rating, adjusted for expected regression to the mean). As the team kept winning, its Elo rating would rise each week. It would climb a little after the team went 1-0 and 2-0, and it would have climbed a lot by the time the team reached 16-0. The team would end the year with a much higher Elo rating than it started with.
At some level, this makes intuitive sense. Going 16-0 is a lot more impressive than going 2-0. Vanishingly few teams that start 2-0 will end up 16-0. But in our example, the team did not actually play better toward the end of the year than at the beginning; we deliberately devised the example to be of a team playing equally well every week. What we have at the end of the year is more information about the team being great, rather than the team actually playing better. And that is a critical distinction. If you consulted the Elo rating alone, you would wrongly conclude that the team was more dominant at the end of the season than at the beginning.
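The thought experiment is easy to simulate. In the toy run below, a hypothetical team beats an average (1500-rated) opponent in exactly the same fashion every week, with margin-of-victory effects omitted for simplicity; its rating nonetheless climbs all season, because each win is new evidence, not better play:

```python
# Toy simulation: a team plays a 1500-rated opponent every week and
# wins identically each time, yet its Elo keeps rising all season.
# Basic Elo update with K = 20; margin of victory omitted for clarity.

K = 20

def expected(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

rating = 1500.0  # Elo's preseason guess for our hypothetical team
for week in range(16):
    rating += K * (1.0 - expected(rating, 1500.0))  # identical win each week

print(round(rating))  # well above 1500 after 16 identical performances
```

The weekly gains shrink as the rating rises (each win becomes less surprising), but they never reflect the fact that the team's play was constant throughout.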
This methodological shortcoming is a fatal flaw when it comes to historical rankings like the ones 538 publishes, for the simple reason that the Elo system carries over a good portion of a team’s rating from one year to the next (with the aforementioned off-season regression adjustment). As a result, it has a substantial bias toward overrating teams that are nearing the end of a run of greatness, relative to teams at the beginning of their run.
Again, to understand this, imagine a hypothetical dynastic team that dominates football for three seasons, and which plays equally well each season. The Elo system gives it a lower starting rating at the beginning of season 1 than at the beginning of season 3. As a result, it will rank the third season as the best, even if seasons 1 and 2 were equally good. Granted, a three-season dynasty is more impressive than a one-season peak, and it’s also true that we have more information after three seasons to judge a team’s greatness than we had after one. But this is different from showing that the team was actually better in the third season. A historically accurate rating system should not single out the third season as the best one, any more than it should single out the first.
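The dynasty effect can be simulated too. The sketch below assumes 538's published off-season rule of reverting each rating one-third of the way toward a league mean (roughly 1505), then has the team play three identical dominant seasons. The end-of-season ratings rise year over year even though the on-field performance never changes:

```python
# Sketch of the dynasty effect. Assumes 538's described off-season
# carryover (revert one-third of the way toward a ~1505 league mean),
# basic Elo update with K = 20, identical wins over 1500-rated
# opponents every week. Illustrative only.

K, LEAGUE_MEAN = 20, 1505.0

def expected(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

rating = LEAGUE_MEAN
season_end_ratings = []
for season in range(3):
    rating = rating * 2 / 3 + LEAGUE_MEAN / 3  # off-season regression
    for week in range(16):
        rating += K * (1.0 - expected(rating, 1500.0))  # identical win
    season_end_ratings.append(round(rating))

print(season_end_ratings)  # rises each year despite identical play
```

Because only part of the rating is surrendered between seasons, season 3 starts from a higher baseline than season 1, and the identical performance is scored as greater.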
One needn’t inspect 538’s historical Elo ratings (based on a blend of annual mean, peak and final ratings) for long to discover evidence that this shortcoming is a real problem. It’s immediately apparent in how Elo ranks teams in back-to-back seasons. Consider the 2013 and 2014 Seattle Seahawks:
2013 Seahawks: 13-3, Won Super Bowl
2014 Seahawks: 12-4, Lost Super Bowl
The 2013 edition of the Seahawks was the stronger of the two teams. Their record was better (13-3 vs 12-4). Their point differential was superior (186 vs 140). They played a tougher regular season schedule. They crushed Denver in one of the most decisive Super Bowl routs of all time, 43-8. The 2014 team not only lost the Super Bowl, it had a somewhat less impressive regular season. Yet Elo ranks the 2014 Seahawks as the greater team – the 22nd strongest team of all time – with the 2013 Seahawks ranked 28th. This is entirely because Elo assigned the 2013 Seahawks a season-starting rating of 1600, whereas the 2014 edition started out at 1679. The 2014 edition wasn’t better; Elo simply had better information, entering that season, that the Seahawks would be competitive.
Elo makes the same mistake with the 1977-78 Dallas Cowboys:
1977 Cowboys: 12-2, Won Super Bowl
1978 Cowboys: 12-4, Lost Super Bowl
The 1977 edition of the Dallas Cowboys went 12-2. They easily stomped their way through the post-season, beating the Bears 37-7, the Vikings 23-6, and the Broncos 27-10 in a Super Bowl that wasn’t even that close. The 1978 edition was also quite good but not as dominant. They went 12-4 in the regular season and were challenged by Atlanta in the playoffs (27-20) before losing to Pittsburgh in the Super Bowl. Elo finds the 1978 team to be the 40th greatest of all time, the 1977 team the 44th. Why does Elo think the 1978 edition was slightly better? Only because they started the season with an initial-guess rating of 1670, as opposed to the prior season’s initial guess of 1583. Were it not for the larger error in that 1977 initial guess, the 1977 team would clearly be rated the superior one.
Elo even produces this effect with respect to the 1972 Dolphins, one of the legendary teams of all time, which won the Super Bowl after an undefeated season:
1972 Dolphins: 14-0, Won Super Bowl
1973 Dolphins: 12-2, Won Super Bowl
The 1973 Dolphins were indeed a great team, and performed even better in the post-season than they had in 1972. However, they lost twice in the regular season. The primary reason Elo finds the 1973 Dolphins to be the 6th greatest team of all time (consigning the legendary 1972 team to the rank of 17th) is that the 1973 edition started its year with an initial rating of 1685, in the afterglow of the undefeated season, whereas the 1972 team had started at only 1583.
A clinching piece of evidence of this systemic problem is that if the 1972 Dolphins’ performance were moved to 1973, and the 1973 Dolphins’ performance moved to 1972, Elo would change its mind about which performance was better. The 1972 performance would be rated as greater if it had occurred later. The system would similarly change its mind about the relative strengths of the 1977-78 Cowboys and the 2013-14 Seahawks, simply by changing the dates on which they performed.
Because of this shortcoming, many of the Elo ratings are hard to justify. Elo considers the second greatest team in NFL history to be the 2004 Patriots. Those Patriots were indeed a great team; they weren’t the second-best team of all time. Compare, for example, the 2004 Patriots to the 1985 Chicago Bears, who happen to be ranked by ESPN’s Page Two, its readers and some others as the greatest-ever team:
2004 Patriots: 14-2, Won Super Bowl
1985 Bears: 15-1, Won Super Bowl
The 2004 Patriots were a genuinely excellent team and a strong Super Bowl champion. They went 14-2, and they scored 177 more points than their opponents during the regular season before edging the Philadelphia Eagles 24-21 in the Super Bowl. But they were not the 1985 Chicago Bears. Those Bears went 15-1 and outscored their regular-season opponents by 258 points before blasting through the post-season in a manner rarely seen – posting two playoff shutouts and winning their three post-season games by a combined score of 91-10. The Bears’ 46-10 drubbing of New England in the Super Bowl remains one of the most brutal bludgeonings in championship history. And yet Elo considers the 2004 Patriots the better team, almost solely because they started their season with a lofty 1667 rating, whereas the 1985 Bears started down at 1542.
If you’re still not convinced, consider that these are by no means the most extreme examples of problematic Elo rankings. Combing through the ratings unearths some tellingly poor comparative evaluations:
1992 Redskins: 9-7, Lost Divisional Playoff
1999 Rams: 13-3, Won Super Bowl
Elo ranks the 1992 Redskins the 198th best team of all time, the 1999 Rams the 201st. That Rams team was the so-called “Greatest Show on Turf,” going 13-3, outscoring their opponents by a whopping 284 points, and taking the Super Bowl. The 1992 Redskins posted an unimpressive 9-7 record during the regular season. They got a nice win in the wild card game over an 11-5 Vikings team, but were simply not in the same class as the 1999 Rams. Elo thinks the Redskins were better solely because it had no idea at the start of the 1999 season how good the Rams would be – assigning them a starting rating of 1408.
Or consider these two teams:
1970 Colts: 11-2-1, Won Super Bowl
2005 Patriots: 10-6, Lost Divisional Playoff
Elo believes the 2005 Patriots (ranking: 206) were better than the 1970 Colts (ranking: 209). Those Colts were Super Bowl winners after posting their conference’s best record by a comfortable margin. By contrast, there were actually five teams in the 2005 AFC with better records than the Patriots, who were decisively knocked from the playoffs in a two-touchdown loss at Denver. This bizarre result again derives from where the Elo ratings stood at the start of each season: the Pats started 2005 at a gaudy 1713, a carry-over from their 2004 season (which, per above, Elo also overrates).
One of the most absurd comparative evaluations deriving from the Elo ratings is this one:
1981 49ers: 13-3, Won Super Bowl
2005 Chargers: 9-7, No Post-season Appearance
The 2005 Chargers were a fairly undistinguished team, going 9-7 and finishing 3rd out of 4 teams in the AFC West. They failed to make the playoffs. The 1981 49ers of Joe Montana and Ronnie Lott went 13-3, closing out their season by winning the first of their five Super Bowls. Which team was better? Elo believes the Chargers were – ranking them 261 all time, the 49ers 267.
538’s Elo system is best understood as a fun statistical toy; it’s easy to grasp and it’s enjoyable to examine the results. It suffers from substantial flaws, however, one of which is that it conflates the accumulation of evidence that a team is good with the team actually playing better. Because of this flaw, the system treats a great team that evolves from a good one very differently from a great team that evolves from a bad one. All in all, the Elo system is just not meaningfully useful for comparing the historical greatness of different NFL teams.
Charles Blahous, a contributor to E21, holds the J. Fish and Lillian F. Smith Chair at the Mercatus Center and is a visiting fellow at the Hoover Institution. He recently served as a public trustee for Social Security and Medicare.