Cinderella stories? A College of Charleston student examines March Madness upsets through math

Drew Passarello, a student at the College of Charleston, takes a closer look at how math relates to upsets and predictability in March Madness.


The Madness is coming. In a way, it is already here! With the March Madness field announced, the craziness of filling out tournament brackets is upon us. Can math give us a better handle on where we might see upsets? In this post, I will detail how math helps us estimate the level of madness to expect in the tournament. Said another way: how many upsets do we expect? Will there be a lot? We call that a bad year, since many upsets leave brackets with lower predictive accuracy. By the end of the article, you will see how math can earmark teams that might be on the cusp of pulling off the upsets that capture national attention.

Where am I learning this math? I am taking a sports analytics class at the College of Charleston under the supervision of Dr. Tim Chartier and Dr. Amy Langville. Part of our work has been researching new results and insights in bracketology. My research uses the Massey and Colley ranking methods. Part of my research deals with the following question: What are good years and bad years in terms of March Madness? In other words, before the tournament begins, what can we infer about how predictable the tournament will be?
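For readers who want to experiment, here is a minimal sketch of both methods in Python. It assumes each game is stored as a (winner, loser, point differential) tuple for teams indexed 0 to n-1; the function names and data layout are my own, not anything from the class materials.

```python
import numpy as np

def massey_ratings(games, n):
    # Build the Massey least-squares system M r = p, where each game says
    # rating(winner) - rating(loser) is roughly the point differential.
    M = np.zeros((n, n))
    p = np.zeros(n)
    for w, l, d in games:
        M[w, w] += 1; M[l, l] += 1
        M[w, l] -= 1; M[l, w] -= 1
        p[w] += d; p[l] -= d
    # M is singular; the standard fix replaces the last row with ones
    # (forcing the ratings to sum to zero).
    M[-1, :] = 1
    p[-1] = 0
    return np.linalg.solve(M, p)

def colley_ratings(games, n):
    # Colley's system C r = b: C_ii = 2 + games played by team i,
    # C_ij = -(games between i and j), b_i = 1 + (wins_i - losses_i) / 2.
    # Scores are ignored entirely; only wins and losses matter.
    C = 2 * np.eye(n)
    b = np.ones(n)
    for w, l, _ in games:
        C[w, w] += 1; C[l, l] += 1
        C[w, l] -= 1; C[l, w] -= 1
        b[w] += 0.5; b[l] -= 0.5
    return np.linalg.solve(C, b)
```

Note that Massey uses margins of victory while Colley uses only wins and losses, which is part of why the two methods produce ratings on such different scales.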

One way of answering this question is to look at how accurately a method predicts tournament winners, coupled with how high its ESPN score is. However, I also wanted to account for the variability in the level of competition going into the tournament, which is why I also looked at the standard deviation of the ratings of the teams in March Madness. A higher standard deviation implies a more spread-out level of play. Ultimately, a good year has high tournament accuracy, a high ESPN score, and a high standard deviation of ratings for the teams competing in March Madness. Conversely, a bad year has low tournament accuracy, a low ESPN score, and a low standard deviation of ratings. This assessment is relative to the ranking method itself and defines good and bad years solely in terms of past March Madness data.

I focused on ratings from the uniformly weighted Massey and Colley ranking methods, since weighting might add some bias; however, this simple assessment can be applied to other weighted variations of Massey and Colley. I found the mean accuracy, mean ESPN score, and mean standard deviation of ratings of the March Madness teams for the years 2001 – 2014, and then checked which years fell above or below these corresponding means. Years above all three means were deemed good, years below all three were deemed bad, and the remaining years were labeled neutral. The good years for Massey were 2001, 2004, 2008, and 2009; the bad years were 2006 and 2010 – 2014; the neutral years were 2002, 2003, and 2007. For Colley, the good years were 2005 and 2007 – 2009; the bad years were 2001, 2006, and 2010 – 2014; the neutral years were 2002 – 2004.

A very interesting trend I noticed in both Massey and Colley is that the standard deviation of the ratings of the March Madness teams from 2010 to 2014 was significantly lower than in the years before. This leads me to believe that the March Madness field has recently become more competitive, which would also partially explain why 2010 – 2014 were bad years for both methods. However, this does not necessarily imply that 2015 will be a bad year.
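To make this labeling concrete, here is a minimal sketch of the classification in Python. The per-year numbers below are placeholders, not the actual 2001 – 2014 values; only the structure follows the procedure described above.

```python
# Each year maps to its tournament accuracy, ESPN score, and the standard
# deviation of the ratings of that year's March Madness teams.
years = {
    2001: {"acc": 0.66, "espn": 1050, "std": 8.9},  # placeholder values
    2002: {"acc": 0.61, "espn": 980, "std": 8.2},
    # ... one entry per year through 2014
}

def label_years(years):
    acc_mean = sum(y["acc"] for y in years.values()) / len(years)
    espn_mean = sum(y["espn"] for y in years.values()) / len(years)
    std_mean = sum(y["std"] for y in years.values()) / len(years)
    labels = {}
    for yr, m in years.items():
        above = (m["acc"] > acc_mean, m["espn"] > espn_mean, m["std"] > std_mean)
        if all(above):
            labels[yr] = "good"     # all three metrics above their means
        elif not any(above):
            labels[yr] = "bad"      # all three metrics below their means
        else:
            labels[yr] = "neutral"  # mixed signals
    return labels
```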

In order to get a feel for how accurate the ranking methods will be this year, I created a regression line based on the years 2001 – 2014, with tournament accuracy as the dependent variable and the standard deviation of the ratings of the March Madness teams as the independent variable. Massey is predicted to have 65.81% accuracy in predicting winners this year, whereas Colley is predicted to have 64.19% accuracy. The standard deviation of the ratings for the teams expected to be in the tournament was 8.0451 for Massey and 0.1528 for Colley (the two methods rate teams on very different scales, so these values are not directly comparable to each other), and these most closely resemble the standard deviations of the ratings of the March Madness teams in 2002 and 2007.
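Here is a minimal sketch of that regression step, assuming `stds` and `accs` hold the fourteen yearly values for one method. The numbers below are illustrative placeholders, not the actual data.

```python
import numpy as np

# Standard deviation of ratings (independent) and tournament accuracy
# (dependent) for 2001-2014; placeholder values for illustration only.
stds = np.array([8.9, 8.5, 8.7, 9.1, 8.3, 7.9, 8.6,
                 9.0, 8.8, 7.2, 7.0, 7.1, 6.9, 7.3])
accs = np.array([0.68, 0.63, 0.64, 0.69, 0.62, 0.58, 0.65,
                 0.70, 0.67, 0.57, 0.55, 0.56, 0.54, 0.58])

slope, intercept = np.polyfit(stds, accs, 1)  # least-squares regression line

# Plug in this year's standard deviation to predict tournament accuracy.
this_year_std = 8.0451  # the Massey figure quoted above
predicted_acc = slope * this_year_std + intercept
print(f"Predicted accuracy: {predicted_acc:.2%}")
```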

After this assessment, I wanted to figure out what defines an upset relative to the ratings. To answer this, I looked at season data and focused on uniform Massey. Specifically, for this year I used the ratings from the first half of the season to predict the first week of the second half and then updated the ratings. I then used the updated ratings to predict the next week, updated again, and so on up to the present. For the games predicted incorrectly, the median difference in ratings was 2.2727 and the mean was 3.0284. I therefore defined an upset this year as a game in which the absolute difference in ratings is greater than or equal to three; this definition is relative to this particular year. I then tracked the upsets among the teams expected to be in the tournament: the number of upsets each team pulled off, the number of times each team was upset, and the score differentials and rating differences in those games. Comparing these trends, I identified the following upset teams to look for in the tournament: Indiana, NC State, Notre Dame, and Georgetown. These teams had a higher ratio of upsets pulled off to times upset than the other teams, and they had games in which the score differences and rating differences were larger than those of the other March Madness teams.
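Here is a rough sketch of that rolling evaluation, reusing the massey_ratings() function from the earlier sketch. The threshold of three comes straight from the analysis above; the helper names and data layout are my own.

```python
UPSET_THRESHOLD = 3.0  # absolute rating gap at or above this marks an upset

def rolling_upsets(first_half, weeks, n, rating_fn=massey_ratings):
    # `first_half` and `weeks` hold (winner, loser, point_diff) tuples;
    # `weeks` is a list of weekly game lists from the second half of the season.
    games_so_far = list(first_half)
    upsets = []
    for week in weeks:
        r = rating_fn(games_so_far, n)  # ratings through the previous week
        for w, l, d in week:
            gap = abs(r[w] - r[l])
            # The prediction is wrong when the actual winner was rated lower;
            # if the rating gap was also >= 3, we call the game an upset.
            if r[w] < r[l] and gap >= UPSET_THRESHOLD:
                upsets.append((w, l, d, gap))
        games_so_far.extend(week)       # fold this week into the ratings
    return upsets
```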

I am still working on ways to weight these upset games from the second half of the season; one approach relies on the score differential of the game. Essentially, a team that upsets an opponent by many points should benefit more in the ratings, and likewise a team that gets upset by many points should be penalized more. For a fun and easy bracket, I am going to weight upset games heavily in the week before conference tournament play and the first week of conference tournament play. For both uniform Massey and Colley, these two weeks gave the best correlation between their predictive accuracy and the accuracy in March Madness.
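As a rough illustration of the idea, here is one way such a weighting could be bolted onto the Massey sketch from earlier. The weight formula itself is my own illustrative guess, not the final formula I am working toward.

```python
import numpy as np

def weighted_massey(games, n, upsets, scale=0.1):
    # `upsets` is a set of (winner, loser, point_diff) tuples flagged as
    # upsets; the weight (1 + scale * margin) is a placeholder choice.
    M = np.zeros((n, n))
    p = np.zeros(n)
    for w, l, d in games:
        # Upset games count more, and more so for larger margins of victory,
        # boosting the winner and penalizing the loser more heavily.
        weight = 1.0 + scale * d if (w, l, d) in upsets else 1.0
        M[w, w] += weight; M[l, l] += weight
        M[w, l] -= weight; M[l, w] -= weight
        p[w] += weight * d
        p[l] -= weight * d
    M[-1, :] = 1  # same singularity fix as the unweighted version
    p[-1] = 0
    return np.linalg.solve(M, p)
```

Let the madness begin!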