In Dialogue: Christopher Phillips and Tim Chartier on Sports & Statistics

Question: How would you describe the intersection between statistics and sports? How does one inform the other?

Christopher Phillips, author of Scouting and Scoring: Sports have undoubtedly become one of the most visible and important sites for the rise of data analytics and statistics. In some respects, sports seem to be an easy, even inevitable place to apply new statistical tools: most sports produce a lot of data across teams and seasons; games have fixed rules and clear measures of success (e.g., wins or points); players and teams have incentives to adjust in order to gain a competitive edge.

But as I discuss in my new book Scouting and Scoring: How We Know What We Know About Baseball, it is also easy to fall prey to myths about the use of statistics in sports. Though these myths apply across many sports, it is easiest to home in on baseball, as that has been one of the most consequential areas for statistics.

Perhaps the most persistent and pernicious myth is that data emerge naturally from sporting events. There is no doubt that new video-, Doppler-, and radar-based technologies, especially when combined with increasingly cheap computing power and storage capability, have dramatically expanded the amount of data that can be collected. But it takes a huge amount of labor to create, collect, clean, and curate data, even before anyone tries to analyze them. Moreover, some data, like errors in baseball, are inescapably the product of individual judgment, which has to be standardized and monitored.

The second myth is that sport statistics emerged only recently, particularly after the rise of the electronic computer. In fact, statistical analysis in sports goes back decades: in baseball, playing statistics have been used to evaluate players for year-end awards and to negotiate contracts for as long as professional baseball has existed. (And statistics were collected and published for cricket decades before baseball’s rules were formalized.) As new methods of statistical analysis emerged in the early twentieth century in fields like psychology and physiology, some observers immediately tried to apply them to sports. In the 1910 book Touching Second, the authors promoted the use of data for shifting around fielders and for scouting prospects, two of the most important uses of statistical data in the modern era as well. There’s certainly been a flurry of new statistics over the last twenty years, but the general idea isn’t new: consider that Allen Guttmann’s half-century-old book From Ritual to Record highlights the “numeration of achievement” and the “quantification of the aesthetic” as defining features of modern sport.

Finally, it’s a myth that there is a fundamental divide between those who look at performance statistics (i.e., scorers) and those who evaluate bodies (i.e., scouts). The usual gloss is that scouts are holistic, subjective judges of quality whereas scorers are precise, objective measurers. In reality, baseball scouts have long used methods of quantification, whether for the pricing of amateur prospects, or for the grading of skills, or the creation of single metrics like the Overall Future Potential that reduce a player to a single number. There’s a fairly good case to be made that scouts and other evaluators of talent are even more audacious quantifiers than scorers in that the latter mainly analyze things that can be easily counted.

Tim Chartier, author of Math Bytes: Data surrounds us. The rate at which data is produced can make us seem like specks in the cavernous expanse of digital information.  Each day 3 billion photos and videos are shared on Snapchat.  In the last minute, 300 hours of video were uploaded to YouTube.  Data is offering new possibilities for insight. Sports is an area where data has a traditional role and newfound possibilities, in part, due to the enlarging datasets. 

For years, there have been a number of constants in baseball, including the ball, bat, bases, and statistics like balls, strikes, hits, and outs. Statistics are, and have simply always been, a part of the game. You can find from the 1920 box score that Babe Ruth got 2 hits in 4 at-bats in his first game as a Yankee. While new metrics have emerged with analytical advances, the game has been well studied for some time. As Ford C. Frick stated in Games, Asterisks and People,

“Baseball is probably the world’s best documented sport.”

While this is true, the prevalence of data does not necessarily result in trusting the recommendations of those who study it. For example, manager Bobby Bragan stated, “Say you were standing with one foot in the oven and one foot in an ice bucket. According to the percentage people, you should be perfectly comfortable.” This underscores an important aspect of data and analytics. Data can inherently lead to insight, but it becomes actionable only when one trusts how accurately it reflects our world.

Other sports, while not as statistically robust as baseball, also have an influx of data. In basketball, cameras positioned in the rafters report the (x,y) position of every player on the court and the (x,y,z) position of the ball throughout the entire game every fraction of a second. As such, we can replay aspects of games via this data for years to come. With such data comes new information. For example, we know that Steph Curry, while averaging just over 34 minutes a game, runs, on average, just over 2.6 miles per game. He also runs almost a quarter of a mile more on offense than defense.
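A figure like miles run per game can be derived from position samples by adding up the straight-line distance between consecutive readings. The snippet below is only a sketch of that idea: the function names, the use of feet as the unit, and the sample path are all assumptions made for illustration, not details of any real tracking system.

```python
import math

FEET_PER_MILE = 5280


def distance_run(positions):
    """Total path length for chronological (x, y) samples, in the same unit as the input."""
    return sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))


def feet_to_miles(feet):
    return feet / FEET_PER_MILE


# Hypothetical samples (in feet) for one player over a few moments of play.
path = [(0, 0), (3, 4), (3, 10)]
total_feet = distance_run(path)
print(f"{total_feet} feet, or {feet_to_miles(total_feet):.4f} miles")
```

With real data the samples arrive many times per second, so small measurement noise can inflate the total; some smoothing would probably be needed before trusting the result.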

While such data can be stunning in its size and detail, it also comes with challenges. How do you recognize a pick and roll versus an isolation play from what are essentially dots moving in a plane? Further, basketball, like football but unlike baseball, generally involves multiple players at a time. How much credit do players get for a basket on offense? A player’s position may open up possibilities for scoring, even if that player didn’t touch the ball. As such, metrics have been and continue to be created in order to better understand the game.

Sports are played with a combination of analytics, gut and experience.  What combination depends on the sport, player, coach and context.  Nonetheless, data is here and will continue to give insight on the game. 

March Mathness 2015: The Wrap Up

The champion has been crowned! After an eventful and surprising March Madness tournament, Duke has been named the new NCAA national champion.

A year of bragging rights goes to PUP paperbacks manager Larissa Skurka (98.6 percent) and PUP executive math and computer science editor Vickie Kearn (98.4 percent), who took first and second place in our ESPN bracket pool. Congrats to both! Check out all of the results here.

As we wrap up March Mathness, here are two final guest posts from basketball fans who used math and Tim Chartier‘s methods to create their brackets.

 Swearing by Bracketology

By Jeff Smith

My name is Jeff Smith, and I’ve been using Tim Chartier’s math algorithms to help with my March Madness brackets for several years now. I met Tim when we were traveling the ‘circuit’ together in creative ministries training. You may only know Tim for his math prowess, but I knew him for his creativity before I knew he was a brilliant mathematician. He and his wife, Tanya, are professional mimes, and his creativity is genius too.

Several years ago, he mentioned his method for picking brackets at a conference where we were doing some training together. He promised to send me the home page for his site and I could fill out my brackets using his parameters and formula. I was excited to give it a shot. Mainly, because I am part of a men’s group at our church that participates in March Madness brackets every year. Bragging rights are a big deal…for the whole year. You get the picture.

Also, I have two boys who did get one of my genes: the competitive edge. I sat down and explained the process. Because they did not know Tim, they were a little more skeptical, but I promised it wouldn’t hurt to try. That year, in a pool of 40+ guys, we all finished in the top ten. We were all hooked!

Since then, I have contacted Tim each year and reminded him to send me the link to his site where I could put in our numbers to fill out our brackets. Generally, the three of us each incorporate different parameters because we have different philosophies about the process. It has become a family event: we sit around the dinner table, almost ceremonially, and place our output in the brackets. The submission is generally preceded by trash talking, prayer, and fasting. (Well, probably not the fasting, because we fill up with nachos and chips during the process.)

Men of March Mathness: Jeff, Samuel, Ben Smith.

This year, I was in South Africa on a mission trip during the annual ritual. Thank God for video chatting and internet access. Halfway across the world, we were still able to be together and place our brackets into the pool. It was such a wonderful experience. While my boys veered from the path, picking intuitively instead of statistically, I didn’t stray far. (I was strong!) If it hadn’t been for Villanova, whom I will never choose again in a bracket, I would be leading the pack. But I’m still in the top ten of the men’s bracket at my church, with an outside shot of winning. In the Princeton bracket, I’m doing even better because I stayed away from the guessing game a little more.

I do not follow college basketball during the season. I’m from central Pennsylvania, and Penn State doesn’t have a good basketball team. So, I have no passion for the basketball season. Periodically, I’ll watch a game because my boys are watching, but generally, basketball season is the long wait until baseball season. (Go Pirates!) So, March Mathness has saved my reputation. It makes me look like a genius. Other guys in the group are looking at my bracket for answers. My boys and I are sworn to secrecy about the formula. The only reason I write this is because I’m sure none of them read this blog! But I’m thankful for Tim and the formula and the chance to look good in front of friends. I have never won the pool; however, if you factor in my finishes over the years I have been using Tim’s formula, I have the best average of all the guys.

 

 What Do Coaches Have to Do with It?

By Stephen Gorman, College of Charleston student

It’s that time of year again. The time of year when everyone compares brackets to see who did the best. But if your bracket was busted early, don’t worry — you’re not the only one. In fact, nobody came out of the tournament with a perfect bracket.

The unpredictability of these games is an inescapable fact of March Madness. This tournament is so incredibly unpredictable that some people are willing to give out billions to anyone who can create a perfect bracket; Warren Buffett is one of these people. So is he crazy? Or does he realize your odds of creating a perfect bracket are 52 billion times worse than winning the Powerball? In layman’s terms – if you think playing the lottery is crazy, trying to create the perfect bracket is insane.
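The 52 billion figure can be roughly reproduced with a short calculation. Two assumptions go into it that the post does not spell out: every one of the 63 games in a 64-team bracket is treated as a coin flip, and the Powerball jackpot odds are taken to be 1 in 175,223,510 (the odds in effect in early 2015).

```python
# Assumption: 63 games, each picked correctly with probability 1/2.
GAMES_IN_BRACKET = 63
bracket_outcomes = 2 ** GAMES_IN_BRACKET

# Assumption: Powerball jackpot odds of 1 in 175,223,510.
POWERBALL_ODDS = 175_223_510

ratio = bracket_outcomes / POWERBALL_ODDS
print(f"{bracket_outcomes:,} equally likely brackets")
print(f"about {ratio / 1e9:.1f} billion times less likely than a Powerball jackpot")
```

Anyone who knows a little basketball does better than a coin flip, so the real odds are less grim than this, but still far worse than the lottery.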

However, once you can accept the statistics, predicting March Madness becomes a game of bettering your odds – and there are many predictive models that can help you out along the way. Some of these models include rating methods, like the Massey method, which takes into account score differentials and strength of schedule. In addition to this, there are weighting methods that can be applied to rating methods; these take into account the significance of particular games and even individual player statistics. However, I noticed there is one thing missing from these predictive models: a method that quantifies the value of a good coach. In order to take into account the importance of a coach, a fellow researcher (John Sussingham) and I decided to create our own rating system for coaches.
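For readers who have not seen it, here is a minimal sketch of the basic Massey idea: find ratings so that the difference between two teams' ratings matches the point margin of their game as closely as possible, in the least-squares sense. The teams and scores are made up, the per-game weight is set to 1 throughout, and this is an illustration under those assumptions rather than the code used on the March Mathness site.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def massey_ratings(teams, games):
    """games: list of (winner, loser, margin, weight) tuples."""
    idx = {t: i for i, t in enumerate(teams)}
    n = len(teams)
    M = [[0.0] * n for _ in range(n)]
    p = [0.0] * n
    for winner, loser, margin, weight in games:
        i, j = idx[winner], idx[loser]
        M[i][i] += weight
        M[j][j] += weight
        M[i][j] -= weight
        M[j][i] -= weight
        p[i] += weight * margin
        p[j] -= weight * margin
    # The system is singular, so the last equation is replaced
    # by the constraint that all ratings sum to zero.
    M[-1] = [1.0] * n
    p[-1] = 0.0
    return dict(zip(teams, solve(M, p)))


# Hypothetical three-team season: A beat B by 10, B beat C by 5, A beat C by 15.
teams = ["A", "B", "C"]
games = [("A", "B", 10, 1.0), ("B", "C", 5, 1.0), ("A", "C", 15, 1.0)]
ratings = massey_ratings(teams, games)
for team in sorted(ratings, key=ratings.get, reverse=True):
    print(team, round(ratings[team], 2))
```

Because every team's rating depends on the ratings of the teams it played, strength of schedule comes along for free. Changing the weight on a game is where the weighting methods enter.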

Using data available from SportsReference.com, we made a rating system that incorporated such factors as the coach’s career win percentage, March Madness appearances, and record of success in March Madness. But before we implemented it, we wanted to justify that it was, indeed, a good way to quantify the strength of a coach. In order to do this, we tested the coach ratings in two ways. The first was a comparison between how sports writers ranked the top 10 college basketball coaches of all time and what our coach ratings said were the best coaches of all time. The second was to test how the coach ratings did by themselves at predicting March Madness.

The comparison of the rankings is shown in the table below:

Rank | Our Results     | CBS Sports Results | Bleacher Report Results
1    | John Wooden     | John Wooden        | John Wooden
2    | Mike Krzyzewski | Mike Krzyzewski    | Bobby Knight
3    | Adolph Rupp     | Bob Knight         | Mike Krzyzewski
4    | Jim Boeheim     | Dean Smith         | Adolph Rupp
5    | Dean Smith      | Adolph Rupp        | Dean Smith
6    | Roy Williams    | Henry Iba          | Jim Calhoun
7    | Jerry Tarkanian | Phog Allen         | Jim Boeheim
8    | Al McGuire      | Jim Calhoun        | Lute Olson
9    | Bill Self       | John Thompson      | Eddie Sutton
10   | Jamie Dixon     | Jim Boeheim        | Jim Phelan

It is clear from the table above that there are striking similarities between all three rankings. This concluded our first test.

For the second test, we decided to use the coach ratings to predict the last fourteen years of March Madness. The results showed that over the last fourteen years, on average, coach ratings had 68.4 percent prediction accuracy and an ESPN bracket score of 946. As a comparison, the uniform (un-weighted) Massey method of rating (over the same timespan) had an average prediction accuracy of 65.2 percent and an average ESPN bracket score of 1006. Having a higher prediction accuracy but a lower ESPN bracket score essentially means that you have predicted more games correctly in the beginning of the tournament, but struggled in the later rounds. This goes to show that not only are these ratings good at predicting March Madness, but they stand their ground when compared to the effectiveness of very popular methods of rating.
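To see how a bracket can have higher accuracy yet a lower score, it helps to look at how the score is built. The sketch below assumes the standard ESPN scoring of 10, 20, 40, 80, 160, and 320 points per correct pick in rounds one through six; the two brackets are invented for illustration and are not our actual results.

```python
ROUND_POINTS = [10, 20, 40, 80, 160, 320]  # assumed points per correct pick, by round
GAMES_PER_ROUND = [32, 16, 8, 4, 2, 1]     # 63 games in total


def summarize(correct_by_round):
    """Return (prediction accuracy, ESPN-style score) for a bracket."""
    accuracy = sum(correct_by_round) / sum(GAMES_PER_ROUND)
    score = sum(c * p for c, p in zip(correct_by_round, ROUND_POINTS))
    return accuracy, score


# Hypothetical brackets: one strong early, one strong late.
early_heavy = [26, 11, 4, 1, 0, 0]
late_heavy = [21, 10, 5, 3, 1, 1]

for name, bracket in [("early-heavy", early_heavy), ("late-heavy", late_heavy)]:
    accuracy, score = summarize(bracket)
    print(f"{name}: {accuracy:.1%} accuracy, score {score}")
```

The early-heavy bracket gets one more game right overall, but the late-heavy bracket collects far more points because each later round is worth twice as much per game.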

To conclude this article, we decided that, this year, we would combine both the Massey ratings and our Coach ratings to make a bracket for March Madness. Over the last fourteen years, the combination-rating had an average prediction accuracy of 66.33 percent and an average ESPN bracket score of 1024. It’s interesting to note that while the prediction accuracy went down from just using the Coach ratings, the ESPN bracket score went up significantly. Even more interestingly, both the prediction accuracy and the ESPN Bracket score were better than uniform Massey.

This year, the combination-ratings had three out of the four Final Four teams correctly predicted with Kentucky beating Duke in the Championship. However, the undefeated Kentucky lost to Wisconsin in the Final Four. Despite this, the combination-ratings bracket still did well, finishing in the 87.6th percentile on ESPN.

Using math for March Madness bracket picks

The countdown to fill out your March Madness brackets is on! Who are you picking to win it all?

Today, we hear from Liana Valentino, a student at the College of Charleston who works with PUP authors Amy Langville and Tim Chartier. Liana discusses how math can be applied to bracket selection.

What are the chances your team makes it to the next round?

The madness has begun! Since the top 64 teams have been released, brackets are being made all over the country. As an avid college basketball fan my entire life, this is always my favorite time of the year. This year, I have taken a new approach to filling out brackets that consists of more than my basketball knowledge: I am using math as well.

To learn more about how the math is used to make predictions, information is available on Dr. Tim Chartier’s March Mathness website, where you can create your own bracket using math as well!

My bracket choices are decided using the Colley and Massey ranking methods; Colley only uses wins and losses, while Massey integrates the scores of the games. Within these methods, there are several different weighting options that will change the ratings produced. My strategy is to generate multiple sets of rankings, then determine the probability that each particular team will make it to a specific round. Using this approach, I am able to combine the results of multiple methods instead of having to decide on one to use for the entire bracket.
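As a rough illustration of the wins-and-losses side, the sketch below builds and solves the Colley system for a tiny, made-up season. Each team starts from a rating of 1/2, and the linear system adjusts that according to who beat whom. The team names and results are hypothetical, and no weighting is applied here.

```python
def colley_ratings(teams, games):
    """games: list of (winner, loser) pairs. Returns {team: rating}."""
    idx = {t: i for i, t in enumerate(teams)}
    n = len(teams)
    # Colley matrix: 2 + games played on the diagonal,
    # minus the number of meetings off the diagonal.
    C = [[2.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [1.0] * n  # right-hand side: 1 + (wins - losses) / 2
    for winner, loser in games:
        w, l = idx[winner], idx[loser]
        C[w][w] += 1
        C[l][l] += 1
        C[w][l] -= 1
        C[l][w] -= 1
        b[w] += 0.5
        b[l] -= 0.5
    # Gaussian elimination; no pivoting is needed because the
    # Colley matrix is strictly diagonally dominant.
    for col in range(n):
        for row in range(col + 1, n):
            f = C[row][col] / C[col][col]
            for k in range(col, n):
                C[row][k] -= f * C[col][k]
            b[row] -= f * b[col]
    r = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(C[i][k] * r[k] for k in range(i + 1, n))
        r[i] = (b[i] - tail) / C[i][i]
    return dict(zip(teams, r))


# Hypothetical results: A beat B, B beat C, A beat C.
colley = colley_ratings(["A", "B", "C"], [("A", "B"), ("B", "C"), ("A", "C")])
print({team: round(rating, 3) for team, rating in colley.items()})
```

Note that the margins never appear: a one-point win and a thirty-point win count the same, which is exactly the difference from Massey.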

Choosing what weighting options to use is a personal decision. I will list the ones I’ve used and the reasoning behind them using my basketball awareness.

(1)

Winning games on the road should be rewarded more than winning games at home. Because of that, I use constant weights of .6 for winning at home, 1.6 for winning away, and 1 for winning at a neutral location; these are the numbers used by the NCAA when determining RPI. I incorporate home and away weightings when performing other weighting methods as well.
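In code, this weighting is just a lookup on where the winner played. The sketch below uses the three values quoted above; the function names and the sample schedule are made up, and treating a weighted win as a fractional win is an assumption about how the weights feed into a rating method.

```python
# Weights from the text: home win 0.6, away win 1.6, neutral-site win 1.
LOCATION_WEIGHT = {"home": 0.6, "away": 1.6, "neutral": 1.0}


def location_weight(winner_site):
    """winner_site: where the winning team played ('home', 'away', or 'neutral')."""
    return LOCATION_WEIGHT[winner_site]


def weighted_wins(win_sites):
    """Sum of location weights over a team's wins."""
    return sum(location_weight(site) for site in win_sites)


# Hypothetical team with four wins: one at home, two away, one neutral.
sample_wins = ["home", "away", "neutral", "away"]
print(f"{len(sample_wins)} wins count as {weighted_wins(sample_wins):.1f} weighted wins")
```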

(2)

Margin of victory is another factor, but a “blow out” game is defined differently depending on the person. With that in mind, I ran methods capping the margin of victory at both 15 and 20. This means that if the cap is 15, then games with a point differential of 15 or higher are weighted the same. These numbers are mainly from personal experience. If a team wins by 20, I would consider that a blowout, meaning the matchup was simply unfair. If a team loses by 15, which in terms of the game is five possessions, the game wasn’t necessarily a blow out, but the winning team is clearly defined as better than the opposition.

In addition to this, I chose to weight games differently if they were close. I defined a close game as a game within one possession, therefore three points. My reasoning behind this was if a team is blowing out every opponent, it means those games are obviously against mismatched opponents, so that does not say very much about them. On the other hand, a team that constantly wins close games shows character. Also, when it comes tournament time, there aren’t going to be many blow out games, therefore teams that can handle close game situations well will excel compared to those who fold under pressure. Because of this, I weighted close games, within three points, 1.5, “blow out” games, greater than 20 points, .5, and any point differential in between as 1.
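Both margin rules are easy to write down. The sketch below caps a margin at a chosen cutoff and assigns the close-game and blowout weights described above; the function names are my own and the sample margins are invented.

```python
def capped_margin(margin, cap):
    """Treat every win by `cap` points or more as a win by exactly `cap`."""
    return min(margin, cap)


def closeness_weight(margin):
    """1.5 for games within three points, 0.5 for wins by more than 20, else 1."""
    if margin <= 3:
        return 1.5
    if margin > 20:
        return 0.5
    return 1.0


for margin in [2, 12, 18, 27]:  # hypothetical point differentials
    print(margin, capped_margin(margin, 15), closeness_weight(margin))
```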

(3)

Games played at different points in the season are also weighted differently. Would you say a team is the same in the first game as the last? There are three different methods to weight time, as provided by Dr. Chartier on his March Mathness site: linearly, logarithmically, and using intervals. Linear and logarithmic weights are similar in that both increase the weight of a game as the season progresses. These methods can be used if you believe that games towards the end of the season are more important than games at the beginning.

Interval weighting consists of breaking the season into equal-sized intervals and choosing specific weightings for each. In one instance, I weighted the games by splitting the season in half, down weighting the first half using .5, and up weighting the second half using 1.5 and 2. These decisions were made because during the first half of the season, teams are still getting to know themselves, while during the second half of the season, there are fewer excuses to make. Also, the second half of the season is when conference games are played, which are generally considered more important than non-conference games. For the people who argue that non-conference play is more important because it is usually more difficult than in-conference play, I also created one bracket where I up weight the first half of the season and down weight the second half.
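The three time weightings can be sketched as simple functions of the day on which a game was played. The exact formulas used on the March Mathness site are not given here, so the linear and logarithmic forms below are assumptions chosen only to show the shape of each option; the interval version follows the half-season split described above.

```python
import math


def linear_weight(day, season_length):
    """Assumed form: weight grows in proportion to how far into the season we are."""
    return day / season_length


def log_weight(day, season_length):
    """Assumed form: weight rises quickly early on, then levels off."""
    return math.log(1 + day) / math.log(1 + season_length)


def interval_weight(day, season_length, weights=(0.5, 1.5)):
    """Split the season into len(weights) equal parts and look up the part's weight."""
    k = len(weights)
    segment = min(int((day - 1) * k / season_length), k - 1)
    return weights[segment]


SEASON = 120  # hypothetical season length in days, with days numbered from 1
for day in (1, 30, 60, 61, 120):
    print(day,
          round(linear_weight(day, SEASON), 3),
          round(log_weight(day, SEASON), 3),
          interval_weight(day, SEASON))
```

Reversing the tuple to (1.5, 0.5) gives the bracket for people who think non-conference play matters more.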

(4)

The last weighting method I used incorporates whether a team was on a winning streak. In this case, we weight a game higher if one team breaks its opponent’s winning streak. Personally, I defined a winning streak as having won four or more games in a row.
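Detecting streak-breaking games means walking through the season in order and keeping a running count of each team's consecutive wins. The sketch below does that with the four-game definition above. The bonus weight of 2.0 is a placeholder I made up, since the actual value is a matter of choice, and the game list is hypothetical.

```python
from collections import defaultdict

STREAK_LENGTH = 4  # a winning streak is four or more wins in a row


def streak_breaking_weights(games, bonus=2.0):
    """games: chronological list of (winner, loser) pairs.

    Returns one weight per game: `bonus` (an assumed value) when the loser
    entered the game on a winning streak, and 1.0 otherwise.
    """
    streak = defaultdict(int)
    weights = []
    for winner, loser in games:
        weights.append(bonus if streak[loser] >= STREAK_LENGTH else 1.0)
        streak[winner] += 1
        streak[loser] = 0
    return weights


# Hypothetical schedule: A wins four straight, then F ends the streak.
schedule = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("F", "A")]
streak_weights = streak_breaking_weights(schedule)
print(streak_weights)
```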

I used several combinations of these various methods and created 36 different brackets that I have used to obtain the following information. Surprisingly, Kentucky only wins the tournament 75% of the time; Arizona wins about 20%, and the remaining 5% is split between Wisconsin and Villanova. Interestingly enough, the only round Kentucky ever loses in is the Final Four, so each time they do make it to the championship, they win. Duke is the only number 1 seed never predicted to win a championship.

Villanova makes it to the championship game 70% of the time, where the only team that prevents them from doing so is Duke, who makes it 25% of the time. The remaining teams for that side of the bracket that make it are Stephen F. Austin and Virginia, both with a 2.5% chance. Kentucky makes it to the championship game 75% of the time, while Arizona makes it 22%, and Wisconsin makes it 3%. However, if Arizona makes it to the championship game, they win it 88% of the time. Furthermore, Wisconsin is predicted to play in the championship game once, which they win.

The two teams Kentucky loses to in the Final Four are Arizona and Wisconsin. During the Final Four, Kentucky has Arizona as an opponent 39% of the time, where Arizona wins 50% of those matchups. Kentucky’s only other opponent in the Final Four is Wisconsin, where Wisconsin wins that game only 5% of the time. On the other side, Villanova makes it to the Final Four 97% of the time, where the one instance they did not was a loss to Virginia. Villanova’s opponent in the Final Four is Duke 72% of the time, Gonzaga 19%, Stephen F. Austin 6%, and Utah 3%. The only seeds that appear in the Final Four are 1, 2, and one 12 seed, Stephen F. Austin one time.

During the Elite 8, Duke is the only number 1 seed that does not make it 100% of the time, with Utah upsetting them in 17% of their matchups. The other Elite 8 member is Gonzaga 97% of the time. Kentucky’s opponent in this round is Notre Dame 47% and Kansas 53% of the time.

In the Sweet 16, there are eight teams that make it every time: Kentucky, Wisconsin, Villanova, Duke, Arizona, Virginia, Gonzaga, and Notre Dame. Kansas is the only number 2 seed not on the list as Wichita State is predicted to beat them in 8% of their matchups. Kentucky’s opponent in the Sweet 16 is Maryland 39%, West Virginia 36%, Valparaiso 14%, and Buffalo 11%. Valparaiso is the only 13 seed predicted to make it to the Sweet 16. Villanova’s opponent is either Northern Iowa 61% or Louisville 38%. Duke appears to be facing either Utah 67%, Stephen F. Austin 19%, or Georgetown 14%.

Now, for the teams that make it into the third round. I’m not sure how many people consider a 9 seed beating an 8 seed an upset, but the number 9 seeds that are expected to progress are Purdue, Oklahoma State, and St. John’s. In regards to the 10 seeds, Davidson is the most likely to continue with a 47% chance to move past Iowa, which is the highest percentage for an upset not including the 8-9 seed matchups. Following them is 11 seed Texas, who have a 42% chance of defeating Butler. For the 12 seeds, Buffalo is the most likely to continue with a 36% chance of beating Virginia. The 13 seed with the best chance of progressing is Valparaiso with 19% over Maryland. Lastly, the only 14 seeds that move on are Georgia State and Albany, which happens a mere 8% of the time.

In general, Arizona seems to win the championship when using Massey and linear or interval weighting without home and away. This could be because most of their losses happen during the beginning of the season, while they win important games towards the end. Using the Colley method is when most of the upsets are predicted. For example, Stephen F. Austin making it to the championship game happens using the Colley logarithmic weighting. Davidson beating Iowa in the second round is also found many times using different Colley methods.

Overall, there are various methods that include various factors, but there are still qualitative variables that we don’t include. On the other hand, math can do a lot more than people expect. Considering Kentucky is undefeated, I presumed the math would never show them losing, but there is a lot more in the numbers than you think. Combining the various methods on 36 different brackets, I computed the probabilities of teams making it to specific rounds and decided to make a bracket using the combined data. This makes it so I don’t have to decide on solely one weighting that determines my bracket; instead, I use the results from several methods. Unfortunately, there is always one factor we cannot consider, luck! That is why we can only make estimates and never be certain. From my results, I would predict to see a Final Four of Kentucky, Arizona, Villanova, Duke; a championship game of Kentucky, Villanova; and the 2015 national champion being Kentucky.
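Turning a stack of brackets into probabilities is just counting: for each round, tally how often each team appears and divide by the number of brackets. The sketch below shows the idea with four tiny, invented brackets rather than my actual 36, and the data layout (a dictionary from round name to teams) is simply one convenient assumption.

```python
from collections import Counter


def round_probabilities(brackets):
    """brackets: list of dicts mapping a round name to the teams reaching it.

    Returns {(round_name, team): fraction of brackets in which that happens}.
    """
    counts = Counter()
    for bracket in brackets:
        for round_name, teams in bracket.items():
            for team in teams:
                counts[(round_name, team)] += 1
    return {key: n / len(brackets) for key, n in counts.items()}


# Four hypothetical brackets, each produced by a different weighting.
brackets = [
    {"final": ["Kentucky", "Villanova"], "champion": ["Kentucky"]},
    {"final": ["Kentucky", "Villanova"], "champion": ["Kentucky"]},
    {"final": ["Kentucky", "Duke"], "champion": ["Kentucky"]},
    {"final": ["Arizona", "Villanova"], "champion": ["Arizona"]},
]
probs = round_probabilities(brackets)
champion_odds = {team: p for (rnd, team), p in probs.items() if rnd == "champion"}
print(champion_odds)
print("pick:", max(champion_odds, key=champion_odds.get))
```

The combined bracket then takes, for each slot, the team that shows up there most often.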

 

 

May the odds be in your favor — March Mathness begins

Let the games begin! After the excitement of Selection Sunday, brackets are ready for “the picking.” Have you started making your picks?

Check out the full schedule of teams selected yesterday, and join the fun by submitting a bracket to the official Princeton University Press March Madness tournament pool.

Before you do, we recommend that you brush up on your bracketology by checking out PUP author Tim Chartier’s strategy:

 

 

For more on the math behind the madness, head over to Dr. Chartier’s March Mathness video page. Learn three popular sport ranking methods and how to create March Madness brackets with them. Let math make the picks!

Be sure to follow along with our March Mathness coverage on our blog, and comment below with your favorite strategy for making March Madness picks.

March Mathness Winner

Davidson College student Jane Gribble was our March Mathness winner this year. Below she explains how she filled in her bracket.

 


 

I love basketball – Davidson College basketball. As a Davidson College cheerleader I have an enormous amount of school pride, especially when it comes to our basketball team. However, outside of Davidson College I know little to nothing about college basketball. I knew that UNC Chapel Hill was having a tough season because this is my sister’s alma mater. Also, I knew that New Mexico, Gonzaga, Duke, and Montana were all likely teams for the NCAA tournament because we had played these non-conference teams during our season and these were the most talked about non-conference games around campus. My name is Jane Gribble. I am a junior mathematics major and this is the first year I completed a bracket.

In Dr. Tim Chartier’s MAT 210 – Mathematical Modeling course we discussed sports ranking using the Colley method and the Massey method. We were given the opportunity to apply our new knowledge of sports ranking in the NCAA Tournament Challenge. Since Davidson College was participating in the tournament my focus was on one game, the Davidson/Marquette game in Lexington, KY. When we traveled to KY I thought I had missed my opportunity to fill out a bracket, but one of my classmates was also traveling for the game with the Davidson College Pep Band and had the modeling program on his computer. We completed our brackets in the hotel lobby in Kentucky the night before our game.

My bracket used the Massey method because in previous years it has had better success than the Colley method. I decided to submit only one bracket, a bracket solely based on math (partially because I know little about college basketball). As a cheerleader and a prideful student it upset me to have Davidson losing against Marquette the following night, but I wasn’t going to let a math model crush my personal dreams of success in the tournament.  The home games were weighted as .5 (it would have been 1 if it was an unweighted model) to take into account home court advantage. Similarly, away games were weighted as 1.5 and neutral games as 1. Also, the season was segmented into 6 equal sections. I believe games at the end of the season are more important than games at the beginning of the season because teams change throughout the year and the last games give the best perspective of the teams going into the tournament. There was no real reason for the numbers chosen, other than they increased each segment. The 6 equal sections were weighted: .4, .6, .8, 1, 1.5, and 2. With these weights in the Massey method my model correctly predicted the Minnesota upset, but missed the Ole Miss, LaSalle, Harvard, and Florida Gulf upsets.
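The weights above can be written as a single function of where and when a game was played. Multiplying the location weight by the segment weight is an assumption made for this sketch; the modeling program may combine them differently. The season length and the sample games are also made up.

```python
# Values from the text.
LOCATION = {"home": 0.5, "away": 1.5, "neutral": 1.0}
SEGMENTS = [0.4, 0.6, 0.8, 1.0, 1.5, 2.0]  # six equal parts of the season


def game_weight(site, day, season_length):
    """Weight for a win at `site` on `day` (days numbered from 1).

    Assumption: the location and time weights are multiplied together.
    """
    segment = min((day - 1) * len(SEGMENTS) // season_length, len(SEGMENTS) - 1)
    return LOCATION[site] * SEGMENTS[segment]


SEASON_DAYS = 120  # hypothetical
print(game_weight("home", 1, SEASON_DAYS))       # early home win counts least
print(game_weight("neutral", 70, SEASON_DAYS))   # mid-season neutral-site win
print(game_weight("away", 120, SEASON_DAYS))     # late road win counts most
```

Under these assumptions, a road win in the final weeks counts fifteen times as much as a home win on opening night.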

After Davidson’s tragic loss I could not watch any more basketball for a while. I even forgot that my bracket was in the competition. I only started paying attention to the brackets when a friend in the same competition congratulated me on being second going into the Elite 8; my math-based bracket was in the top 10 percent of all the brackets. Once he told me my bracket had a chance of winning, I paid attention to the rest of the games to see how my bracket was doing in the competition. After Davidson’s loss against Louisville last year in the tournament I never wanted to cheer for Louisville. To my surprise, I went into the final game this year cheering for Louisville because my model had Louisville winning it all. I was not cheering for Louisville because of any connections with the team, but was cheering to receive a free ice cream cone, a prize that our local Ben and Jerry’s donates to the winner of Dr. Chartier’s class pool.

Next year I hope to compete in the NCAA tournament challenge again. This year I greatly enjoyed the experience and want to continue submitting brackets for the tournament. Next year I will submit one bracket that uses the exact weightings of my bracket this year to see how it compares from year to year. This year I wanted to submit a math bracket that looked at teams who had injuries throughout the season. My motivation for this was Davidson’s player Clint Mann. Clint had to sit out many games towards the end of the season because of a concussion, but he had recovered in time for the NCAA tournament. I thought that our wins during the time without Clint showed our strengths as a team. Unfortunately this year I ran out of time to code this additional weighting. Hopefully next year my submissions will include a bracket using the weights from this year, a bracket that includes weights for teams with injured team members, and another bracket with varying weights.