Introducing nflscrapR – Part 1

By: Maksim Horowitz (@bklynmaks)

Introducing nflscrapR

While searching the web for a viable source of NFL data for exploration and analysis, I noticed that there was no such source readily available for my desired insights. After browsing through my go-to websites (such as football-reference and ESPN) I realized that none of that data could be used to extract the meaningful insights that many data-enthused and statistically inclined football analysts and fans were hungry for.

After discussing this issue with a number of my peers, I discovered an API maintained by NFL.com that has player, drive, and play-by-play information across whole seasons. If you’re thinking “too good to be true”, then you’re right. The data is stored in a JSON format and it’s messier than you could even imagine! So let the data wrangling begin, right? Wrong: we have already done it for you.

NFL fans and sports enthusiasts alike, I would like to introduce the nflscrapR package, an American football data aggregator that will scrape, clean, and parse play-by-play data across games, seasons, and careers.

The package includes detailed and clean football datasets for immediate use in football analytics and developing insights in the NFL (similar to what we see in the MLB, NBA, and NHL). This package was created to allow for open-source development, standardized data usage, and reproducible football analytics research, something that teams need and fans crave.

Functionality of nflscrapR

There are 11 functions stored in the nflscrapR package: nine produce dataframes primed for analysis and two are helper functions used in scraping. Probably the two most interesting functions in the package are the play-by-play parsing functions and the player-game functions.

  • game_play_by_play scrapes and parses play-by-play data from a specified game. Each game has a unique gameID, which serves as the function’s input; the output is a 61-column dataframe with detailed information on each play.

    – The season_play_by_play function outputs the same dataframe, but aggregated over an entire season. You simply input the year associated with a given NFL season and within minutes you have a detailed dataset of every play in your desired season.

  • playergame allows a user to gather all the measurable statistics of each player involved in a given game. Yes, that means this function will tell you about Peyton Manning’s kick return habits and Marshawn Lynch’s passing tendencies. Simply input the gameID associated with your game of choice and the dataset is created. (There is a function in the package that helps you find game IDs!)

    – season_playergame generates the same dataset as playergame, except across the entire season. Every player in each of the 256 games of a season has one row of statistics per game (e.g., if Joe Flacco records a pass in all 16 games, then he will have 16 rows in this dataset).

    – agg_playergame generates a dataset of season-total statistics. It uses season_playergame and aggregates statistics over the entire season, returning one row for each player per season.
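To make the relationship between the per-game and season-total datasets concrete, here is a minimal sketch that uses a small made-up data frame instead of scraped data. The column names (player, game_id, pass_att, pass_yds) and all values are illustrative assumptions, not the package’s actual schema.

```r
# Toy player-game data: one row per player per game.
# Column names and values are made up for illustration only.
player_games <- data.frame(
  player   = c("QB A", "QB A", "QB B", "QB B", "QB B"),
  game_id  = c(1, 2, 1, 2, 3),
  pass_att = c(30, 25, 40, 35, 28),
  pass_yds = c(250, 180, 310, 275, 190)
)

# Collapse to one row per player by summing each statistic. This mirrors
# the kind of aggregation agg_playergame is described as performing.
season_totals <- aggregate(
  cbind(pass_att, pass_yds) ~ player,
  data = player_games,
  FUN  = sum
)

print(season_totals)
```

The per-game table keeps one row per player per game, while the aggregated table has one row per player, which is the distinction between the season_playergame and agg_playergame outputs described above.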

Example Usage of nflscrapR

Below are a few examples of how you can use nflscrapR to do your own NFL research. Before we get going, here are a few things to note:

  • The data only goes back to 2009
  • This is a preliminary version of the software, so please let us know if you identify any issues
  • I am only using regular season data, but you can easily use playoff data if desired

Identifying and Defining Clutch Quarterback Play

An interesting concept in football is the idea of “clutchness”, or whether a player performs well in high-pressure situations. Many believed that Tim Tebow had it, and yet he fizzled out of the NFL after only three seasons. We see this attribute for players in the Madden NFL video games, but how is it actually calculated? When does “clutch time” begin? How do we know if a player is clutch or not? These are questions that need answering, and I have taken a stab at answering them.

In this example, I will attempt to define and group “clutchness” for quarterbacks. I chose to define a clutch situation as any play in the fourth quarter that occurs with less than 5 minutes remaining and a point differential of one score or less. Additionally, within the clutch time subset, I eliminated all quarterbacks with fewer than 20 attempts. Using a number of different summary statistics, I examined which quarterbacks performed best when the game was on the line, and then created groups for clutch quarterback play.
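The clutch-time definition above can be sketched in a few lines of base R. This is only an illustration on randomly generated plays: the column names (passer, qtr, secs_left, score_diff, attempt, complete) are hypothetical, and treating “one score” as a margin of 8 points or fewer is my assumption, since the exact cutoff is not stated.

```r
set.seed(1)
n <- 6000

# Synthetic pass plays. All column names here are hypothetical.
secs_left <- sample(1:3600, n, replace = TRUE)      # seconds left in regulation
plays <- data.frame(
  passer     = sample(c("QB A", "QB B", "QB C", "QB D"), n, replace = TRUE,
                      prob = c(0.40, 0.30, 0.25, 0.05)),
  qtr        = 4 - (secs_left - 1) %/% 900,         # quarter implied by the clock
  secs_left  = secs_left,
  score_diff = sample(-21:21, n, replace = TRUE),   # offense minus defense score
  attempt    = 1,                                   # every row is a pass attempt
  complete   = rbinom(n, size = 1, prob = 0.6)      # 1 if the pass was completed
)

# Clutch time: fourth quarter, under five minutes left, one-score game.
# "One score" is taken to mean within 8 points (an assumption).
clutch <- subset(plays, qtr == 4 & secs_left < 300 & abs(score_diff) <= 8)

# Attempts and completions per quarterback in clutch time.
qb <- aggregate(cbind(attempt, complete) ~ passer, data = clutch, FUN = sum)

# Keep quarterbacks with at least 20 clutch attempts, then rank by completion rate.
qb <- qb[qb$attempt >= 20, ]
qb$comp_pct <- qb$complete / qb$attempt
qb[order(-qb$comp_pct), ]
```

Applying the 20-attempt minimum after the clutch filter, not before it, matters: the threshold is on clutch-time attempts, not on a quarterback’s total attempts.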

First, I took a look at completion percentage in clutch time situations. Examining the graph, you can see that the usual suspects are up there. Roethlisberger is believed to be “clutch” around the league and across numerous news sources. We see a number of elite quarterbacks such as Brees, Luck, and Palmer who are generally thought of as clutch. However, it is curious to see that players such as Winston, Davis, Webb, and Bridgewater have such high clutch time completion percentages. This is a product of their small sample sizes: Winston just made the cutoff for qualifying quarterbacks with 21 clutch time passes last season (the Bucs had a lot of close games), and the same can be said for many of the other young quarterbacks. Maybe one of the most striking players on the graph is Tony Romo. He has been regarded as one of the most unclutch quarterbacks of all time ever since his botched hold on a field goal attempt in the 2006 Wild Card round. In any case, it’s clear that completion percentage is not the be-all and end-all measurement of clutch quarterback play. So, I took a look at a few more statistics.

[Figure: clutch-time completion percentage by quarterback]

Examining scoring was my next step. To find total points, I counted the number of clutch touchdowns each quarterback threw, then multiplied by 7 (in expectation, touchdowns are worth almost exactly 7 points). I then calculated points per attempt so that players were not rewarded or penalized for playing in more or fewer clutch time snaps. What I observed for both total points and points per attempt was similar to what I saw for completion percentage. Points per attempt allows us to see how often the quarterbacks in our dataset find the end zone in clutch moments. We see that elite and veteran quarterbacks are in the minority among the top 15 quarterbacks by points per attempt. A few reasons could be as follows:

  1. Quarterbacks who have played longer have more attempts, and thus their points per attempt value is diluted.
  2. Elite quarterbacks often find themselves in “clutch situations” less often in the regular season due to their dominance. This could potentially be a sample size issue where elite quarterbacks have less opportunity to make plays in the clutch. Alternatively, when elite quarterbacks are in clutch situations they are often relied upon much more heavily than an average or young quarterback, which could lead to an inflation in the number of total passes and the number of poor passes resulting in incompletions.
  3. Our points formula only accounts for touchdowns that the quarterback throws or runs for. This means that if a quarterback drives their team down to the 1-yard line and the tailback scores the touchdown, then we do not credit the QB with generating any points (something to improve on in the future!).
  4. Our points formula also does not account for field goals kicked at the end of a drive.

Even given the four potential sources of error listed above, it is interesting to see that mobile quarterbacks tend to have a higher points per attempt value. This could be because mobile quarterbacks are a threat to make plays on the run and through the air.
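The points calculation described above is simple enough to write out directly. The sketch below uses made-up totals for three hypothetical quarterbacks; the column names are illustrative, and the only number taken from the post is the 7 points credited per touchdown.

```r
# Toy clutch-time totals per quarterback (made-up numbers).
qb <- data.frame(
  passer    = c("QB A", "QB B", "QB C"),
  attempts  = c(120, 45, 21),
  clutch_td = c(6, 3, 2)
)

points_per_td <- 7  # the post's approximation of a touchdown's expected value

qb$total_points       <- qb$clutch_td * points_per_td
qb$points_per_attempt <- qb$total_points / qb$attempts

# Rank by points per attempt; compare with the ordering by total points.
qb[order(-qb$points_per_attempt), ]
```

In this toy data the quarterback with the fewest attempts ranks first by points per attempt and last by total points, which is the small-sample effect discussed in the list of caveats.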

Now let’s take a look at total points. What you see is something a bit more expected. The younger quarterbacks have disappeared and many of the “elite” quarterbacks are listed (not you, Ryan Fitzpatrick).

As I mentioned above, the data I am using only goes back to 2009, so players like Tom Brady and Peyton Manning who would be at the top of this list are found a few notches down or not at all.

These statistics give us another take on defining what clutch quarterback play looks like. There are five players found on both lists: Stafford, Palmer, Rodgers, Orton, and Luck. Three of these players are regarded as franchise quarterbacks. Both of Rodgers’ Hail Mary completions this year come to mind as timely (if highly improbable) plays, but those aside, it is generally agreed that Rodgers is a top-three quarterback in the league, in part because of his performance in clutch time. Andrew Luck’s appearance here also makes sense. He played incredibly in the 2013-2014 Wild Card round, leading an improbable comeback against the Chiefs. Then, in 2014-2015, he led the Colts to the AFC Championship game. In his career, Luck has 10 fourth-quarter comebacks and 14 game-winning drives. The man has the ability to summon his best play when his team needs him most. Palmer and Stafford are both gunslingers who can get the job done when their teams are in need (for the most part). Orton was the curious case for me, but after looking into his come-from-behind win statistics it began to make sense. Orton has eight come-from-behind wins (six of which were after 2009) and has manufactured nine game-winning drives.

[Figure: clutch-time total points by quarterback]

Overall, the three statistics I identified do a decent job of summarizing clutch quarterback play. Each statistic has its biases, but viewed together they tell a more detailed story. Still, I wanted to take it a step further and try a more rigorous method to define clutchness.

To better identify clutch quarterback play and clutch quarterbacks, I tested a model-based clustering method using the mclust R package (see Fraley and Raftery 2002). Before starting, I added two more variables for each quarterback:

  • Clutch First Downs: The number of first downs the quarterback threw or ran for in clutch time
  • Interception Rate: Interceptions per attempt

The clustering method I used allowed me to group together the 76 quarterbacks in the sample. The Mclust function allows you to specify the numbers of clusters to test, so I specified a range of two to ten groups. The function then identifies which number of groups maximizes the Bayesian information criterion (BIC). In our case, the number of groups was three, so our clustering method assigned each quarterback to one of three groups (think elite, average, replacement level).
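As a rough sketch of this step, the code below fits Mclust over two to ten groups on randomly generated stand-ins for the five clutch-time statistics. The data, the column names, and the decision to standardize the variables with scale() are all my assumptions; the post does not show the original call or say whether the inputs were scaled.

```r
library(mclust)  # model-based clustering; install.packages("mclust") if needed

set.seed(2016)
n_qb <- 76

# Synthetic stand-ins for the five clutch-time statistics. The values are
# random and exist only to make the example runnable.
qb_stats <- data.frame(
  comp_pct           = rnorm(n_qb, mean = 0.58, sd = 0.07),
  total_points       = rpois(n_qb, lambda = 40),
  points_per_attempt = rnorm(n_qb, mean = 0.30, sd = 0.10),
  first_downs        = rpois(n_qb, lambda = 30),
  int_rate           = rnorm(n_qb, mean = 0.03, sd = 0.01)
)

# Fit Gaussian mixture models with 2 to 10 components. Mclust keeps the
# covariance structure and number of components with the highest BIC.
# Standardizing with scale() is an assumption, not something the post states.
fit <- Mclust(scale(qb_stats), G = 2:10)

fit$G                       # number of clusters selected
table(fit$classification)   # how many quarterbacks fall in each cluster
```

Because the inputs here are random noise, the number of groups selected will not match the three groups found on the real data; the point is only the shape of the call and where the cluster labels end up (fit$classification).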

The plot below shows two-dimensional scatter plots of the data. Each point represents a quarterback and the points are colored by the associated cluster. If you study each of the scatter plots, you can see that the different clusters are well grouped (in terms of distance) and have minimal between-group overlap, which is ideal as it allows for easier separation of the clutch time statistics of each cluster.

[Figure: two-dimensional scatter plots of quarterbacks, colored by cluster]

Below is a table of the different clusters, each with 7 of the member quarterbacks listed. Even by looking at a sample of quarterback names from each cluster, differences are evident. Group 1 contains elite or Pro Bowl level quarterbacks, group 2 contains average starters, and group 3 contains backups or replacement level players.

[Table: seven member quarterbacks from each of the three clusters]

Summary statistics provide further insight into the differences between these groups. Each statistic in the table below is the average for the quarterbacks in that cluster. Looking at total first downs and total points, we see that group 1 dominates the other two groups. Group 1 also has the edge in Completion Percentage and Points per Attempt.
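Per-cluster averages like these can be computed with a single aggregate() call once each quarterback has a cluster label. The sketch below uses six made-up rows; the numbers and column names are illustrative only and are not the values from the table.

```r
# Toy per-quarterback statistics with an assigned cluster (made-up values).
qb <- data.frame(
  cluster      = c(1, 1, 2, 2, 3, 3),
  comp_pct     = c(0.64, 0.62, 0.58, 0.56, 0.50, 0.48),
  total_points = c(70, 56, 35, 28, 14, 7),
  first_downs  = c(60, 52, 33, 29, 12, 8)
)

# Average every statistic within each cluster: one row per cluster.
cluster_means <- aggregate(. ~ cluster, data = qb, FUN = mean)

print(cluster_means)
```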

The model-based clustering method uses the five statistics I collected to group quarterbacks by their clutch play, and based on my knowledge of football it seemed to do a good job of assigning groups and separating the quarterbacks by general skill level and success. Overall, I am happy with the results, but some improvements can definitely be made. Using expected points instead of points per attempt or total points would account for more of each quarterback’s point contribution (Brian Burke). It would also be nice to have data going back earlier than 2009, as players such as Tom Brady and Brett Favre seem to be undervalued. That said, this is a good start, and it allows us to begin to visualize and define clutch quarterback play.

[Table: average clutch-time statistics for each cluster]

Concluding Remarks

Using the nflscrapR package in this analysis allowed me to gather data within seconds so I could spend the majority of my time exploring the data for insights. Quantifying clutchness has been an area of research in the NFL for years, and with the proper data and just a few hours of hard work I was able to build a preliminary model to define it. This is just one of the many incredible things that the nflscrapR package offers to the NFL analytics community! If you want to start your own analysis, download the package using the following R code and begin transforming NFL analytics and building a name for yourself:


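```{r, eval = FALSE}
# devtools provides install_github() for installing packages from GitHub
install.packages("devtools")
library(devtools)

# Install nflscrapR from the author's GitHub repository
devtools::install_github(repo = "maksimhorowitz/nflscrapR")
```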


 

 


6 thoughts on “Introducing nflscrapR – Part 1”

  1. Hey, I’m trying to go through the example you posted on the github page, and am running into some issues.
    > players2009 <- season_player_game(2009)
    Error in match.fun(FUN) : object 'playergame' not found

    other functions are working, but there seems to be some bug in season_player_game(). It might stem from the difference between playergame and player_game, which was an issue on the github page.
    Just wanted to let you know, love the package.

