Predicting the Next Year’s Worth of Baseball Hall of Famers

Written by Scott Steinberg

Increasingly publicized voting ballots, heated debates over social media, and a seemingly higher frequency of controversial Curt Schilling tweets mean that the release of the Baseball Hall of Fame Class of 2017 is here. This voting season’s significance is marked by the final year of Tim Raines’s voting eligibility, the first year on the ballot for players such as Vladimir Guerrero and Ivan Rodriguez, and a potential make-or-break year for noted steroid users such as Roger Clemens and Barry Bonds. With so much at stake over the next few years, how can voting history be used to predict who makes the Hall of Fame and who has to settle for the Hall of Very Good?

In order to create my Hall of Fame prediction model, I used the voting history and statistics of all players who entered the Hall of Fame ballot since 1991. I based my model on numerous factors, found each factor’s correlation with a player’s voting percentage from their last year on the ballot, and added each player’s standardized score for each factor (weighted by that factor’s R-squared value) to create one number predicting the player’s fate (credits to Matthew Yudin for helping me create this). The factors and their correlations are listed below.

  • WAR: Wins Above Replacement. This number estimates how many more games a team would win with this player instead of a replacement-level player.
  • WAR7: The player’s combined WAR over their best 7 seasons.
  • Number of All-Star Game Appearances
  • # of Benchmarks the Player Reached: This includes 3000 Hits, 500 Home Runs, 3000 Strikeouts, and 300 Wins. I chose these benchmarks because every player in MLB who reached at least one of them has made the Hall of Fame, not including those linked to steroids or betting on baseball. These benchmarks can help explain, for example, why Frank Thomas (521 HRs) got in easily while Fred McGriff (493) will likely fall off the ballot, even though both players were 5-time All-Stars. They are also probably the topics raised most often when discussing a player’s Hall of Fame case.
  • Combined # of MVP and Cy Young Awards
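To make the scoring recipe above concrete, here is a minimal Python sketch of the idea: standardize each factor, weight it by its R-squared against final-ballot vote share, and add everything up. The function names, the toy numbers, and the use of the population standard deviation are assumptions for illustration only; they are not the actual data or spreadsheet behind the model.

```python
from statistics import mean, pstdev


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


def composite_scores(factors, final_pct):
    """Return one score per player.

    factors:   dict mapping a factor name to a list of values (one per player)
    final_pct: list of final-ballot vote shares, in the same player order
    Each factor contributes (z-score of the player's value) * (factor R-squared).
    """
    scores = [0.0] * len(final_pct)
    for values in factors.values():
        weight = r_squared(values, final_pct)
        mu, sigma = mean(values), pstdev(values)  # assumption: population std dev
        for i, value in enumerate(values):
            scores[i] += weight * (value - mu) / sigma
    return scores


# Hypothetical toy data for four made-up players (not real voting results).
toy_factors = {
    "war": [90.0, 60.0, 30.0, 45.0],
    "all_star_games": [12, 7, 2, 1],
}
toy_final_pct = [0.95, 0.70, 0.05, 0.10]

for score in composite_scores(toy_factors, toy_final_pct):
    print(round(score, 2))
```

Because R-squared is never negative, this sketch treats every factor as "more is better"; a factor where lower values are better would need its sign flipped before being added.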

I spent a great deal of time testing various other factors, including World Series championship rings, WAR-WAR7 (a measure of the player’s effectiveness outside their prime), and the player’s best single-season WAR. Ultimately, the factors above, weighted by their respective correlations, proved to be the most accurate indicators of final vote percentage, at least for the last 25 years.

Because the criteria for electing relief pitchers into the Hall of Fame are unique, I created an entirely different model for them. In addition to All-Star games and MVP/Cy Young Awards, I factored in WAR-WAR7, postseason ERA, and saves. I also included a game-entering leverage index, which I got from Baseball Reference. The leverage index is a number quantifying the average pressure of the situation that the pitcher faces when entering a game (e.g. a pitcher entering a game with the bases loaded, a one-run lead, no outs and Mike Trout at the plate would have a very high leverage index).

Factor                        R-squared Value
All-Star Games                0.63
MVP/Cy Young Awards           0.37
WAR-WAR7                      0.28
Postseason ERA                -0.03
Saves                         0.12
Game-entering Leverage Index  0.27

Both models are strong predictors of modern Hall of Fame voting. For the model covering position players and starting pitchers, the top 35 players all made the Hall of Fame, and the only players outside the top 35 who made it are Kirby Puckett (ranked 43rd out of 323), Jim Rice (45th) and Tony Perez (55th). Non-HOFers within the top 55 include Dale Murphy, Fred Lynn, Bobby Grich, David Cone, Alan Trammell, Bret Saberhagen, Vida Blue, Kenny Lofton, Dave Stieb, Graig Nettles, Keith Hernandez, Lou Whitaker, Steve Garvey, Buddy Bell, Willie Randolph, Dave Parker, and Ted Simmons.

Player Score
Randy Johnson 9.34
Tom Seaver 8.82
Greg Maddux 8.34
Cal Ripken 7.54
Mike Schmidt 6.71
Steve Carlton 6.66
Rod Carew 6.38
Rickey Henderson 6.03
Phil Niekro 5.89
Pedro Martinez 5.86
George Brett 5.83
Ken Griffey Jr. 5.73
Nolan Ryan 5.62
Wade Boggs 5.39
Reggie Jackson 5.38
Tom Glavine 5.20
Eddie Murray 5.11
Tony Gwynn 4.75
Don Sutton 4.18
Frank Thomas 4.13

My model for relief pitchers has a slightly stronger correlation, albeit with a smaller sample size. I tuned the factors in this model so that the four relievers to enter Cooperstown over the last twenty-five years (Dennis Eckersley, Rollie Fingers, Goose Gossage and Bruce Sutter) ranked the highest, and so that Lee Smith (who’s bound for the Hall of Very Good) rated lower than anyone in that group. The five highest-rated relievers are Eckersley (5.0905), Gossage (3.0723), Fingers (3.0251), Sutter (2.3021), and John Franco (0.9347).

With the two groups combined, the model has a strong R-squared value of 0.73284. Compared with well-known Hall of Fame prediction models such as Bill James’s Hall of Fame Monitor (R-squared value of 0.65639) and Jay Jaffe’s JAWS system (0.47463), my algorithm is the strongest predictor, at least when considering the voting history of the last twenty-five years.

To predict a player’s Hall of Fame fate using my model, we calculate his score and plug it back into the model to get a player’s projected final voting percentage. However, let’s take it a step further and look at correlations for all ten years a player can be on the ballot to estimate when they get the call into Cooperstown. The only issue at this point is small sample size; only 27 players in my model were still on the ballot by year 4, and only 12 were still there by year 10. In order to make my correlations statistically significant, I extended outside my 25-year window to add more players with longer stays on the ballot, including players on ballots as early as 1974 such as Roger Maris and Elston Howard.

Even though the model has been created, the trickier part is figuring out how exactly it can predict a player’s path to the Hall. The easiest thing to do is to take every player in the data set, create a graph, and use the results of their first year to analyze the trends of the dataset and predict the rest of the player’s fate. But this results in awkward outcomes, such as Chipper Jones taking seven years to gain election and Vladimir Guerrero only getting 39% of the vote in his first year and never getting elected. Let’s solve this problem by calculating first-year voting percentage the same way I created the score: by separating the hitters/starters from the relief pitchers. Relief pitchers in my model may have an inflated score relative to their typically low first-year percentages (especially Bruce Sutter and Goose Gossage), so treating them separately should give the best prediction. So, I created four more linear regressions: one for hitters/starters with a score of 2 or above (typically a sure-fire Hall-of-Famer), one for those with a score between 1 and 2 (borderline honorees), one for those with a score below 1 (none of these players have gotten in, so they will be ignored for the most part), and one for all relief pitchers in my model.

Group                             1st-Year Percentage Equation (x = score)  R-squared Value
Batters/Starters (Score 2+)       0.0747x + 0.4323                          0.46
Batters/Starters (Score 1-2)      0.0671x + 0.0604                          0.01
Batters/Starters (Score Under 1)  0.0116x + 0.0188                          0.15
Relief Pitchers                   0.1595x + 0.0895                          0.75
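For readers who want to plug in scores themselves, the four equations above can be sketched as a single piecewise function. The function name and the `is_reliever` flag are my own labels for illustration, and I am assuming a score of exactly 1 or 2 falls into the higher group, following the "2 or above" wording.

```python
def first_year_share(score, is_reliever=False):
    """Projected first-year vote share (0.619 means 61.9%) from a player's score.

    Coefficients are the four regression lines listed in the table above.
    """
    if is_reliever:
        return 0.1595 * score + 0.0895
    if score >= 2:  # typically sure-fire Hall-of-Famers
        return 0.0747 * score + 0.4323
    if score >= 1:  # borderline honorees
        return 0.0671 * score + 0.0604
    return 0.0116 * score + 0.0188  # scores under 1


# A batter/starter with a score of 2.50 and a reliever with a score of 5.37.
print(f"{first_year_share(2.50):.1%}")
print(f"{first_year_share(5.37, is_reliever=True):.1%}")
```

Nothing in the function stops a projection from going above 100% or below 0% for extreme scores, so treat the output as a rough guide rather than a bounded probability.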

If we were to apply the equations to the players debuting on this year’s ballot, this is what the percentages would look like. I added several inactive players appearing on future ballots to further demonstrate the applications of my model. Quick note: I did factor for steroids. Players with a weak link to steroids (e.g. Ivan Rodriguez, whose chances the allegations won’t completely destroy) see their score drop 25%, while players who had their reputations completely tarnished by the allegations (think Barry Bonds and Roger Clemens) see an 85% decrease.

Player             Year  Score  First-Year Percentage
Ivan Rodriguez     2017  2.50   61.9%
Vladimir Guerrero  2017  2.26   60.1%
Manny Ramirez      2017  0.64   2.6%
Magglio Ordonez    2017  0.39   2.3%
Jorge Posada       2017  0.35   2.3%
Jim Thome          2018  3.16   66.8%
Chipper Jones      2018  2.97   65.4%
Scott Rolen        2018  1.85   18.4%
Andruw Jones       2018  1.42   15.6%
Johan Santana      2018  1.32   14.9%
Omar Vizquel       2018  -0.18  1.7%
Mariano Rivera     2019  5.37   94.6%
Roy Halladay       2019  3.04   65.9%
Todd Helton        2019  1.38   15.3%
Lance Berkman      2019  1.01   12.8%
Derek Jeter        2020  4.76   78.8%
Tim Hudson         2021  0.82   2.8%
David Ortiz        2022  2.50   61.9%
Alex Rodriguez     2022  1.47   15.9%
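The steroid adjustment described before the table is just a percentage cut to the score. A minimal sketch, assuming the cut is applied as a simple multiplier (the category labels and the raw score of 4.0 are hypothetical, since the pre-adjustment scores are not listed here):

```python
# Fraction of the score removed for each level of steroid association.
STEROID_PENALTY = {
    "none": 0.0,
    "weak": 0.25,    # allegations that won't completely destroy a candidacy
    "strong": 0.85,  # reputation completely tarnished
}


def adjusted_score(raw_score, link="none"):
    """Scale a raw score down by the penalty for the given steroid link."""
    return raw_score * (1.0 - STEROID_PENALTY[link])


# Hypothetical raw score of 4.0 under each category.
for link in ("none", "weak", "strong"):
    print(link, round(adjusted_score(4.0, link), 2))
```

One quirk of a multiplicative cut is that it would pull a negative score toward zero, so it only behaves as a penalty for players with positive scores.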

Some estimates, especially those for borderline players, seem plausible, although Guerrero and Ivan Rodriguez appear to be on the verge of entering the Hall of Fame in their first year on the ballot, which the model does not project. However, most of these numbers are gross underestimates. Obviously, there is no way Derek Jeter only garners 78.8% of the vote (even though he still gets in in his first year regardless); there is no chance Manny Ramirez completely drops off the ballot this year; and it is highly improbable that Omar Vizquel or Tim Hudson will drop off either. Also, Todd Helton’s rather low score in my model could mean that I am either not factoring in batting average strongly enough or that he is bound to receive the Larry Walker treatment from voters (Walker’s score, by the way, is 2.03, and his career batting average is only slightly lower than Helton’s).

To approximate a player’s voting percentages over subsequent years, I take his previous year’s voting percentage and add a forecasted change based on his score and his specific year on the ballot.

Years         Equation (x = score)  R-squared Value
Years 1 to 2  0.0151x - 0.0057      0.08403
Years 2 to 3  0.0161x - 0.001       0.08444
Years 3 to 4  0.0124x + 0.0103      0.09878
Years 4 to 5  0.0156x - 0.0163      0.09759
Years 5 to 6  0.0119x + 0.0019      0.0316
Years 6 to 7  0.0121x + 0.0041      0.06087
Years 7 to 8  0.0188x - 0.0112      0.0914
Years 8 to 9  0.0159x + 0.0019      0.08743
Final Year    0.0109x + 0.0192      0.03259
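Chaining these yearly equations is simple enough to sketch in a few lines of Python. The function name is mine, and I am assuming that the "Final Year" line is the step from year 9 to year 10 and that the projection stops once a player reaches the 75% election threshold.

```python
# (slope, intercept) of the forecasted year-over-year change, in ballot order.
YEARLY_CHANGE = [
    (0.0151, -0.0057),  # years 1 to 2
    (0.0161, -0.0010),  # years 2 to 3
    (0.0124, 0.0103),   # years 3 to 4
    (0.0156, -0.0163),  # years 4 to 5
    (0.0119, 0.0019),   # years 5 to 6
    (0.0121, 0.0041),   # years 6 to 7
    (0.0188, -0.0112),  # years 7 to 8
    (0.0159, 0.0019),   # years 8 to 9
    (0.0109, 0.0192),   # final year (assumed to be years 9 to 10)
]


def project_path(score, first_year_share, threshold=0.75):
    """Return projected vote shares by ballot year, starting with year 1.

    Each year adds slope * score + intercept to the previous year's share.
    The path ends at election (share >= threshold) or after year 10.
    """
    path = [first_year_share]
    for slope, intercept in YEARLY_CHANGE:
        if path[-1] >= threshold:
            break
        path.append(path[-1] + slope * score + intercept)
    return path


# A score of 2.50 with a 61.9% first-year share.
for year, share in enumerate(project_path(2.50, 0.619), start=1):
    print(year, f"{share:.1%}")
```

With a score of 2.50 and a 61.9% debut, the path crosses 75% in the fifth year; with a score of 1.85 and an 18.4% debut, it runs the full ten years and ends in the low 40s.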

Using the above formulas, let’s see how the next potential batch of Hall-of-Famers pans out. (NOTE: I applied a slight steroid-style penalty to Curt Schilling’s score to account for the votes he lost following his frequent controversial tweets.)

Player             Year  Score  1      2      3      4      5      6      7      8      9      10     14     15
Lee Smith          2003  1.95   42.3%  36.6%  38.8%  45.0%  39.8%  43.3%  44.5%  47.3%  45.3%  50.6%  34.1%  38.2%
Tim Raines         2008  1.77   24.3%  22.6%  30.4%  37.5%  48.7%  52.2%  46.1%  55.0%  69.8%  73.6%
Edgar Martinez     2010  1.81   36.2%  32.9%  36.5%  35.9%  25.2%  27.0%  43.4%  45.7%  48.7%  52.6%
Fred McGriff       2010  0.73   21.5%  17.9%  23.9%  20.7%  11.7%  12.9%  20.9%  21.1%  22.5%  25.2%
Jeff Bagwell       2011  2.18   41.7%  56.0%  59.6%  54.3%  55.7%  71.6%  74.6%  77.6%
Larry Walker       2011  2.03   20.3%  22.9%  21.6%  10.2%  11.8%  15.5%  18.4%  21.1%  24.5%  28.6%
Curt Schilling     2013  2.86   38.8%  29.2%  39.2%  52.3%  55.1%  58.7%  62.6%  66.9%  71.6%  76.6%
Roger Clemens      2013  1.81   37.6%  35.4%  37.5%  45.2%  46.4%  48.7%  51.3%  53.6%  56.7%  60.6%
Barry Bonds        2013  1.64   36.2%  34.7%  36.8%  44.3%  45.2%  47.4%  49.7%  51.7%  54.5%  58.2%
Sammy Sosa         2013  0.55   12.5%  7.2%   6.6%   7.0%   6.2%   7.1%   8.1%   8.1%   9.1%   11.6%
Mike Mussina       2014  1.86   20.3%  24.6%  43.0%  46.3%  47.6%  50.0%  52.7%  55.1%  58.2%  62.2%
Jeff Kent          2014  1.23   15.2%  14.0%  16.6%  19.1%  19.4%  21.1%  23.0%  24.2%  26.3%  29.5%
Gary Sheffield     2015  0.51   11.7%  11.6%  12.3%  14.0%  13.1%  13.9%  15.0%  14.8%  15.8%  18.3%
Trevor Hoffman     2016  2.30   67.3%  70.2%  73.8%  77.7%
Billy Wagner       2016  1.59   10.5%  12.3%  14.8%  17.8%  18.6%  20.7%  23.0%  24.9%  27.6%  31.2%
Ivan Rodriguez     2017  2.50   61.9%  65.1%  69.0%  73.1%  75.4%
Vladimir Guerrero  2017  2.26   60.1%  63.0%  66.5%  70.4%  72.3%  75.1%
Jim Thome          2018  3.16   66.8%  71.0%  76.0%
Chipper Jones      2018  2.97   65.4%  69.3%  74.0%  78.7%
Scott Rolen        2018  1.85   18.4%  20.6%  23.5%  26.8%  28.1%  30.5%  33.1%  35.5%  38.6%  42.5%
Andruw Jones       2018  1.42   15.6%  17.2%  19.3%  22.1%  22.7%  24.6%  26.7%  28.3%  30.7%  34.2%
Johan Santana      2018  1.32   14.9%  16.3%  18.3%  21.0%  21.4%  23.2%  25.2%  26.6%  28.8%  32.2%
Roy Halladay       2019  3.04   65.9%  69.9%  74.7%  79.5%
Todd Helton        2019  1.38   15.3%  16.8%  18.9%  21.7%  22.2%  24.0%  26.1%  27.6%  30.0%  33.4%
Lance Berkman      2019  1.01   12.8%  13.8%  15.3%  17.6%  17.6%  19.0%  20.6%  21.4%  23.2%  26.2%
David Ortiz        2022  2.50   61.9%  65.1%  69.1%  73.2%  75.5%
Alex Rodriguez     2022  1.47   15.9%  17.5%  19.8%  22.7%  23.3%  25.3%  27.5%  29.1%  31.6%  35.1%

So, what to make of this chart? If the numbers above hold true, then there will be no new members in the Hall of Fame this year. This is extremely unlikely, considering that both Jeff Bagwell and Tim Raines have already surpassed the estimated net gain in votes they need to cross the 75% threshold, according to @NotMrTibbs’s BBHOF tracker as of January 5, 2017. Also, even though the model predicts Chipper Jones should take four years to enter Cooperstown, he realistically will have no difficulty entering the Hall during his first year. Another reason for the underestimates, especially towards the end of a player’s time on the ballot, is that the ten-year rule has only been in place for two years, and this is the first year in which a player will be in his tenth and final year on the ballot. Because of this, all of my voting data came from years in which players could stay on the ballot for fifteen years. Since the early polls point to Raines easily passing the threshold, this year should be an interesting indicator of how the rule change affects balloting results. If not for this change, the numbers would point to Raines, Mussina, Clemens, and Bonds eventually entering the Hall, with Edgar Martinez just falling short.

Even though a linear regression on this data is not the best tool for producing exact numbers, it still has some predictive power. Importantly, it demonstrates the strong correlation between several select statistics and a player’s entry into the Hall of Fame. There are some contradictions involving players not originally included in my model, such as Raines’s and Mussina’s lower scores and Walker’s relatively high score, but the score is generally predictive of a player’s likelihood of election. And if players such as Raines and Mussina do enter the Hall of Fame, the model could change to accommodate them, just as it was fitted to the players already in it; perhaps Raines’s elite stolen base numbers or Mussina’s remarkable consistency would be weighted more heavily. Perhaps MVP awards should be weighted less to account for Walker’s struggles to gain momentum (without his one MVP award, Walker’s score is about 1.6), or his score could indicate a potential turnaround in voting for him. In truth, no matter what the numbers say, Hall of Fame voting will always be a debate. Highly-rated players will often be overlooked (Walker), and less highly-rated players will be voted in (Tony Perez, 1.146).

Who do you think should or will eventually make the Hall of Fame? Please leave a comment below; the debates shall go on and on.

Also, for those curious, here are the top 20 scores for currently active players.

Player Score
Albert Pujols 6.88
Ichiro Suzuki 4.22
Miguel Cabrera 3.46
Clayton Kershaw 3.08
Carlos Beltran 2.68
Adrian Beltre 2.07
Mike Trout 2.00
Justin Verlander 1.95
Robinson Cano 1.93
Chase Utley 1.75
CC Sabathia 1.56
Felix Hernandez 1.41
Joe Mauer 1.39
Francisco Rodriguez 1.31
Dustin Pedroia 1.20
David Wright 1.19
Jonathan Papelbon 1.19
Joey Votto 1.11
Joe Nathan 1.04
Max Scherzer 0.99
