

Run Expectancy and Markov Chains

August 14th, 2011

Sorry for the long interval between entries – I hope to get back to posting on a more regular basis. Continuing in the vein of my previous two posts, I’m still working my way towards baseball win expectancy, but I’m going to pause to examine run expectancy in more detail.

First, let’s look back at the run expectancy matrix from my last post. It was built by looking at each time a given base-out state occurred and counting how many runs were scored in the remainder of those innings (using the FATE_RUNS_CT field from Chadwick); a short R sketch of the calculation follows the table. I will refer to this as empirical run expectancy, since it is based on how many runs were actually scored following each base-out state.

Run Expectancy Matrix, Empirical
BASES   0 OUTS   1 OUT    2 OUTS
___     0.539    0.289    0.111
1__     0.929    0.555    0.240
_2_     1.172    0.714    0.342
__3     1.444    0.984    0.373
12_     1.542    0.948    0.464
1_3     1.844    1.204    0.512
_23     2.047    1.438    0.604
123     2.381    1.620    0.798
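Here is roughly how that calculation goes in R. The file name and the base-out column names (START_BASES_CD and OUTS_CT) are my assumptions about the Chadwick output, and the sketch ignores the filtering a careful version would need (innings cut short, home halves of final innings, and so on):

# Hypothetical file name for a season of Chadwick-parsed events
events <- read.csv("all2010.csv")

# Empirical run expectancy: average FATE_RUNS_CT over every occurrence
# of each of the 24 base-out states
re_matrix <- with(events, tapply(FATE_RUNS_CT,
                                 list(bases = START_BASES_CD, outs = OUTS_CT),
                                 mean))
round(re_matrix, 3)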


Run Expectancy and Base-Out Leverage Index

December 19th, 2010

Before I get into win expectancy, Win Probability Added, Leverage Index, and WPA/LI, I want to take a look at run expectancy, RE24, Base-Out Leverage Index, and RE24/boLI. Most of these stats were created and/or popularized by Tangotiger (his intro to Leverage Index is here). I have tried to mimic his methodology as closely as possible, but there may be some differences.
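To make RE24 concrete before the full writeup, here is a minimal sketch of the standard definition (not Tango’s exact implementation), using the empirical matrix from the post above: a batter is credited with the change in run expectancy across his plate appearance, plus any runs that scored on the play.

# Empirical run expectancy matrix from the 2011 post above
re <- matrix(c(0.539, 0.289, 0.111,
               0.929, 0.555, 0.240,
               1.172, 0.714, 0.342,
               1.444, 0.984, 0.373,
               1.542, 0.948, 0.464,
               1.844, 1.204, 0.512,
               2.047, 1.438, 0.604,
               2.381, 1.620, 0.798),
             nrow = 8, byrow = TRUE,
             dimnames = list(c("___", "1__", "_2_", "__3", "12_", "1_3", "_23", "123"),
                             c("0", "1", "2")))

# RE24 for one plate appearance: run expectancy after, minus run expectancy
# before, plus runs scored on the play (three outs ends the inning at zero)
re24 <- function(start_bases, start_outs, end_bases, end_outs, runs) {
  re_start <- re[start_bases, as.character(start_outs)]
  re_end <- if (end_outs >= 3) 0 else re[end_bases, as.character(end_outs)]
  re_end - re_start + runs
}

re24("___", 0, "_2_", 0, 0)  # leadoff double: 1.172 - 0.539 = 0.633
re24("1__", 2, "1__", 3, 0)  # inning-ending out: 0 - 0.240 = -0.240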


Building a Retrosheet Database, Part 1

October 27th, 2010

I want to be able to calculate Tangotiger’s WPA/LI stat (Win Probability Added/Leverage Index, a.k.a. situational wins, context neutral wins, or game state linear weights). To do that, I need to be able to calculate WPA and LI. To do that, I need to construct a Win Expectancy matrix. To do that, I need to build a Retrosheet database. So that’s where I’m going to start. I’ve never worked with a database or explored any Retrosheet data before, so I am starting from scratch (though I will be utilizing a lot of great resources from around the web). In a series of posts I will describe my process step-by-step. If you want to follow along, make sure you have a lot of free disk space (the parsed data files for all seasons take up over 5 GB). Also be aware that some of my instructions will be Windows-specific.
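As shorthand for that chain (my one-line summary of the standard definitions, not a quote from Tango): a play’s WPA is just the change in win expectancy it produces,

WPA_{play} = WE_{after} - WE_{before}

while LI measures how much win expectancy typically swings in that game state relative to an average state, so dividing WPA by LI strips that context back out.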


The Distribution of Talent Between Teams

October 20th, 2010

Four years ago Tango had a very interesting post on how talent is distributed between teams in different sports leagues. I want to revisit and expand upon some of the points that came up in that discussion.

First, let’s look at some empirical data. I scraped end-of-season records from the last ten years for the NFL, NBA, and MLB from ShrpSports (I decided to omit the NHL from this analysis due to the prevalence of ties). The data is available here as a tab-delimited text file. I used R to analyze the data. If you don’t have R, you can download it for free (if you use Windows, I recommend using it in conjunction with Tinn-R, which is great for editing and interactively running R scripts). Here is the R code I used:

# Ten seasons of end-of-season records: league, team, win_pct
records <- read.delim(file = "records.txt")
# Number of teams and regular-season games per team in each league
lgs <- data.frame(league = c("NFL", "NBA", "MLB"), teams = c(32, 30, 30), games = c(16, 82, 162))
# Observed variance of team winning percentage in each league
lgs$var.obs[lgs$league == "NFL"] <- var(records$win_pct[records$league == "NFL"])
lgs$var.obs[lgs$league == "NBA"] <- var(records$win_pct[records$league == "NBA"])
lgs$var.obs[lgs$league == "MLB"] <- var(records$win_pct[records$league == "MLB"])
# Variance expected from luck alone if every team were a true .500 team
lgs$var.rand.est <- .5 * (1 - .5) / lgs$games
# Estimated variance of true talent: observed minus random
lgs$var.true.est <- lgs$var.obs - lgs$var.rand.est
# Number of games after which a record gets regressed halfway to the mean
lgs$regress.halfway.games <- lgs$games * lgs$var.rand.est / lgs$var.true.est
lgs$regress.halfway.pct.season <- lgs$regress.halfway.games / lgs$games
# Noll-Scully ratio: observed SD of win_pct over the SD expected from luck alone
lgs$noll.scully <- sqrt(lgs$var.obs) / sqrt(lgs$var.rand.est)
# Probability that the better team (by true talent) also finishes with the better record
lgs$better.team.better.record.pct <- 0.5 + atan(sqrt(lgs$var.obs - lgs$var.rand.est) / sqrt(lgs$var.rand.est)) / pi
lgs

Printing lgs gives the resulting table (shown in the full post), with one row per league: the observed variance of winning percentage, the variance expected from luck alone, the implied variance of true talent, the number of games (and fraction of a season) after which a record gets regressed halfway to the mean, the Noll-Scully ratio, and the probability that the better team finishes with the better record.


The Origins of Log5

October 3rd, 2010

Of all Bill James’ sabermetric innovations, my favorite has always been the log5 formula for determining matchup probabilities. It provides a method for taking the strengths of two teams and estimating the probability of one team beating the other. It can also be applied to individual player matchups, such as a batter facing a pitcher.

Here is a common way to express the formula in the context of team matchups:

Win\%_{A vs. B} = \dfrac{Win\%_A \times (1 - Win\%_B)}{(Win\%_A \times (1 - Win\%_B)) + ((1 - Win\%_A) \times Win\%_B)}

Unfortunately this formulation doesn’t shed much light on why James called this log5, or where the formula came from in the first place.

James’ Original Formulation

James introduced the formula in the 1981 Baseball Abstract, which is excerpted here. In his initial presentation, James first converted each team’s winning percentage (or p, their probability of success) into what he called their log5.

\dfrac{log5}{log5 + .500} = p

Solving for log5:

log5 = .500 \times \dfrac{p}{1 - p}

After this conversion, the formula is simple:

p_{AvB} = \dfrac{log5_A}{log5_A + log5_B}
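To make this concrete (my numbers, not an example from the Abstract): a .600 team has a log5 of .500 \times \frac{.600}{.400} = .750, a .400 team has a log5 of .500 \times \frac{.400}{.600} \approx .333, and the formula gives

p_{AvB} = \dfrac{.750}{.750 + .333} \approx .692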

Logarithms, Odds, and Odds Ratios

So where does the “log” in log5 come from? I’m not sure exactly where James got it, but there is a connection to the logit function:

logit(p) = log\left(\dfrac{p}{1 - p}\right)

That \frac{p}{1 - p} term was present in James’ formulation. It is what is known as the odds, a common term in gambling: if some event has a .75 probability, the odds are \frac{.75}{(1 - .75)} = 3, typically expressed as 3:1 or 3 to 1.

Framing things in terms of odds rather than probabilities can be helpful.

Odds = \dfrac{p}{1 - p}

We can replicate the log5 formula by simply taking the odds ratio, which is just the odds for team A divided by the odds for team B.

OddsRatio_{AvB} = \dfrac{Odds_A}{Odds_B}

To convert this back to a probability we need one final step:

p_{AvB} = \dfrac{OddsRatio_{AvB}}{1 + OddsRatio_{AvB}}

Combining these steps, we have a simple formulation of log5:

p_{AvB} = \dfrac{Odds_A}{Odds_A + Odds_B}

This matches James’ original formulation, but here we see that one can use simple odds rather than James’ log5 term (which contains an unnecessary .500 multiplier).
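Running the same .600 and .400 teams from above through the odds form: Odds_A = \frac{.600}{.400} = 1.5, Odds_B = \frac{.400}{.600} = \frac{2}{3}, and

p_{AvB} = \dfrac{1.5}{1.5 + 2/3} \approx .692

which is the same answer the log5 terms gave.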

Tying this back into the logit function, we can reformulate log5 to say that the matchup probability is equal to the inverse-logit of the log of the odds ratio (of course, it’d be simpler to just say that the matchup odds are equal to the odds ratio, but then we’d be leaving the “log” out of “log5”).

The “5” in “log5” and a More General Formulation

The “5” part of “log5” was in reference to the fact that teams were being compared to .500, the average winning percentage. But when we’re dealing with individual matchups, the league average isn’t always .500 (for a batter/pitcher matchup to estimate the probability of a hit, we would need to use the league-wide batting average). To deal with this we need to add another term to the formula representing the league average probability (or odds).

In the odds ratio formulation, this is easy. We just divide by the league average odds (Odds_{LG}). When the league average probability is .500, the odds are \frac{.500}{(1 - .500)} = 1, so the term can be omitted without consequence.

OddsRatio_{AvB} = \dfrac{\frac{Odds_A}{Odds_B}}{Odds_{LG}}

Converting this to a probability, we have what I find to be the clearest formulation of the generalized log5 formula:

p_{AvB} = \dfrac{Odds_A}{Odds_A + (Odds_B \times Odds_{LG})}
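Here is a small R sketch of that generalized formula (my code, not James’). Note that, to mirror the team case above (where Win\%_B is B’s probability of success), p_b is B’s own probability of success against A; for a pitcher that means the probability of retiring the batter, one minus the batting average he allows.

# Generalized log5 via the odds-ratio formulation above.
# p_a and p_b are each side's probability of success against average opposition
# (for a pitcher, the probability of getting the out); lg is the league-average
# probability of the event, which defaults to .500 for team-vs-team matchups.
log5 <- function(p_a, p_b, lg = 0.5) {
  odds <- function(p) p / (1 - p)
  odds(p_a) / (odds(p_a) + odds(p_b) * odds(lg))
}

log5(0.600, 0.400)              # .600 team vs. .400 team: ~.692
log5(0.280, 0.770, lg = 0.260)  # .280 hitter vs. a pitcher who allows a .230
                                # average, with a .260 league average: ~.248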

Precedents

So was Bill James the first to discover the log5 formula? Not exactly. It turns out that log5 is a variation of the Bradley-Terry model for pairwise comparison, which was first published in 1952 (and which was itself a variation on 1929 work by the German mathematician Ernst Zermelo). The formula given on the Wikipedia page is equivalent to the inverse-logit formulation I discussed above, if the logs of each team’s odds are used for the scale locations (their formula uses the difference of the logs of the odds, which is equal to the log of the ratio of the odds that I used). Jim Albert and Jay Bennett discussed the Bradley-Terry model in their excellent book, “Curve Ball.” The Bradley-Terry model has been used for rating systems in many sports, including hockey and chess (I highly recommend Mark Glickman’s article “A Comprehensive Guide to Chess Ratings” for more background on paired comparisons and the connection between log5/Bradley-Terry and the logistic distribution).

For more on log5, here’s a good early piece by Dean Oliver, which includes a shortcut formula that mirrors one discussed by Joe Arthur in a great thread on Tango’s blog. Mike Tamada has also written some lucid intros to log5. Hal Stern’s work on paired comparisons is worth hunting down – he explicitly makes the link between the logit function and log5 in “Statistical Thinking in Sports” (for more references to his work, see this comprehensive bibliography on sports ranking systems, which points to a lot of other relevant articles). Padres analyst Chris Long also made the connection between log5 and Bradley-Terry in a presentation he gave last year. And finally, Steven Miller has written a nice short paper that provides a justification of log5 using the geometric series formula.

A Perl version of Tango’s Markov Model

September 28th, 2010

I have created a Perl version of Tangotiger’s excellent Markov run modeler. Tango’s original HTML/Javascript version can be found here, with further discussion here.

This is just a basic adaptation – I have not added any new features, though I hope to in the future (at the very least I would like to make a Perl version of Bill Skelton’s modification of Tango’s original).

To use my version, first download the zip file (markov.zip), extract the Perl script (markov.pl) and the example input file (input.csv), and place them in the same directory. Change the values in the input.csv file to alter the batting line and the chances of taking an extra base (but make sure not to alter the formatting of the file). Then just run the Perl script, which will produce a file named output.txt that is tab-delimited. If you open that in Excel you should be able to view all the results in table form. For simplicity’s sake I didn’t include any command line arguments to specify the names of the input or output files, so if you want to run the script multiple times and save your results you’ll either have to rename/copy the output file or alter the Perl script (note that the output file does include the input values inside it).

For those unfamiliar with Markov models of baseball, there are a lot of great resources on the web. Outside of Tango’s site, I recommend work by Mark Pankin, Joel Sokol (includes Matlab code), Bruce Bukiet (scroll down for “A Markov Chain Approach to Baseball”), Carl Morris, John Beamer (includes Excel spreadsheet with purchase), Tom Ruane, and Berselius (includes Matlab code, though link appears to be down).
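For readers who want to see the mechanics rather than follow links, here is a deliberately stripped-down sketch in R. It is not a port of Tango’s or Skelton’s code: it uses only six batting outcomes, crude advancement rules (no advancement on outs, every runner moves up one base on a single and two on a double), and made-up event probabilities, then solves for the expected runs from each of the 24 base-out states.

# Per-plate-appearance event probabilities (illustrative numbers, not league data)
events <- c(OUT = 0.680, BB = 0.090, `1B` = 0.155, `2B` = 0.045, `3B` = 0.005, HR = 0.025)

# Base state encoded in bits: 1 = runner on 1st, 2 = runner on 2nd, 4 = runner on 3rd
advance <- function(bases, event) {
  r1 <- bitwAnd(bases, 1) > 0
  r2 <- bitwAnd(bases, 2) > 0
  r3 <- bitwAnd(bases, 4) > 0
  runs <- 0
  if (event == "BB") {           # walk: runners advance only if forced
    if (r1 && r2 && r3) runs <- 1
    else if (r1 && r2) r3 <- TRUE
    else if (r1) r2 <- TRUE
    r1 <- TRUE
  } else if (event == "1B") {    # single: every runner moves up one base
    runs <- sum(r3); r3 <- r2; r2 <- r1; r1 <- TRUE
  } else if (event == "2B") {    # double: every runner moves up two bases
    runs <- sum(r2, r3); r3 <- r1; r2 <- TRUE; r1 <- FALSE
  } else if (event == "3B") {    # triple: all runners score
    runs <- sum(r1, r2, r3); r1 <- r2 <- FALSE; r3 <- TRUE
  } else if (event == "HR") {    # home run: everyone scores
    runs <- sum(r1, r2, r3) + 1; r1 <- r2 <- r3 <- FALSE
  }
  list(bases = r1 + 2 * r2 + 4 * r3, runs = runs)
}

# 24 transient states indexed by outs * 8 + bases + 1; three outs is absorbing
P <- matrix(0, 24, 24)  # transition probabilities among the transient states
r <- numeric(24)        # expected runs scored on one plate appearance from each state
for (outs in 0:2) for (bases in 0:7) {
  i <- outs * 8 + bases + 1
  for (ev in names(events)) {
    p <- events[[ev]]
    if (ev == "OUT") {           # no advancement on outs in this sketch
      new_outs <- outs + 1; new_bases <- bases; runs <- 0
    } else {
      res <- advance(bases, ev)
      new_outs <- outs; new_bases <- res$bases; runs <- res$runs
    }
    r[i] <- r[i] + p * runs
    if (new_outs < 3) {
      j <- new_outs * 8 + new_bases + 1
      P[i, j] <- P[i, j] + p
    }
  }
}

# Run expectancy for every base-out state: RE = (I - P)^(-1) r
RE <- solve(diag(24) - P, r)
matrix(round(RE, 3), nrow = 8,
       dimnames = list(bases = c("___", "1__", "_2_", "12_", "__3", "1_3", "_23", "123"),
                       outs  = c("0 outs", "1 out", "2 outs")))

Tango’s modeler works from a full batting line and extra-base-taking probabilities, but the underlying chain over the 24 base-out states is the same idea.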

Adjusted Plus-Minus in Hockey

September 26th, 2010

Though plus-minus originated in hockey and has been tracked by the NHL since 1968, development of the stat really picked up when it was adopted for use in basketball in the last decade. In 2000 Wayne Winston and Jeff Sagarin developed their WINVAL system for basketball, which starts with basic plus-minus but then uses linear regression to control for the quality of the player’s teammates and opponents. The methodology behind this has been public since Dan Rosenbaum wrote about his NBA version of adjusted plus-minus in 2004, but surprisingly, until earlier this year there had been no public examples of using regression to adjust plus-minus for hockey players (Vic Ferrari came the closest here). So I was excited to see Brian Macdonald’s paper “A Regression-based Adjusted Plus-Minus Statistic for NHL Players” become available in June. It hasn’t gotten much pub around the net (no mention on any of the hockey blogs as far as I can tell), but hopefully more people will become aware of it.
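To illustrate the regression setup in the abstract, here is a toy sketch (mine, not WINVAL, Rosenbaum’s model, or Macdonald’s, and the data are random numbers purely so the code runs): each row is a stretch of ice time, each skater gets a column that is +1 when he is on the ice for the team of interest, -1 when he is on the ice against it, and 0 otherwise, and the response is the goal differential rate over that stretch, weighted by ice time.

set.seed(1)

# Toy inputs, randomly generated just so the example runs end to end. In a real
# model each row would come from play-by-play data, with exactly five skaters
# at +1 and five at -1.
n_seg <- 500; n_players <- 12
X <- matrix(sample(c(-1, 0, 1), n_seg * n_players, replace = TRUE),
            nrow = n_seg, dimnames = list(NULL, paste0("skater", 1:n_players)))
toi <- runif(n_seg, 0.5, 2.0)    # minutes of ice time in each segment
goal_diff_per60 <- rnorm(n_seg)  # placeholder goal differential rate

# Weighted least squares: each skater's coefficient is his estimated effect on
# the goal differential rate after accounting for everyone else on the ice
fit <- lm(goal_diff_per60 ~ X, weights = toi)
round(coef(fit), 3)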

The Beginnings of Base Runs

September 26th, 2010

I recently came across the post from December of 1999 where David Smyth introduced his run estimator Base Runs. I also found archived versions of the summary he posted a few months later on James Fraser’s site, and a primer he posted on Fanhome several years later (follow-up discussion here).
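For readers who haven’t seen it, Base Runs in its most common form has the structure (the exact definitions of the components vary from version to version)

BsR = \dfrac{A \times B}{B + C} + D

where A counts baserunners, B estimates how well those runners are advanced, C counts outs, and D is home runs.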

For more on Base Runs I highly recommend articles by Patriot and Tangotiger.