Run Expectancy and Markov Chains
August 14, 2011Posted by in Databases,Markov models,R,Retrosheet,Run/Win Expectancy
Sorry for the long interval between entries – I hope to get back to posting on more regular basis. Continuing in the vein of my previous two posts, I’m still working my way towards baseball win expectancy, but I’m going to pause to examine run expectancy in a more detailed manner.
First, let’s look back at the run expectancy matrix from my last post. It was built by looking at each time a given base-out state occurred, and seeing how many runs were scored in the remainder of those innings (by utilizing the FATE_RUNS_CT field from Chadwick). I will refer to this as empirical run expectancy, as it is based on how many runs were actually scored following each base-out state.
Run Expectancy Matrix, Empirical | |||
BASES | 0 OUTS | 1 OUT | 2 OUTS |
___ | 0.539 | 0.289 | 0.111 |
1__ | 0.929 | 0.555 | 0.24 |
_2_ | 1.172 | 0.714 | 0.342 |
__3 | 1.444 | 0.984 | 0.373 |
12_ | 1.542 | 0.948 | 0.464 |
1_3 | 1.844 | 1.204 | 0.512 |
_23 | 2.047 | 1.438 | 0.604 |
123 | 2.381 | 1.62 | 0.798 |
There’s another way to calculate run expectancy that doesn’t look at what actually happened in the rest of the inning. It is based on solely on the base-out state transition probabilities (i.e. how frequently each base-out state transitions into each of the other base-out states). These probabilities were calculated in my previous post as they were needed to calculate base-out leverage index. We can create a Markov chain model based on these transition probabilities to simulate how many runs should score over the rest of the inning. In essence what the Markov chain does is play out all the possible sequences of base-out states over the rest of the inning (keeping track of the number of runs scored), using the base-out transition probabilities to weight each transition.
I’ll show how this can be done in R, following the method described in an excellent Mark Pankin article (with some modifications of my own). I’m not going to go through the details of the math behind the Markov model – for that see Pankin’s piece.
First, load in the base-out state transitions that we pulled from the Retrosheet database (I made a slight modification so that the columns are START_BASES, START_OUTS, END_BASES, END_OUTS, EVENT_RUNS, and PA – you can download the CSV here). Then we reshape this data into three matrices: trans.freq, the frequency of each transition; trans.runs, the runs scored on each transition; and trans.prob, the probability of each transition. Note that there are 32 states rather than 24, with the extra 8 representing the end-of-inning three-out states (one could group these by runs scored, as Pankin does, but I have chosen to leave them grouped by base state). All of the extra states are set to have a 100% probability of transitioning into the bases empty, three-out state, so this is where the Markov chain will always eventually end up.
bo.trans = read.csv("bo_transitions.csv", header = TRUE) # create header names start.bo = NULL end.bo = NULL bo = NULL bases = c('___','1__','_2_','12_','__3','1_3','_23','123') for (b in 0:7) { for (o in 0:3) { start.bo[b + 8*o + 1] = paste('s.b', bases[b+1], '.o', o, sep = '') end.bo[b + 8*o + 1] = paste('e.b', bases[b+1], '.o', o, sep = '') bo[b + 8*o + 1] = paste('b', bases[b+1], '.o', o, sep = '') } } # trans.freq - the frequency of each state transition # trans.runs - the runs scored on each state transition trans.freq = array(0, c(32,32), dimnames = list(start.bo, end.bo)) trans.runs = array(0, c(32,32), dimnames = list(start.bo, end.bo)) for (r in 1:nrow(bo.trans)) { row.num = bo.trans$START_BASES[r] + 8*bo.trans$START_OUTS[r] + 1 col.num = bo.trans$END_BASES[r] + 8*bo.trans$END_OUTS[r] + 1 trans.freq[row.num,col.num] = trans.freq[row.num,col.num] + bo.trans$PA[r] trans.runs[row.num,col.num] = trans.runs[row.num,col.num] + bo.trans$EVENT_RUNS[r] } # trans.prob - the probability of each state transition trans.prob = trans.freq for (r in 1:24) { trans.prob[r,] = trans.prob[r,]/sum(trans.prob[r,]) } for (r in 25:32) { trans.prob[r,25] = 1 } |
Before we get to the Markov chain, we can write a simple simulator to estimate the run expectancy for each base-out state. For each starting base-out state, we simulate a number of innings, using the base-out state transition probabilities to determine the progression of states in each inning (R’s ‘sample’ function makes this easy), and keeping track of the total runs scored in the inning.
# run expectancy, simulation method innings = 10000 re.simulation = array(0,c(24,1)) for (s in 1:24) { for (i in 1:innings) { s.state = s while (s.state != 25) { e.state = sample(1:ncol(trans.prob), size = 1, prob = trans.prob[s.state,]) re.simulation[s] = re.simulation[s] + as.numeric(trans.runs[s.state,e.state]) s.state = e.state } } re.simulation[s] = re.simulation[s]/innings } rownames(re.simulation) = bo[1:24] colnames(re.simulation)[1] = 're' |
Run Expectancy Matrix, Simulation | |||
BASES | 0 OUTS | 1 OUT | 2 OUTS |
___ | 0.539 | 0.284 | 0.113 |
1__ | 0.935 | 0.542 | 0.241 |
_2_ | 1.121 | 0.683 | 0.337 |
__3 | 1.398 | 0.967 | 0.366 |
12_ | 1.518 | 0.939 | 0.462 |
1_3 | 1.805 | 1.179 | 0.499 |
_23 | 2.032 | 1.381 | 0.614 |
123 | 2.388 | 1.643 | 0.774 |
Using a Markov chain model, we can calculate run expectancy directly, without having to simulate innings. First we need to calculate the expected runs scored from each base-out state on the next event only. This is easy to do using matrix multiplication.
# bo.event.re - the expected runs scored from each state on the next event only bo.event.re = array(0, c(32,1), dimnames = list(bo)) for (r in 1:nrow(trans.prob)) { bo.event.re[r,] = trans.prob[r,] %*% trans.runs[r,] } |
Now we can iteratively go through the inning batter by batter. In theory, the inning could go on forever before the third out is recorded, but we’ll cap things at 25 batters.
# run expectancy, iterative method re.iteration = bo.event.re trans.prob.power = diag(32) for (x in 1:25) { trans.prob.power = trans.prob.power %*% trans.prob re.iteration = re.iteration + trans.prob.power %*% bo.event.re } re.iteration = as.matrix(re.iteration[1:24]) rownames(re.iteration) = bo[1:24] colnames(re.iteration)[1] = 're' |
Or, more simply, we can use a closed-form formula to directly calculate run expectancy (note that here we are only using the typical 24 base-out states).
# run expectancy, formula method re.formula = solve(diag(24) - trans.prob[1:24,1:24]) %*% bo.event.re[1:24] rownames(re.formula) = bo[1:24] colnames(re.formula)[1] = 're' |
Run Expectancy Matrix, Iteration/Formula | |||
BASES | 0 OUTS | 1 OUT | 2 OUTS |
___ | 0.528 | 0.281 | 0.108 |
1__ | 0.914 | 0.543 | 0.233 |
_2_ | 1.150 | 0.696 | 0.335 |
__3 | 1.402 | 0.971 | 0.367 |
12_ | 1.525 | 0.938 | 0.450 |
1_3 | 1.809 | 1.192 | 0.507 |
_23 | 2.035 | 1.411 | 0.599 |
123 | 2.380 | 1.610 | 0.780 |
One can also use a Markov chain model to calculate the probability of scoring vs. not scoring over the rest of the inning from each base-out state. To do this the transition probabilities have to be modified by grouping all run-scoring transitions into a new state. The Markov chain can then calculate the probability of scoring in an inning as the probability of ever transitioning into this “runs scored” state.
# scoring probability, formula method score.prob = trans.prob score.prob = rbind(score.prob,0) score.prob = cbind(score.prob,0) rownames(score.prob)[33] = 'runs.scored' colnames(score.prob)[33] = 'runs.scored' for (r in 1:32) { for (c in 1:32) { if (trans.runs[r,c] > 0) { score.prob[r,c] = 0 } } score.prob[r,33] = 1 - sum(score.prob[r,]) } score.prob[33,33] = 1 score.prob = score.prob %*% score.prob %*% score.prob %*% score.prob %*% score.prob %*% score.prob score.prob = cbind(1-score.prob[1:24,33],score.prob[1:24,33]) colnames(score.prob) = c('0 runs','1+ runs') |
Scoring Probability, Formula | |||
BASES | OUTS | 0 RUNS | 1+ RUNS |
___ | 0 | 72% | 28% |
1__ | 0 | 58% | 42% |
_2_ | 0 | 38% | 62% |
__3 | 0 | 15% | 85% |
12_ | 0 | 37% | 63% |
1_3 | 0 | 13% | 87% |
_23 | 0 | 14% | 86% |
123 | 0 | 13% | 87% |
___ | 1 | 83% | 17% |
1__ | 1 | 73% | 27% |
_2_ | 1 | 60% | 40% |
__3 | 1 | 34% | 66% |
12_ | 1 | 58% | 42% |
1_3 | 1 | 35% | 65% |
_23 | 1 | 32% | 68% |
123 | 1 | 33% | 67% |
___ | 2 | 93% | 7% |
1__ | 2 | 87% | 13% |
_2_ | 2 | 78% | 22% |
__3 | 2 | 74% | 26% |
12_ | 2 | 77% | 23% |
1_3 | 2 | 72% | 28% |
_23 | 2 | 74% | 26% |
123 | 2 | 68% | 32% |
To check these values we could look at the empirical run distribution by base-out state, or we could expand the simple run expectancy simulator from above to also calculate the full run distribution. I’ll post the code for both of these alternatives. First, for the empirical run distribution we need to pull some slightly different data from the Retrosheet database. Here’s the MySQL code for the data (which can be downloaded directly here):
SELECT e.START_BASES_CD AS BASES , e.OUTS_CT AS OUTS , e.EVENT_RUNS_CT + e.FATE_RUNS_CT AS RUNS , COUNT(*) AS PA FROM retrosheet.events e , retrosheet.non_partial_non_home_half_ninth_plus_innings i WHERE e.GAME_ID = i.GAME_ID AND e.INN_CT = i.INN_CT AND e.BAT_HOME_ID = i.BAT_HOME_ID AND e.BAT_EVENT_FL = 'T' AND e.YEAR_ID >= 1993 AND e.YEAR_ID <= 2010 GROUP BY BASES , OUTS , RUNS ; |
Then we can import the data into R and build on it:
bo.run.dist = read.csv("bo_run_distribution.csv", header = TRUE) # re & run distribution, empirical method run.dist.freq = array(0,c(24,1+max(bo.run.dist$RUNS))) for (r in 1:nrow(bo.run.dist)) { row.num = bo.run.dist$BASES[r] + 8*bo.run.dist$OUTS[r] + 1 col.num = bo.run.dist$RUNS[r] + 1 run.dist.freq[row.num,col.num] = bo.run.dist$PA[r] } runs = NULL for (r in 1:ncol(run.dist.freq)) { runs[r] = paste(r-1, 'runs') } rownames(run.dist.freq) = bo[1:24] colnames(run.dist.freq) = runs run.dist.prob = run.dist.freq run.exp = array(0,c(24,1)) for (r in 1:nrow(run.dist.prob)) { run.dist.prob[r,] = run.dist.prob[r,]/sum(run.dist.prob[r,]) run.exp[r] = weighted.mean(0:(ncol(run.dist.freq)-1),run.dist.freq[r,]) } run.dist = cbind(run.exp,run.dist.prob) colnames(run.dist)[1] = 're' |
Run Distribution, Empirical | ||||||||||
RUNS | ||||||||||
BASES | OUTS | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
___ | 0 | 71% | 15% | 7% | 3% | 2% | 1% | 0% | 0% | 0% |
1__ | 0 | 57% | 17% | 13% | 7% | 3% | 2% | 1% | 0% | 0% |
_2_ | 0 | 37% | 35% | 14% | 8% | 4% | 2% | 1% | 0% | 0% |
__3 | 0 | 15% | 54% | 15% | 8% | 4% | 2% | 1% | 0% | 0% |
12_ | 0 | 36% | 22% | 16% | 13% | 7% | 3% | 1% | 1% | 0% |
1_3 | 0 | 13% | 43% | 16% | 14% | 7% | 4% | 2% | 1% | 0% |
_23 | 0 | 14% | 26% | 31% | 15% | 8% | 4% | 2% | 1% | 0% |
123 | 0 | 13% | 26% | 21% | 14% | 13% | 7% | 3% | 1% | 1% |
___ | 1 | 83% | 10% | 4% | 2% | 1% | 0% | 0% | 0% | 0% |
1__ | 1 | 72% | 11% | 9% | 4% | 2% | 1% | 0% | 0% | 0% |
_2_ | 1 | 59% | 23% | 10% | 5% | 2% | 1% | 0% | 0% | 0% |
__3 | 1 | 34% | 48% | 10% | 5% | 2% | 1% | 0% | 0% | 0% |
12_ | 1 | 58% | 16% | 11% | 9% | 4% | 2% | 1% | 0% | 0% |
1_3 | 1 | 35% | 37% | 12% | 9% | 4% | 2% | 1% | 0% | 0% |
_23 | 1 | 31% | 28% | 22% | 10% | 5% | 2% | 1% | 0% | 0% |
123 | 1 | 33% | 25% | 16% | 11% | 9% | 4% | 2% | 1% | 0% |
___ | 2 | 93% | 5% | 2% | 1% | 0% | 0% | 0% | 0% | 0% |
1__ | 2 | 87% | 6% | 5% | 2% | 1% | 0% | 0% | 0% | 0% |
_2_ | 2 | 77% | 15% | 5% | 2% | 1% | 0% | 0% | 0% | 0% |
__3 | 2 | 74% | 18% | 5% | 2% | 1% | 0% | 0% | 0% | 0% |
12_ | 2 | 76% | 11% | 6% | 5% | 2% | 0% | 0% | 0% | 0% |
1_3 | 2 | 72% | 15% | 6% | 5% | 2% | 1% | 0% | 0% | 0% |
_23 | 2 | 74% | 5% | 14% | 5% | 2% | 1% | 0% | 0% | 0% |
123 | 2 | 68% | 8% | 11% | 6% | 5% | 2% | 1% | 0% | 0% |
Alternately, here’s how to expand the run expectancy simulator to also give us the full run distribution:
# re & run distribution, simulation method innings = 10000 run.dist.simulation = array(0,c(24,1)) run.exp.simulation = array(0,c(24,1)) for (s in 1:24) { for (i in 1:innings) { s.state = s runs.inning = 0 while (s.state != 25) { e.state = sample(1:ncol(trans.prob), size = 1, prob = trans.prob[s.state,]) runs.inning = runs.inning + as.numeric(trans.runs[s.state,e.state]) s.state = e.state } if (runs.inning+1 > ncol(run.dist.simulation)) { for (l in ncol(run.dist.simulation):runs.inning) { run.dist.simulation = cbind(run.dist.simulation,array(0,c(24,1))) } } run.dist.simulation[s,runs.inning+1] = run.dist.simulation[s,runs.inning+1] + 1 } run.exp.simulation[s] = weighted.mean(0:(ncol(run.dist.simulation)-1),run.dist.simulation[s,]) } for (r in 1:24) { run.dist.simulation[r,] = run.dist.simulation[r,]/sum(run.dist.simulation[r,]) } runs = NULL for (r in 1:ncol(run.dist.simulation)) { runs[r] = paste(r-1, 'runs') } run.dist.simulation = cbind(run.exp.simulation,run.dist.simulation) rownames(run.dist.simulation) = bo[1:24] colnames(run.dist.simulation) = c('re',runs) |
Run Distribution, Simulation | ||||||||||
RUNS | ||||||||||
BASES | OUTS | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
___ | 0 | 72% | 14% | 7% | 3% | 2% | 1% | 0% | 0% | 0% |
1__ | 0 | 57% | 17% | 13% | 7% | 3% | 2% | 1% | 0% | 0% |
_2_ | 0 | 38% | 34% | 14% | 8% | 4% | 2% | 1% | 0% | 0% |
__3 | 0 | 15% | 56% | 14% | 8% | 4% | 2% | 1% | 0% | 0% |
12_ | 0 | 37% | 22% | 16% | 13% | 7% | 3% | 2% | 1% | 0% |
1_3 | 0 | 13% | 43% | 16% | 13% | 8% | 3% | 2% | 1% | 0% |
_23 | 0 | 14% | 27% | 30% | 14% | 8% | 4% | 2% | 1% | 0% |
123 | 0 | 13% | 27% | 21% | 14% | 14% | 6% | 3% | 1% | 1% |
___ | 1 | 83% | 10% | 4% | 2% | 1% | 0% | 0% | 0% | 0% |
1__ | 1 | 73% | 11% | 9% | 4% | 2% | 1% | 0% | 0% | 0% |
_2_ | 1 | 61% | 23% | 10% | 4% | 2% | 1% | 0% | 0% | 0% |
__3 | 1 | 34% | 49% | 9% | 5% | 2% | 1% | 0% | 0% | 0% |
12_ | 1 | 58% | 16% | 11% | 9% | 4% | 2% | 0% | 0% | 0% |
1_3 | 1 | 35% | 38% | 11% | 9% | 4% | 2% | 1% | 0% | 0% |
_23 | 1 | 31% | 28% | 22% | 10% | 5% | 2% | 1% | 0% | 0% |
123 | 1 | 33% | 25% | 16% | 10% | 9% | 4% | 2% | 1% | 0% |
___ | 2 | 93% | 5% | 2% | 1% | 0% | 0% | 0% | 0% | 0% |
1__ | 2 | 87% | 6% | 5% | 1% | 0% | 0% | 0% | 0% | 0% |
_2_ | 2 | 78% | 15% | 5% | 2% | 0% | 0% | 0% | 0% | 0% |
__3 | 2 | 74% | 19% | 4% | 2% | 1% | 0% | 0% | 0% | 0% |
12_ | 2 | 77% | 11% | 6% | 5% | 1% | 1% | 0% | 0% | 0% |
1_3 | 2 | 73% | 15% | 6% | 4% | 2% | 1% | 0% | 0% | 0% |
_23 | 2 | 73% | 5% | 15% | 5% | 2% | 1% | 0% | 0% | 0% |
123 | 2 | 68% | 8% | 11% | 6% | 4% | 1% | 1% | 0% | 0% |
So what’s the point of all these Markov chains and simulations if the results just mirror the empirical values? For one thing, they allow you to explore what would happen if some of the background conditions were to change. And sometimes they can fill in more stable values when the sample sizes of the empirical data are too small to be reliable. This isn’t an issue with run expectancy, but it can be with win expectancy.
Here’s all the R code from this post in one file (and here are the two CSV data files – bo_transitions.csv and bo_run_distribution.csv).
August 15th, 2011 at 11:49 am
Glad to see this series return! Could we see a follow-up article about simulating far-out run environments in the future?
August 15th, 2011 at 1:31 pm
Thanks, Geoff. Yeah, that’s something I might look at in the future.
September 12th, 2011 at 11:42 pm
Awesome stuff!
October 22nd, 2011 at 10:43 am
Great work
March 18th, 2013 at 9:26 pm
I have coffee with a man who uses the Markov chain theory in the 3 digit lottery here in Michigan. He win approximately 10 times, or more, a month.