Run Expectancy and Markov Chains

August 14, 2011
Posted by in Databases,Markov models,R,Retrosheet,Run/Win Expectancy

Sorry for the long interval between entries – I hope to get back to posting on more regular basis. Continuing in the vein of my previous two posts, I’m still working my way towards baseball win expectancy, but I’m going to pause to examine run expectancy in a more detailed manner.

First, let’s look back at the run expectancy matrix from my last post. It was built by looking at each time a given base-out state occurred, and seeing how many runs were scored in the remainder of those innings (by utilizing the FATE_RUNS_CT field from Chadwick). I will refer to this as empirical run expectancy, as it is based on how many runs were actually scored following each base-out state.

Run Expectancy Matrix, Empirical
BASES	0 OUTS	1 OUT	2 OUTS
___	0.539	0.289	0.111
1__	0.929	0.555	0.24
_2_	1.172	0.714	0.342
__3	1.444	0.984	0.373
12_	1.542	0.948	0.464
1_3	1.844	1.204	0.512
_23	2.047	1.438	0.604
123	2.381	1.62	0.798

There’s another way to calculate run expectancy that doesn’t look at what actually happened in the rest of the inning. It is based on solely on the base-out state transition probabilities (i.e. how frequently each base-out state transitions into each of the other base-out states). These probabilities were calculated in my previous post as they were needed to calculate base-out leverage index. We can create a Markov chain model based on these transition probabilities to simulate how many runs should score over the rest of the inning. In essence what the Markov chain does is play out all the possible sequences of base-out states over the rest of the inning (keeping track of the number of runs scored), using the base-out transition probabilities to weight each transition.

I’ll show how this can be done in R, following the method described in an excellent Mark Pankin article (with some modifications of my own). I’m not going to go through the details of the math behind the Markov model – for that see Pankin’s piece.

First, load in the base-out state transitions that we pulled from the Retrosheet database (I made a slight modification so that the columns are START_BASES, START_OUTS, END_BASES, END_OUTS, EVENT_RUNS, and PA – you can download the CSV here). Then we reshape this data into three matrices: trans.freq, the frequency of each transition; trans.runs, the runs scored on each transition; and trans.prob, the probability of each transition. Note that there are 32 states rather than 24, with the extra 8 representing the end-of-inning three-out states (one could group these by runs scored, as Pankin does, but I have chosen to leave them grouped by base state). All of the extra states are set to have a 100% probability of transitioning into the bases empty, three-out state, so this is where the Markov chain will always eventually end up.

^?View Code RSPLUS

bo.trans = read.csv("bo_transitions.csv", header = TRUE)
 
# create header names
start.bo = NULL
end.bo = NULL
bo = NULL
bases = c('___','1__','_2_','12_','__3','1_3','_23','123')
for (b in 0:7) {
  for (o in 0:3) {
    start.bo[b + 8*o + 1] = paste('s.b', bases[b+1], '.o', o, sep = '')
    end.bo[b + 8*o + 1] = paste('e.b', bases[b+1], '.o', o, sep = '')
    bo[b + 8*o + 1] = paste('b', bases[b+1], '.o', o, sep = '')
  }
}
 
# trans.freq - the frequency of each state transition
# trans.runs - the runs scored on each state transition
trans.freq = array(0, c(32,32), dimnames = list(start.bo, end.bo))
trans.runs = array(0, c(32,32), dimnames = list(start.bo, end.bo))
for (r in 1:nrow(bo.trans)) {
  row.num = bo.trans$START_BASES[r] + 8*bo.trans$START_OUTS[r] + 1
  col.num = bo.trans$END_BASES[r] + 8*bo.trans$END_OUTS[r] + 1
  trans.freq[row.num,col.num] = trans.freq[row.num,col.num] + bo.trans$PA[r]
  trans.runs[row.num,col.num] = trans.runs[row.num,col.num] + bo.trans$EVENT_RUNS[r]
}
 
# trans.prob - the probability of each state transition
trans.prob = trans.freq
for (r in 1:24) {
  trans.prob[r,] = trans.prob[r,]/sum(trans.prob[r,])
}
for (r in 25:32) {
  trans.prob[r,25] = 1
}

Before we get to the Markov chain, we can write a simple simulator to estimate the run expectancy for each base-out state. For each starting base-out state, we simulate a number of innings, using the base-out state transition probabilities to determine the progression of states in each inning (R’s ‘sample’ function makes this easy), and keeping track of the total runs scored in the inning.

^?View Code RSPLUS

# run expectancy, simulation method
innings = 10000
re.simulation = array(0,c(24,1))
for (s in 1:24) {
  for (i in 1:innings) {
    s.state = s
    while (s.state != 25) {
      e.state = sample(1:ncol(trans.prob), size = 1, prob = trans.prob[s.state,])
      re.simulation[s] = re.simulation[s] + as.numeric(trans.runs[s.state,e.state])
      s.state = e.state
    }
  }
  re.simulation[s] = re.simulation[s]/innings
}
rownames(re.simulation) = bo[1:24]
colnames(re.simulation)[1] = 're'

Run Expectancy Matrix, Simulation
BASES	0 OUTS	1 OUT	2 OUTS
___	0.539	0.284	0.113
1__	0.935	0.542	0.241
_2_	1.121	0.683	0.337
__3	1.398	0.967	0.366
12_	1.518	0.939	0.462
1_3	1.805	1.179	0.499
_23	2.032	1.381	0.614
123	2.388	1.643	0.774

Using a Markov chain model, we can calculate run expectancy directly, without having to simulate innings. First we need to calculate the expected runs scored from each base-out state on the next event only. This is easy to do using matrix multiplication.

^?View Code RSPLUS

# bo.event.re - the expected runs scored from each state on the next event only
bo.event.re = array(0, c(32,1), dimnames = list(bo))
for (r in 1:nrow(trans.prob)) {
  bo.event.re[r,] = trans.prob[r,] %*% trans.runs[r,]
}

Now we can iteratively go through the inning batter by batter. In theory, the inning could go on forever before the third out is recorded, but we’ll cap things at 25 batters.

^?View Code RSPLUS

# run expectancy, iterative method
re.iteration = bo.event.re
trans.prob.power = diag(32)
for (x in 1:25) {
  trans.prob.power = trans.prob.power %*% trans.prob
  re.iteration = re.iteration + trans.prob.power %*% bo.event.re
}
re.iteration = as.matrix(re.iteration[1:24])
rownames(re.iteration) = bo[1:24]
colnames(re.iteration)[1] = 're'

Or, more simply, we can use a closed-form formula to directly calculate run expectancy (note that here we are only using the typical 24 base-out states).

^?View Code RSPLUS

# run expectancy, formula method
re.formula = solve(diag(24) - trans.prob[1:24,1:24]) %*% bo.event.re[1:24]
rownames(re.formula) = bo[1:24]
colnames(re.formula)[1] = 're'

Run Expectancy Matrix, Iteration/Formula
BASES	0 OUTS	1 OUT	2 OUTS
___	0.528	0.281	0.108
1__	0.914	0.543	0.233
_2_	1.150	0.696	0.335
__3	1.402	0.971	0.367
12_	1.525	0.938	0.450
1_3	1.809	1.192	0.507
_23	2.035	1.411	0.599
123	2.380	1.610	0.780

One can also use a Markov chain model to calculate the probability of scoring vs. not scoring over the rest of the inning from each base-out state. To do this the transition probabilities have to be modified by grouping all run-scoring transitions into a new state. The Markov chain can then calculate the probability of scoring in an inning as the probability of ever transitioning into this “runs scored” state.

^?View Code RSPLUS

# scoring probability, formula method
score.prob = trans.prob
score.prob = rbind(score.prob,0)
score.prob = cbind(score.prob,0)
rownames(score.prob)[33] = 'runs.scored'
colnames(score.prob)[33] = 'runs.scored'
for (r in 1:32) {
  for (c in 1:32) {
    if (trans.runs[r,c] > 0) {
      score.prob[r,c] = 0
    }
  }
  score.prob[r,33] = 1 - sum(score.prob[r,])
}
score.prob[33,33] = 1
score.prob = score.prob %*% score.prob %*% score.prob %*% score.prob %*% score.prob %*% score.prob
score.prob = cbind(1-score.prob[1:24,33],score.prob[1:24,33])
colnames(score.prob) = c('0 runs','1+ runs')

Scoring Probability, Formula
BASES	OUTS	0 RUNS	1+ RUNS
___	0	72%	28%
1__	0	58%	42%
_2_	0	38%	62%
__3	0	15%	85%
12_	0	37%	63%
1_3	0	13%	87%
_23	0	14%	86%
123	0	13%	87%
___	1	83%	17%
1__	1	73%	27%
_2_	1	60%	40%
__3	1	34%	66%
12_	1	58%	42%
1_3	1	35%	65%
_23	1	32%	68%
123	1	33%	67%
___	2	93%	7%
1__	2	87%	13%
_2_	2	78%	22%
__3	2	74%	26%
12_	2	77%	23%
1_3	2	72%	28%
_23	2	74%	26%
123	2	68%	32%

To check these values we could look at the empirical run distribution by base-out state, or we could expand the simple run expectancy simulator from above to also calculate the full run distribution. I’ll post the code for both of these alternatives. First, for the empirical run distribution we need to pull some slightly different data from the Retrosheet database. Here’s the MySQL code for the data (which can be downloaded directly here):

^?View Code MYSQL

SELECT
        e.START_BASES_CD AS BASES
        , e.OUTS_CT AS OUTS
        , e.EVENT_RUNS_CT + e.FATE_RUNS_CT AS RUNS
        , COUNT(*) AS PA
FROM 
        retrosheet.events e
        , retrosheet.non_partial_non_home_half_ninth_plus_innings i
WHERE
        e.GAME_ID = i.GAME_ID
        AND e.INN_CT = i.INN_CT
        AND e.BAT_HOME_ID = i.BAT_HOME_ID
        AND e.BAT_EVENT_FL = 'T'
        AND e.YEAR_ID >= 1993
        AND e.YEAR_ID <= 2010
GROUP BY
        BASES
        , OUTS
        , RUNS
;

Then we can import the data into R and build on it:

^?View Code RSPLUS

bo.run.dist = read.csv("bo_run_distribution.csv", header = TRUE)
 
# re & run distribution, empirical method
run.dist.freq = array(0,c(24,1+max(bo.run.dist$RUNS)))
for (r in 1:nrow(bo.run.dist)) {
  row.num = bo.run.dist$BASES[r] + 8*bo.run.dist$OUTS[r] + 1
  col.num = bo.run.dist$RUNS[r] + 1
  run.dist.freq[row.num,col.num] = bo.run.dist$PA[r]
}
runs = NULL
for (r in 1:ncol(run.dist.freq)) {
  runs[r] = paste(r-1, 'runs')
}
rownames(run.dist.freq) = bo[1:24]
colnames(run.dist.freq) = runs
run.dist.prob = run.dist.freq
run.exp = array(0,c(24,1))
for (r in 1:nrow(run.dist.prob)) {
  run.dist.prob[r,] = run.dist.prob[r,]/sum(run.dist.prob[r,])
  run.exp[r] = weighted.mean(0:(ncol(run.dist.freq)-1),run.dist.freq[r,])
}
run.dist = cbind(run.exp,run.dist.prob)
colnames(run.dist)[1] = 're'

Run Distribution, Empirical
		RUNS
BASES	OUTS	0	1	2	3	4	5	6	7	8
___	0	71%	15%	7%	3%	2%	1%	0%	0%	0%
1__	0	57%	17%	13%	7%	3%	2%	1%	0%	0%
_2_	0	37%	35%	14%	8%	4%	2%	1%	0%	0%
__3	0	15%	54%	15%	8%	4%	2%	1%	0%	0%
12_	0	36%	22%	16%	13%	7%	3%	1%	1%	0%
1_3	0	13%	43%	16%	14%	7%	4%	2%	1%	0%
_23	0	14%	26%	31%	15%	8%	4%	2%	1%	0%
123	0	13%	26%	21%	14%	13%	7%	3%	1%	1%
___	1	83%	10%	4%	2%	1%	0%	0%	0%	0%
1__	1	72%	11%	9%	4%	2%	1%	0%	0%	0%
_2_	1	59%	23%	10%	5%	2%	1%	0%	0%	0%
__3	1	34%	48%	10%	5%	2%	1%	0%	0%	0%
12_	1	58%	16%	11%	9%	4%	2%	1%	0%	0%
1_3	1	35%	37%	12%	9%	4%	2%	1%	0%	0%
_23	1	31%	28%	22%	10%	5%	2%	1%	0%	0%
123	1	33%	25%	16%	11%	9%	4%	2%	1%	0%
___	2	93%	5%	2%	1%	0%	0%	0%	0%	0%
1__	2	87%	6%	5%	2%	1%	0%	0%	0%	0%
_2_	2	77%	15%	5%	2%	1%	0%	0%	0%	0%
__3	2	74%	18%	5%	2%	1%	0%	0%	0%	0%
12_	2	76%	11%	6%	5%	2%	0%	0%	0%	0%
1_3	2	72%	15%	6%	5%	2%	1%	0%	0%	0%
_23	2	74%	5%	14%	5%	2%	1%	0%	0%	0%
123	2	68%	8%	11%	6%	5%	2%	1%	0%	0%

Alternately, here’s how to expand the run expectancy simulator to also give us the full run distribution:

^?View Code RSPLUS

# re & run distribution, simulation method
innings = 10000
run.dist.simulation = array(0,c(24,1))
run.exp.simulation = array(0,c(24,1))
for (s in 1:24) {
  for (i in 1:innings) {
    s.state = s
    runs.inning = 0
    while (s.state != 25) {
      e.state = sample(1:ncol(trans.prob), size = 1, prob = trans.prob[s.state,])
      runs.inning = runs.inning + as.numeric(trans.runs[s.state,e.state])
      s.state = e.state
    }
    if (runs.inning+1 > ncol(run.dist.simulation)) {
      for (l in ncol(run.dist.simulation):runs.inning) {
        run.dist.simulation = cbind(run.dist.simulation,array(0,c(24,1)))
      }
    }
    run.dist.simulation[s,runs.inning+1] = run.dist.simulation[s,runs.inning+1] + 1
  }
  run.exp.simulation[s] = weighted.mean(0:(ncol(run.dist.simulation)-1),run.dist.simulation[s,])
}
for (r in 1:24) {
  run.dist.simulation[r,] = run.dist.simulation[r,]/sum(run.dist.simulation[r,])
}
runs = NULL
for (r in 1:ncol(run.dist.simulation)) {
  runs[r] = paste(r-1, 'runs')
}
run.dist.simulation = cbind(run.exp.simulation,run.dist.simulation)
rownames(run.dist.simulation) = bo[1:24]
colnames(run.dist.simulation) = c('re',runs)

Run Distribution, Simulation
		RUNS
BASES	OUTS	0	1	2	3	4	5	6	7	8
___	0	72%	14%	7%	3%	2%	1%	0%	0%	0%
1__	0	57%	17%	13%	7%	3%	2%	1%	0%	0%
_2_	0	38%	34%	14%	8%	4%	2%	1%	0%	0%
__3	0	15%	56%	14%	8%	4%	2%	1%	0%	0%
12_	0	37%	22%	16%	13%	7%	3%	2%	1%	0%
1_3	0	13%	43%	16%	13%	8%	3%	2%	1%	0%
_23	0	14%	27%	30%	14%	8%	4%	2%	1%	0%
123	0	13%	27%	21%	14%	14%	6%	3%	1%	1%
___	1	83%	10%	4%	2%	1%	0%	0%	0%	0%
1__	1	73%	11%	9%	4%	2%	1%	0%	0%	0%
_2_	1	61%	23%	10%	4%	2%	1%	0%	0%	0%
__3	1	34%	49%	9%	5%	2%	1%	0%	0%	0%
12_	1	58%	16%	11%	9%	4%	2%	0%	0%	0%
1_3	1	35%	38%	11%	9%	4%	2%	1%	0%	0%
_23	1	31%	28%	22%	10%	5%	2%	1%	0%	0%
123	1	33%	25%	16%	10%	9%	4%	2%	1%	0%
___	2	93%	5%	2%	1%	0%	0%	0%	0%	0%
1__	2	87%	6%	5%	1%	0%	0%	0%	0%	0%
_2_	2	78%	15%	5%	2%	0%	0%	0%	0%	0%
__3	2	74%	19%	4%	2%	1%	0%	0%	0%	0%
12_	2	77%	11%	6%	5%	1%	1%	0%	0%	0%
1_3	2	73%	15%	6%	4%	2%	1%	0%	0%	0%
_23	2	73%	5%	15%	5%	2%	1%	0%	0%	0%
123	2	68%	8%	11%	6%	4%	1%	1%	0%	0%

So what’s the point of all these Markov chains and simulations if the results just mirror the empirical values? For one thing, they allow you to explore what would happen if some of the background conditions were to change. And sometimes they can fill in more stable values when the sample sizes of the empirical data are too small to be reliable. This isn’t an issue with run expectancy, but it can be with win expectancy.

Here’s all the R code from this post in one file (and here are the two CSV data files – bo_transitions.csv and bo_run_distribution.csv).

5 Responses to “Run Expectancy and Markov Chains”

Geoff Says:
August 15th, 2011 at 11:49 am
Glad to see this series return! Could we see a follow-up article about simulating far-out run environments in the future?
Sobchak Says:
August 15th, 2011 at 1:31 pm
Thanks, Geoff. Yeah, that’s something I might look at in the future.
Burton Guster Says:
September 12th, 2011 at 11:42 pm
Awesome stuff!
Andrew Says:
October 22nd, 2011 at 10:43 am
Great work
Geno Says:
March 18th, 2013 at 9:26 pm
I have coffee with a man who uses the Markov chain theory in the 3 digit lottery here in Michigan. He win approximately 10 times, or more, a month.

chances is…

Run Expectancy and Markov Chains

5 Responses to “Run Expectancy and Markov Chains”

Leave a Reply

sports blogs

stat blogs

posts by month

posts by category

feeds, etc.

search