

Run Expectancy and Markov Chains

August 14, 2011
Posted in Databases, Markov models, R, Retrosheet, Run/Win Expectancy

Sorry for the long interval between entries – I hope to get back to posting on a more regular basis. Continuing in the vein of my previous two posts, I'm still working my way towards baseball win expectancy, but I'm going to pause to examine run expectancy in more detail.

First, let’s look back at the run expectancy matrix from my last post. It was built by looking at each time a given base-out state occurred, and seeing how many runs were scored in the remainder of those innings (by utilizing the FATE_RUNS_CT field from Chadwick). I will refer to this as empirical run expectancy, as it is based on how many runs were actually scored following each base-out state.

Run Expectancy Matrix, Empirical
BASES 0 OUTS 1 OUT 2 OUTS
___ 0.539 0.289 0.111
1__ 0.929 0.555 0.240
_2_ 1.172 0.714 0.342
__3 1.444 0.984 0.373
12_ 1.542 0.948 0.464
1_3 1.844 1.204 0.512
_23 2.047 1.438 0.604
123 2.381 1.620 0.798

There's another way to calculate run expectancy that doesn't look at what actually happened in the rest of the inning. It is based solely on the base-out state transition probabilities (i.e. how frequently each base-out state transitions into each of the other base-out states). These probabilities were calculated in my previous post as they were needed to calculate base-out leverage index. We can create a Markov chain model based on these transition probabilities to simulate how many runs should score over the rest of the inning. In essence, what the Markov chain does is play out all the possible sequences of base-out states over the rest of the inning (keeping track of the number of runs scored), using the base-out transition probabilities to weight each transition.
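
Formally, if we write P(s, s') for the probability that base-out state s moves to state s' on a single event, and R(s, s') for the runs that score on that transition, then run expectancy satisfies the recursion RE(s) = sum over s' of P(s, s') * [R(s, s') + RE(s')], with RE fixed at zero for the three-out states. The empirical matrix estimates RE directly from outcomes; the methods below are just different ways of solving this system of equations using only the transition data.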

I’ll show how this can be done in R, following the method described in an excellent Mark Pankin article (with some modifications of my own). I’m not going to go through the details of the math behind the Markov model – for that see Pankin’s piece.

First, load in the base-out state transitions that we pulled from the Retrosheet database (I made a slight modification so that the columns are START_BASES, START_OUTS, END_BASES, END_OUTS, EVENT_RUNS, and PA – you can download the CSV here). Then we reshape this data into three matrices: trans.freq, the frequency of each transition; trans.runs, the runs scored on each transition; and trans.prob, the probability of each transition. Note that there are 32 states rather than 24, with the extra 8 representing the end-of-inning three-out states (one could group these by runs scored, as Pankin does, but I have chosen to leave them grouped by base state). All of the extra states are set to have a 100% probability of transitioning into the bases empty, three-out state, so this is where the Markov chain will always eventually end up.

bo.trans = read.csv("bo_transitions.csv", header = TRUE)
 
# create header names
start.bo = NULL
end.bo = NULL
bo = NULL
bases = c('___','1__','_2_','12_','__3','1_3','_23','123')
for (b in 0:7) {
  for (o in 0:3) {
    start.bo[b + 8*o + 1] = paste('s.b', bases[b+1], '.o', o, sep = '')
    end.bo[b + 8*o + 1] = paste('e.b', bases[b+1], '.o', o, sep = '')
    bo[b + 8*o + 1] = paste('b', bases[b+1], '.o', o, sep = '')
  }
}
 
# trans.freq - the frequency of each state transition
# trans.runs - the runs scored on each state transition
trans.freq = array(0, c(32,32), dimnames = list(start.bo, end.bo))
trans.runs = array(0, c(32,32), dimnames = list(start.bo, end.bo))
for (r in 1:nrow(bo.trans)) {
  row.num = bo.trans$START_BASES[r] + 8*bo.trans$START_OUTS[r] + 1
  col.num = bo.trans$END_BASES[r] + 8*bo.trans$END_OUTS[r] + 1
  trans.freq[row.num,col.num] = trans.freq[row.num,col.num] + bo.trans$PA[r]
  trans.runs[row.num,col.num] = trans.runs[row.num,col.num] + bo.trans$EVENT_RUNS[r]
}
 
# trans.prob - the probability of each state transition
trans.prob = trans.freq
for (r in 1:24) {
  trans.prob[r,] = trans.prob[r,]/sum(trans.prob[r,])
}
for (r in 25:32) {
  trans.prob[r,25] = 1
}
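
A couple of quick, optional sanity checks on the result: every row of trans.prob should now sum to one, and the bases-empty, three-out state (row 25) should be absorbing.

# optional sanity checks on the transition matrix
stopifnot(all(abs(rowSums(trans.prob) - 1) < 1e-9))  # every row is a probability distribution
stopifnot(trans.prob[25, 25] == 1)                   # bases empty, 3 outs is absorbing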

Before we get to the Markov chain, we can write a simple simulator to estimate the run expectancy for each base-out state. For each starting base-out state, we simulate a large number of innings (10,000 per state in the code below), using the base-out state transition probabilities to determine the progression of states in each inning (R's 'sample' function makes this easy), and keeping track of the total runs scored in each inning.

# run expectancy, simulation method
innings = 10000
re.simulation = array(0,c(24,1))
for (s in 1:24) {
  for (i in 1:innings) {
    s.state = s
    while (s.state != 25) {
      e.state = sample(1:ncol(trans.prob), size = 1, prob = trans.prob[s.state,])
      re.simulation[s] = re.simulation[s] + as.numeric(trans.runs[s.state,e.state])
      s.state = e.state
    }
  }
  re.simulation[s] = re.simulation[s]/innings
}
rownames(re.simulation) = bo[1:24]
colnames(re.simulation)[1] = 're'
Run Expectancy Matrix, Simulation
BASES 0 OUTS 1 OUT 2 OUTS
___ 0.539 0.284 0.113
1__ 0.935 0.542 0.241
_2_ 1.121 0.683 0.337
__3 1.398 0.967 0.366
12_ 1.518 0.939 0.462
1_3 1.805 1.179 0.499
_23 2.032 1.381 0.614
123 2.388 1.643 0.774

Using a Markov chain model, we can calculate run expectancy directly, without having to simulate innings. First we need to calculate the expected runs scored from each base-out state on the next event only. This is easy to do using matrix multiplication.

# bo.event.re - the expected runs scored from each state on the next event only
bo.event.re = array(0, c(32,1), dimnames = list(bo))
for (r in 1:nrow(trans.prob)) {
  bo.event.re[r,] = trans.prob[r,] %*% trans.runs[r,]
}

Now we can iteratively go through the inning batter by batter. In theory, the inning could go on forever before the third out is recorded, but we’ll cap things at 25 batters.
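
In terms of the matrices we already have, this iteration accumulates r + P r + P^2 r + ... + P^25 r, where P is trans.prob and r is bo.event.re: the expected runs on the current event, plus the expected runs one event later, and so on. By the 25th power, the probability that an inning is still alive is vanishingly small, so truncating the series costs essentially nothing.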

# run expectancy, iterative method
re.iteration = bo.event.re
trans.prob.power = diag(32)
for (x in 1:25) {
  trans.prob.power = trans.prob.power %*% trans.prob
  re.iteration = re.iteration + trans.prob.power %*% bo.event.re
}
re.iteration = as.matrix(re.iteration[1:24])
rownames(re.iteration) = bo[1:24]
colnames(re.iteration)[1] = 're'

Or, more simply, we can use a closed-form formula to directly calculate run expectancy (note that here we are only using the typical 24 base-out states).
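
Letting the series above run to infinity gives the standard absorbing-chain result RE = (I - Q)^-1 r, where Q is the 24x24 block of transition probabilities among the non-three-out states, r is the vector of expected runs on the next event, and I is the identity matrix. R's solve() function takes care of the matrix inversion.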

# run expectancy, formula method
re.formula = solve(diag(24) - trans.prob[1:24,1:24]) %*% bo.event.re[1:24]
rownames(re.formula) = bo[1:24]
colnames(re.formula)[1] = 're'
Run Expectancy Matrix, Iteration/Formula
BASES 0 OUTS 1 OUT 2 OUTS
___ 0.528 0.281 0.108
1__ 0.914 0.543 0.233
_2_ 1.150 0.696 0.335
__3 1.402 0.971 0.367
12_ 1.525 0.938 0.450
1_3 1.809 1.192 0.507
_23 2.035 1.411 0.599
123 2.380 1.610 0.780
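
If you want to verify that capping the iteration at 25 events loses nothing meaningful, you can compare the two answers directly; the difference should be far below the precision shown in the table.

# optional check: the truncated iteration and the closed-form solution should agree closely
max(abs(re.formula - re.iteration))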

One can also use a Markov chain model to calculate the probability of scoring vs. not scoring over the rest of the inning from each base-out state. To do this the transition probabilities have to be modified by grouping all run-scoring transitions into a new state. The Markov chain can then calculate the probability of scoring in an inning as the probability of ever transitioning into this “runs scored” state.

# scoring probability, formula method
score.prob = trans.prob
score.prob = rbind(score.prob,0)
score.prob = cbind(score.prob,0)
rownames(score.prob)[33] = 'runs.scored'
colnames(score.prob)[33] = 'runs.scored'
for (r in 1:32) {
  for (c in 1:32) {
    if (trans.runs[r,c] > 0) {
      score.prob[r,c] = 0
    }
  }
  score.prob[r,33] = 1 - sum(score.prob[r,])
}
score.prob[33,33] = 1
score.prob = score.prob %*% score.prob %*% score.prob %*% score.prob %*% score.prob %*% score.prob
score.prob = cbind(1-score.prob[1:24,33],score.prob[1:24,33])
colnames(score.prob) = c('0 runs','1+ runs')
Scoring Probability, Formula
BASES OUTS 0 RUNS 1+ RUNS
___ 0 72% 28%
1__ 0 58% 42%
_2_ 0 38% 62%
__3 0 15% 85%
12_ 0 37% 63%
1_3 0 13% 87%
_23 0 14% 86%
123 0 13% 87%
___ 1 83% 17%
1__ 1 73% 27%
_2_ 1 60% 40%
__3 1 34% 66%
12_ 1 58% 42%
1_3 1 35% 65%
_23 1 32% 68%
123 1 33% 67%
___ 2 93% 7%
1__ 2 87% 13%
_2_ 2 78% 22%
__3 2 74% 26%
12_ 2 77% 23%
1_3 2 72% 28%
_23 2 74% 26%
123 2 68% 32%
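
One caveat: the six-fold matrix product is an approximation, since it only captures scoring within the next six plate appearances. The approximation error is small, but if you want exact absorption probabilities you can reuse the absorbing-chain algebra from the run expectancy formula: zero out the run-scoring transitions among the 24 transient states, collect their probability mass into a one-step scoring vector b, and solve (I - Q) x = b. Here's a sketch:

# scoring probability, closed-form sketch
Q.noscore = trans.prob[1:24, 1:24]
b.score = rep(0, 24)
for (r in 1:24) {
  scoring = trans.runs[r, ] > 0               # transitions on which at least one run scores
  b.score[r] = sum(trans.prob[r, scoring])    # probability of scoring on the very next event
  Q.noscore[r, scoring[1:24]] = 0             # remove those paths from the transient block
}
score.prob.exact = solve(diag(24) - Q.noscore) %*% b.score
score.prob.exact = cbind(1 - score.prob.exact, score.prob.exact)
rownames(score.prob.exact) = bo[1:24]
colnames(score.prob.exact) = c('0 runs', '1+ runs')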

To check these values we could look at the empirical run distribution by base-out state, or we could expand the simple run expectancy simulator from above to also calculate the full run distribution. I’ll post the code for both of these alternatives. First, for the empirical run distribution we need to pull some slightly different data from the Retrosheet database. Here’s the MySQL code for the data (which can be downloaded directly here):

SELECT
        e.START_BASES_CD AS BASES
        , e.OUTS_CT AS OUTS
        , e.EVENT_RUNS_CT + e.FATE_RUNS_CT AS RUNS
        , COUNT(*) AS PA
FROM 
        retrosheet.events e
        , retrosheet.non_partial_non_home_half_ninth_plus_innings i
WHERE
        e.GAME_ID = i.GAME_ID
        AND e.INN_CT = i.INN_CT
        AND e.BAT_HOME_ID = i.BAT_HOME_ID
        AND e.BAT_EVENT_FL = 'T'
        AND e.YEAR_ID >= 1993
        AND e.YEAR_ID <= 2010
GROUP BY
        BASES
        , OUTS
        , RUNS
;

Then we can import the data into R and build on it:

bo.run.dist = read.csv("bo_run_distribution.csv", header = TRUE)
 
# re & run distribution, empirical method
run.dist.freq = array(0,c(24,1+max(bo.run.dist$RUNS)))
for (r in 1:nrow(bo.run.dist)) {
  row.num = bo.run.dist$BASES[r] + 8*bo.run.dist$OUTS[r] + 1
  col.num = bo.run.dist$RUNS[r] + 1
  run.dist.freq[row.num,col.num] = bo.run.dist$PA[r]
}
runs = NULL
for (r in 1:ncol(run.dist.freq)) {
  runs[r] = paste(r-1, 'runs')
}
rownames(run.dist.freq) = bo[1:24]
colnames(run.dist.freq) = runs
run.dist.prob = run.dist.freq
run.exp = array(0,c(24,1))
for (r in 1:nrow(run.dist.prob)) {
  run.dist.prob[r,] = run.dist.prob[r,]/sum(run.dist.prob[r,])
  run.exp[r] = weighted.mean(0:(ncol(run.dist.freq)-1),run.dist.freq[r,])
}
run.dist = cbind(run.exp,run.dist.prob)
colnames(run.dist)[1] = 're'
Run Distribution, Empirical
RUNS
BASES OUTS 0 1 2 3 4 5 6 7 8
___ 0 71% 15% 7% 3% 2% 1% 0% 0% 0%
1__ 0 57% 17% 13% 7% 3% 2% 1% 0% 0%
_2_ 0 37% 35% 14% 8% 4% 2% 1% 0% 0%
__3 0 15% 54% 15% 8% 4% 2% 1% 0% 0%
12_ 0 36% 22% 16% 13% 7% 3% 1% 1% 0%
1_3 0 13% 43% 16% 14% 7% 4% 2% 1% 0%
_23 0 14% 26% 31% 15% 8% 4% 2% 1% 0%
123 0 13% 26% 21% 14% 13% 7% 3% 1% 1%
___ 1 83% 10% 4% 2% 1% 0% 0% 0% 0%
1__ 1 72% 11% 9% 4% 2% 1% 0% 0% 0%
_2_ 1 59% 23% 10% 5% 2% 1% 0% 0% 0%
__3 1 34% 48% 10% 5% 2% 1% 0% 0% 0%
12_ 1 58% 16% 11% 9% 4% 2% 1% 0% 0%
1_3 1 35% 37% 12% 9% 4% 2% 1% 0% 0%
_23 1 31% 28% 22% 10% 5% 2% 1% 0% 0%
123 1 33% 25% 16% 11% 9% 4% 2% 1% 0%
___ 2 93% 5% 2% 1% 0% 0% 0% 0% 0%
1__ 2 87% 6% 5% 2% 1% 0% 0% 0% 0%
_2_ 2 77% 15% 5% 2% 1% 0% 0% 0% 0%
__3 2 74% 18% 5% 2% 1% 0% 0% 0% 0%
12_ 2 76% 11% 6% 5% 2% 0% 0% 0% 0%
1_3 2 72% 15% 6% 5% 2% 1% 0% 0% 0%
_23 2 74% 5% 14% 5% 2% 1% 0% 0% 0%
123 2 68% 8% 11% 6% 5% 2% 1% 0% 0%
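
As a quick cross-check, the 0-runs column of this empirical distribution should line up closely with the 0 runs column from the Markov scoring-probability table earlier:

# cross-check: empirical chance of a scoreless remainder vs. the Markov-chain estimate
round(cbind(empirical = run.dist.prob[, 1], markov = score.prob[, 1]), 3)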

Alternatively, here's how to expand the run expectancy simulator to also give us the full run distribution:

# re & run distribution, simulation method
innings = 10000
run.dist.simulation = array(0,c(24,1))
run.exp.simulation = array(0,c(24,1))
for (s in 1:24) {
  for (i in 1:innings) {
    s.state = s
    runs.inning = 0
    while (s.state != 25) {
      e.state = sample(1:ncol(trans.prob), size = 1, prob = trans.prob[s.state,])
      runs.inning = runs.inning + as.numeric(trans.runs[s.state,e.state])
      s.state = e.state
    }
    if (runs.inning+1 > ncol(run.dist.simulation)) {
      for (l in ncol(run.dist.simulation):runs.inning) {
        run.dist.simulation = cbind(run.dist.simulation,array(0,c(24,1)))
      }
    }
    run.dist.simulation[s,runs.inning+1] = run.dist.simulation[s,runs.inning+1] + 1
  }
  run.exp.simulation[s] = weighted.mean(0:(ncol(run.dist.simulation)-1),run.dist.simulation[s,])
}
for (r in 1:24) {
  run.dist.simulation[r,] = run.dist.simulation[r,]/sum(run.dist.simulation[r,])
}
runs = NULL
for (r in 1:ncol(run.dist.simulation)) {
  runs[r] = paste(r-1, 'runs')
}
run.dist.simulation = cbind(run.exp.simulation,run.dist.simulation)
rownames(run.dist.simulation) = bo[1:24]
colnames(run.dist.simulation) = c('re',runs)
Run Distribution, Simulation
RUNS
BASES OUTS 0 1 2 3 4 5 6 7 8
___ 0 72% 14% 7% 3% 2% 1% 0% 0% 0%
1__ 0 57% 17% 13% 7% 3% 2% 1% 0% 0%
_2_ 0 38% 34% 14% 8% 4% 2% 1% 0% 0%
__3 0 15% 56% 14% 8% 4% 2% 1% 0% 0%
12_ 0 37% 22% 16% 13% 7% 3% 2% 1% 0%
1_3 0 13% 43% 16% 13% 8% 3% 2% 1% 0%
_23 0 14% 27% 30% 14% 8% 4% 2% 1% 0%
123 0 13% 27% 21% 14% 14% 6% 3% 1% 1%
___ 1 83% 10% 4% 2% 1% 0% 0% 0% 0%
1__ 1 73% 11% 9% 4% 2% 1% 0% 0% 0%
_2_ 1 61% 23% 10% 4% 2% 1% 0% 0% 0%
__3 1 34% 49% 9% 5% 2% 1% 0% 0% 0%
12_ 1 58% 16% 11% 9% 4% 2% 0% 0% 0%
1_3 1 35% 38% 11% 9% 4% 2% 1% 0% 0%
_23 1 31% 28% 22% 10% 5% 2% 1% 0% 0%
123 1 33% 25% 16% 10% 9% 4% 2% 1% 0%
___ 2 93% 5% 2% 1% 0% 0% 0% 0% 0%
1__ 2 87% 6% 5% 1% 0% 0% 0% 0% 0%
_2_ 2 78% 15% 5% 2% 0% 0% 0% 0% 0%
__3 2 74% 19% 4% 2% 1% 0% 0% 0% 0%
12_ 2 77% 11% 6% 5% 1% 1% 0% 0% 0%
1_3 2 73% 15% 6% 4% 2% 1% 0% 0% 0%
_23 2 73% 5% 15% 5% 2% 1% 0% 0% 0%
123 2 68% 8% 11% 6% 4% 1% 1% 0% 0%

So what’s the point of all these Markov chains and simulations if the results just mirror the empirical values? For one thing, they allow you to explore what would happen if some of the background conditions were to change. And sometimes they can fill in more stable values when the sample sizes of the empirical data are too small to be reliable. This isn’t an issue with run expectancy, but it can be with win expectancy.
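
Here's a crude illustration of the first point, purely as a toy sketch: bump up every state's probability of an out that leaves the runners where they are by 10%, renormalize each row, and rerun the formula method. Every cell of the resulting run expectancy matrix should drop a bit, quantifying what a slightly more out-prone (think strikeout-heavy) environment would do to scoring.

# toy what-if: a 10% bump to the "same bases, one more out" transitions
trans.prob.whatif = trans.prob
for (s in 1:24) {
  same.bases.one.more.out = s + 8   # same base state, one additional out
  trans.prob.whatif[s, same.bases.one.more.out] = 1.1 * trans.prob.whatif[s, same.bases.one.more.out]
  trans.prob.whatif[s, ] = trans.prob.whatif[s, ] / sum(trans.prob.whatif[s, ])
}
bo.event.re.whatif = array(0, c(24, 1))
for (s in 1:24) {
  bo.event.re.whatif[s] = trans.prob.whatif[s, ] %*% trans.runs[s, ]
}
re.whatif = solve(diag(24) - trans.prob.whatif[1:24, 1:24]) %*% bo.event.re.whatif
rownames(re.whatif) = bo[1:24]
colnames(re.whatif) = 're'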

Here’s all the R code from this post in one file (and here are the two CSV data files – bo_transitions.csv and bo_run_distribution.csv).

5 Responses to “Run Expectancy and Markov Chains”

  1. Geoff Says:

    Glad to see this series return! Could we see a follow-up article about simulating far-out run environments in the future?

  2. Sobchak Says:

    Thanks, Geoff. Yeah, that’s something I might look at in the future.

  3. Burton Guster Says:

    Awesome stuff!

  4. Andrew Says:

    Great work

  5. Geno Says:

    I have coffee with a man who uses Markov chain theory in the 3 digit lottery here in Michigan. He wins approximately 10 times, or more, a month.
