An accurate simulator can be used to investigate the efficacy of
different tactical options such as batting orders and stolen bases,
assess the effects of errors and gain insight into how improbable a
particular season outcome is.
The discrete nature of the outcome of an at bat is the key to
creating Monte Carlo simulations of the game. At any point in the
game, one of several outcomes is possible. These are characterized by
their probability of occurring, and a random number is used to choose
one outcome based on the probabilities.
Three strong assumptions have guided the design and implementation of
the simulator.
First, no attempt is made to include strategy or tactics in the
program. All types of events included in the simulator take place at
their season average rates. Given the intense tactical game projected
by major league baseball, I feel this is a very strong
assumption.
Second, I assume there is no correlation between outcomes of
successive at bats. Sports commentators focus on clutch hitting (or
the lack of it), hitting streaks and other short term trends.
However, statistical support for the notion of better hitting with a
runner in scoring position is weak. There is significantly more
support for a home and away difference in team performance. Still,
relying on season averages seems to be the simplest and best
simulator implementation strategy.
Third, no attempt is made to model individual players. Hitting is
modeled at the level of detail of the batting order using season
averages at each position. Base running is accomplished using team
averages. Pitching is summarized by a single parameter, runs allowed
per game. If individual player statistics were to be used, the
simulator would have to implement a substitution strategy. This would
contradict the first assumption.
In the discussion of the simulator that follows, reference will be
made to a large number of detailed statistics characterizing each
team. These performance parameters are derived from the full season
play-by-play descriptions for the leagues and are prepared by a
parsing program.
Errors are implemented using defensive team averages. Two classes of
errors are used: those that allow the batter to advance to first and
those that allow runners on base to advance. Values used for an error
class are the sum of all categories that produce the same result. For
example, to get the number of runner advancing errors used by the
simulator, the official runner advancing class of errors is added to
the number of balks, wild pitches and passed balls that the team
made. The implementation rationale is that if the result is the
same, the event is not distinguished as a separate category.
Simulated games follow the flow of an actual game. The visiting team
bats first and continues until three outs are made. The home team
follows and does the same. If the home team is ahead after the middle
of the ninth inning, the game is won. When the score is tied after
nine innings, extra innings are played until the tie is broken. If
the home team wins in the ninth or in extra innings, only as much of
its last inning as is needed to win is played. Runs scored in the
home half of the last inning are credited according to the Major
League rules. Each team plays half its games as the home team and
half as the visiting team. While there is no home field advantage
programmed in the simulator, alternating in this way prevents biases
in the simulation statistics due to the rules for ending a game. A
complete set of games defined by actual season pairings defines a
simulated season.
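The game flow just described can be sketched in Python as follows. This is only an illustration under stated assumptions, not the simulator's code: play_game and the half_inning callback are hypothetical names, and the callback stands in for the entire plate appearance model. The callback is told how many runs would end the game so that the home half of the last inning can stop once the game is won.

```python
def play_game(half_inning):
    """Play one game and return (visitor runs, home runs, innings played).

    half_inning(side, runs_to_win) must return the runs scored in that half
    inning. side is "visitor" or "home". runs_to_win is None except when the
    home team bats in the ninth inning or later, where it is the number of
    runs that would win the game (hypothetical interface).
    """
    score = {"visitor": 0, "home": 0}
    inning = 0
    while True:
        inning += 1
        score["visitor"] += half_inning("visitor", None)
        last = inning >= 9
        # Home team ahead after the middle of the ninth (or later): game over.
        if last and score["home"] > score["visitor"]:
            break
        needed = score["visitor"] - score["home"] + 1 if last else None
        score["home"] += half_inning("home", needed)
        # After nine or more complete innings, play on only if tied.
        if last and score["home"] != score["visitor"]:
            break
    return score["visitor"], score["home"], inning
```

Passing the callback in keeps the end-of-game rules separate from the batting model, so the same loop can be exercised with scripted half innings.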
A plate appearance is modeled by first determining if a stolen base
is possible. If one is possible, attempts to steal third and second
are considered separately, in that order, and the result of each is
determined. The batting team season averages are used to
determine the outcome, which is no attempt, success or caught
stealing. The simulator can create double steals, with runners on first
and second stealing second and third on the same play. However, these
are artificially rare as the simulator does not generate a true
double steal. Caught stealing rates include pick off plays at the
starting base. Stealing home occurs so infrequently that it was not
programmed into the simulator. Runner advancing errors can occur on
stolen base attempts.
Following the stolen base evaluation, an at bat is simulated. First,
a check is made for a defensive error allowing the batter to proceed
to first. Error rates are based on the fielding team statistics. If
an error occurs, all runners on base also advance one base. If no
error occurs, which is the most likely case, the batting outcome is
simulated. Events that can occur are walks, singles, doubles,
triples, home runs and outs. A separate set of probabilities is
maintained for each batting order position. Since the probabilities
for these events must total 1.0, there is an implied "out" column
containing 1.0. Probabilities are based on all plate appearances, not
just official at bats, so they are only proportional to batting
averages and on base percentages. The table entries are cumulative:
the probability for an event is the difference between the value in
its column and the preceding column (0 for home runs). These
probabilities for a single team, displayed as an array, follow:
Position  Home Run  Triple  Double  Single  Walk
1:        0.0195    0.0255  0.0540  0.2099  0.2924
2:        0.0215    0.0246  0.0631  0.2277  0.3446
3:        0.0377    0.0425  0.0833  0.2358  0.3569
4:        0.0434    0.0450  0.0916  0.2428  0.3585
5:        0.0461    0.0510  0.0872  0.2155  0.3487
6:        0.0438    0.0489  0.1180  0.2580  0.3592
7:        0.0467    0.0536  0.0813  0.2388  0.3339
8:        0.0071    0.0160  0.0410  0.1925  0.2870
9:        0.0092    0.0129  0.0460  0.1526  0.2096
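To make the cumulative layout concrete, the short Python sketch below recovers the individual event probabilities for the first batting order position by differencing adjacent columns and appending the implied "out" column of 1.0. The function name event_probabilities is hypothetical and is used only for this illustration.

```python
EVENTS = ["home run", "triple", "double", "single", "walk", "out"]

def event_probabilities(cumulative_row):
    """Convert one cumulative table row (home run .. walk) to probabilities."""
    thresholds = list(cumulative_row) + [1.0]   # implied "out" column
    probs = {}
    previous = 0.0                              # preceding value for home runs
    for event, threshold in zip(EVENTS, thresholds):
        probs[event] = threshold - previous
        previous = threshold
    return probs

# Row for batting order position 1 in the table above.
leadoff = event_probabilities([0.0195, 0.0255, 0.0540, 0.2099, 0.2924])
for event, p in leadoff.items():
    print(f"{event:9s} {p:.4f}")
```

For this row a single has probability 0.2099 - 0.0540 = 0.1559 per plate appearance and an out has probability 1.0 - 0.2924 = 0.7076.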
The first step in determining an at bat result is to generate a number that can be compared to the hitting probability table. A value, S, that represents the effects of the opposing pitching and defense in general is computed:
In (1), Lra is the league average for runs allowed per game. Tra is
the same quantity for the defensive team. When the team and league
runs allowed per game are the same, S = 1. If Tra < Lra, that is, if
pitching and overall defense are better than the league average, then
S > 1, which decreases the probability of hits and walks.
Similarly, when Tra > Lra, S < 1 and the probability of getting
on base increases. The constant a determines the strength of this
change and is determined by minimizing the chi-square statistic for
team runs allowed evaluated for all the teams in a league. Separate
minimizations have been done for each season data set. While the
values determined for a differ slightly, the differences are
sufficiently small that a single value is used for all simulations:
a = 0.44. Also, total runs allowed are used, not the more commonly
referenced "earned runs". All runs count equally, earned or not, and
the intent of the simulation is to reproduce season results, not to
choose between pitchers. This method is entirely empirical and does
not purport to represent the actual hitter-pitcher interaction.
Its justification is the very small values of the runs allowed
chi-square statistic achieved. The value of S needs to be determined
for each team just once per game.
Given S, the quantity to be used to choose the at bat result is
computed:
R = S * randomf()
The function randomf() returns a pseudo random number in the range
0.0 to 1.0. The scaled random number, R, is compared to the array of
probabilities (Table 1) to determine the result of the at bat. If R
is greater than the appropriate value in the walk column, the at bat
result is an out. If it is not an out, the same R is compared to the
singles column. Again, if R is greater than this value, a walk is the
at bat result. Each entry is interpreted as the probability of the
particular event added to the probabilities of the previous events.
Continuing in sequence, if R is less than the value under the home run
column, a home run is the result. Testing proceeds from
more likely to less likely outcomes (triples are slightly out of
order) to minimize the number of tests needed. When S > 1, a larger
fraction of the 0-1 range of the RNG is greater than the on base
event thresholds, reducing the probability of getting on base. With
S < 1 the converse is true.
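The comparison sequence can be sketched in Python as shown below. Two things here are assumptions rather than the simulator's code: the function name at_bat_result, and the reading of "scaled random number" as the product of S and a uniform draw. The thresholds are the row for batting order position 1 in the probability table.

```python
import random

# Cumulative thresholds in table order: home run, triple, double, single, walk.
LEADOFF = [0.0195, 0.0255, 0.0540, 0.2099, 0.2924]

def at_bat_result(row, s, u):
    """Return the at bat result for a uniform draw u in [0, 1) and scale s."""
    r = s * u                         # scaled random number R (assumed form)
    hr, triple, double, single, walk = row
    # Test from the most likely outcome (an out) toward the least likely.
    if r > walk:
        return "out"
    if r > single:
        return "walk"
    if r > double:
        return "single"
    if r > triple:
        return "double"
    if r > hr:
        return "triple"
    return "home run"

# Fraction of plate appearances reaching base for a few values of S.
rng = random.Random(1)
draws = [rng.random() for _ in range(100_000)]
for s in (0.9, 1.0, 1.1):
    on_base = sum(at_bat_result(LEADOFF, s, u) != "out" for u in draws)
    print(s, on_base / len(draws))
```

Under this reading the fraction reaching base is roughly the walk column value divided by S, so it falls as S rises above 1 and grows as S drops below 1.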
For each possible at bat result, the appropriate base running is
done, some of it conditional. Home runs are the simplest at bat
outcome to process. The batter and all runners on base score. Triples
are almost as easy. All base runners score and the batter goes to
third. Doubles present a slightly more complicated situation. Runners
on second and third score unconditionally. The slight chance (Table 2)
for a runner on second not scoring is ignored. There are three
possibilities for a runner on first: score, go to third or be out
trying to score. The choice is made using a random number to select
one of the three possibilities. Singles are processed in the same
general way although there are more possibilities. The runner on third
scores. Base runners on first and second have probabilistic outcomes.
Table 2 tabulates the overall advance patterns for the 1995 American
League. Team values are used for these quantities in the simulator
except for improbable events, such as a runner on first scoring (1-h)
or being out (1-x) after a single, where the league average is used
for all teams. Read the headings as the starting base to the final
base on the play. An x indicates an out made while advancing from the
specified base and h indicates a score.
Errors may occur on any hit except the home run and advance all
runners one base. Error rates used are from defensive team statistics
and include all event types that can advance a base runner (fielding,
balk, wild pitch and passed ball).
lead runner advance on single
   1-2   1-3   1-h   1-x   2-2   2-3   2-h   2-x   3-3   3-h   3-x
  1951   951    23    36    19   687  1365    54    12  1629     1
runner on third, single, next runner advances
   1-2   1-3   1-h   1-x   2-2   2-3   2-h   2-x
  1008   525    22    33     8   232   460    18
lead runner advance on double
   1-3   1-h   1-x   2-3   2-h   2-x   3-h
   497   321    28    10   636     1   463
lead runner advance on double play
   2-2   2-3   3-3   3-h
     5   153    10    82
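As an illustration of how such counts can drive a base running choice, the Python sketch below selects the fate of a runner on first when a double is hit, using the league totals from Table 2 (1-3: 497, 1-h: 321, 1-x: 28). The simulator itself uses team values for most of these quantities. The helper name pick_outcome is hypothetical.

```python
import random

# 1995 American League totals from Table 2: runner on first, batter doubles.
RUNNER_ON_FIRST_DOUBLE = {"to third": 497, "scores": 321, "out": 28}

def pick_outcome(counts, u):
    """Choose a key of counts with probability proportional to its count.

    u is a uniform draw in [0, 1).
    """
    total = sum(counts.values())
    cumulative = 0
    for outcome, n in counts.items():
        cumulative += n
        if u * total < cumulative:
            return outcome
    return outcome  # fallback for a draw at the very top of the range

rng = random.Random(7)
print(pick_outcome(RUNNER_ON_FIRST_DOUBLE, rng.random()))
```

With these totals the runner stops at third about 59% of the time (497 of 846), scores about 38% of the time and is out about 3% of the time.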
An at bat yielding an out presents the greatest number of base running possibilities. The simulator does not generate different kinds of outs such as fly outs, ground outs or strikeouts. The type of out is implicit in the base running choices made following the out. Force outs can occur at any base. For example, if there are runners on first and second, the runner at second is out. Double plays are possible if there are fewer than two outs in the inning. The only type of double play included in the simulator is the common ground ball double play, with the runner on first out at second and the batter out at first. If the third out was not made, base runner advances on outs are possible. These include sacrifice hits (advancing a runner a single base; both first to second and second to third are possible) and sacrifice flies (scoring a runner). Either of these can occur when the lead runner has not made an out. All of these possibilities are conditional, with rates determined by individual team averages. The rates used include all runner advances of the specified kind, not just the officially tabulated sacrifices.
Establishing an event probability requires two quantities, the
number of times the event happened and the number of chances there
were for the particular event to occur. In some cases, all possible
outcomes can be counted. A runner on second following a single has
just four possibilities: stay on second, advance to third, score, or
be out trying to advance. Many other event possibilities require
evaluating particular runner configurations to determine if the event
can take place. A stolen second base requires that a runner be on
first and none on second. Hitting rates require the number of plate
appearances. While the latter quantity can be derived from the event
files, the former is not so easily available. The intent of the
simulator is to reproduce the numbers of these events. If the
chances arising in the simulator differ from the actual season
chances, rates determined entirely from actual season results would
produce different numbers of events. Therefore, to establish the
category chances in the simulator, including plate appearances, the
following iterative process has been used. The simulator identifies
the particular runner configurations corresponding to these events
and counts them. This is done by team. These counts become the basis
for the event probabilities in the next iteration.
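The idea of counting chances by runner configuration can be sketched as follows. The function names are hypothetical and the list of base states is made up purely for illustration. It is not taken from any season file.

```python
def steal_second_possible(bases):
    """bases is (first, second, third) occupancy as booleans."""
    first, second, _third = bases
    return first and not second

def event_rate(events, chances):
    """Rate = times the event happened / times it could have happened."""
    return events / chances if chances else 0.0

# Made-up base states seen before five plate appearances (illustration only).
states = [
    (True, False, False),
    (True, True, False),
    (False, False, False),
    (True, False, True),
    (False, True, False),
]
chances = sum(steal_second_possible(b) for b in states)
print(chances, event_rate(1, chances))
```

Only two of the five states allow a steal of second, so a single steal in this toy sample corresponds to a rate of 0.5 per chance rather than 0.2 per plate appearance.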
Statistics for the simulation are accumulated at the season and
multiple season level. Adequately determining distributions requires
accumulating data from multiple seasons. Typically, 100 - 10000
seasons are simulated depending on the analysis being done.
The random number generator (RNG) is a key component of the
simulation. The one used is random() from the GNU software libraries
which provides a 31 bit mantissa. It was carefully evaluated using
the frequency, two and three number serial tests and run tests
described by Knuth (Knuth, D. E. "The Art of Computer Programming,
Vol. 2: Seminumerical Algorithms", Addison-Wesley 1969). It
satisfactorily passes all these tests. The two and three number serial
tests correspond most closely to the usage of the RNG in the
simulator. A further test was to compare the simulation results using
random() with a second RNG, drand48() also from the GNU libraries.
This RNG uses a different algorithm and provides a number with a 48
bit mantissa. No differences besides expected statistical variations
were seen in spite of the much larger mantissa. The routine random()
is used because it is significantly faster.
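For readers unfamiliar with these tests, a frequency (equidistribution) test can be sketched in a few lines of Python. This is a simplified illustration only: it uses Python's built-in generator rather than random() or drand48(), the bin count is an arbitrary choice, and the function name frequency_chi_square is hypothetical.

```python
import random

def frequency_chi_square(draws, bins=10):
    """Chi-square statistic for uniformity of draws in [0, 1) over equal bins."""
    counts = [0] * bins
    for u in draws:
        counts[min(int(u * bins), bins - 1)] += 1
    expected = len(draws) / bins
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    return statistic, counts

rng = random.Random(12345)
statistic, counts = frequency_chi_square([rng.random() for _ in range(100_000)])
print(f"chi-square = {statistic:.2f} with {len(counts) - 1} degrees of freedom")
```

A statistic far out in either tail of the chi-square distribution for 9 degrees of freedom would be grounds for suspicion. The serial tests apply the same idea to pairs and triples of successive draws.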
Debugging a simulator is notoriously difficult. Distinguishing
between relatively rare events and program bugs is challenging. The
final series of tests performed was the generation of event files in
the Retrosheet, Inc. format from single season simulations. These
files were then processed by the events file analysis program. Lack
of error messages from the
consistency checking done during event file processing gives
considerable additional confidence in the simulator.
1996 American League East Division
%wins\pl 1 2 3 4 5 wins std min max act a-s
BAL 0.542: 462 336 157 45 0 87.8 6.0 68 105 88 0.2
NYA 0.539: 409 351 183 57 0 87.3 6.3 65 111 92 4.7
BOS 0.498: 97 210 397 296 0 80.7 6.2 59 100 85 4.3
TOR 0.476: 32 103 263 602 0 77.1 6.4 52 103 74 -3.1
DET 0.289: 0 0 0 0 1000 46.8 5.6 28 65 53 6.2
1996 American League Central Division
%wins\pl 1 2 3 4 5 wins std min max act a-s
CLE 0.606: 639 324 32 5 0 97.6 6.3 74 120 99 1.4
CHA 0.581: 351 564 74 10 1 94.2 6.2 73 113 85 -9.2
MIL 0.485: 5 56 370 322 247 78.6 6.6 58 100 80 1.4
MIN 0.479: 3 40 297 343 317 77.7 6.5 55 97 78 0.3
KCA 0.468: 2 16 227 320 435 75.3 6.3 52 95 75 -0.3
1996 American League West Division
%wins\pl 1 2 3 4 wins std min max act a-s
TEX 0.595: 825 160 15 0 96.4 6.1 80 116 90 -6.4
SEA 0.545: 163 649 179 9 87.7 6.3 67 110 85 -2.7
OAK 0.485: 12 182 711 95 78.5 6.2 62 99 78 -0.5
CAL 0.412: 0 9 95 896 66.3 6.1 45 86 70 3.7
team wins chisq 3.102 prob 0.997
Overall results from applying the simulator to the 1996
American League season are given in Table 4. Teams are indicated by a
three letter code representing their home city. %wins is the fraction
of simulated games won. The next five or four columns, depending on
the division, give the number of times a team finished in that
particular place in its division. "wins" is the average number of
wins during a season. "std" is the standard deviation of the wins and
is followed by the minimum and maximum number of wins during the
simulation. The column labeled "act" is the number of wins during the
actual season. Finally, "a-s" is the difference between actual and
averaged simulated wins. The sign of the result was chosen to be
positive when the actual season results had more wins than the
simulation average.
The last line in the table is the chi-square of the actual and
simulated wins using simulated wins as the expected number.
Probability is computed on the basis of 13 degrees of freedom. Other
seasons yield similar results.
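The chi-square on the last line of the table can be checked from the "wins" and "act" columns. The Python sketch below does this with the rounded values printed in Table 4, so it should only be expected to agree with the tabulated 3.102 to within rounding.

```python
# (simulated mean wins, actual wins) from Table 4, 1996 American League.
WINS = {
    "BAL": (87.8, 88), "NYA": (87.3, 92), "BOS": (80.7, 85),
    "TOR": (77.1, 74), "DET": (46.8, 53), "CLE": (97.6, 99),
    "CHA": (94.2, 85), "MIL": (78.6, 80), "MIN": (77.7, 78),
    "KCA": (75.3, 75), "TEX": (96.4, 90), "SEA": (87.7, 85),
    "OAK": (78.5, 78), "CAL": (66.3, 70),
}

# Simulated wins serve as the expected values.
chisq = sum((act - sim) ** 2 / sim for sim, act in WINS.values())
print(f"chi-square = {chisq:.2f} for {len(WINS) - 1} degrees of freedom")
```

The largest single contributions come from CHA and DET, the two teams whose actual wins differ most from the simulation average relative to its size.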
Table 5 presents simulated and actual season distributions of three
run related quantities: season team runs scored, team runs allowed
and runners left on base. The column gms is the number of games
played during the season. Simulated results are the average of 1000
seasons. The "std" columns are the standard deviations of the
simulated results. Three chi-square values and resulting probabilities are also
given. The low chi-square value for the runs allowed category
reflects the minimization procedure outlined in the discussion of the
hitting model. In this case the probabilities are evaluated for 13
degrees of freedom. Simulations for the other three seasons produce
comparable results.
1996              runs scored                runs allowed             left on base
team  gms     in    sim  sm-in    std     in    sim  sm-in    std     in    sim  sm-in
BAL   162    949    942     -7   43.7    903    867    -36   40.8   1154   1175     21
BOS   162    928    913    -15   42.8    921    912     -9   42.3   1251   1273     22
CAL   161    762    790     28   38.7    943    945      2   44.7   1209   1169    -40
CHA   162    898    926     28   42.6    794    780    -14   39.0   1231   1252     21
CLE   161    952    961      9   44.4    769    768     -1   39.3   1224   1238     14
DET   162    783    735    -48   38.4   1103   1150     47   47.7   1040   1084     44
KCA   161    746    730    -16   36.4    786    784     -2   37.6   1117   1166     49
MIL   162    894    868    -26   41.8    899    895     -4   42.1   1198   1244     46
MIN   162    877    848    -29   41.3    900    882    -18   41.5   1194   1228     34
NYA   162    871    860    -11   42.1    787    788      1   38.4   1258   1277     19
OAK   162    861    859     -2   39.9    900    883    -17   41.0   1175   1209     34
SEA   161    993    959    -34   42.4    895    873    -22   43.7   1238   1277     39
TEX   162    928    957     29   43.6    799    783    -16   37.5   1253   1260      7
TOR   162    766    768      2   39.3    809    808     -1   39.6   1169   1185     16
----
tot  2264  12208  12116    -92  159.3  12208  12116    -92  159.3  16711  17037    326
sim. runs scored chi-sq 9.81e+00 prob 0.709
sim. runs allowed chi-sq 5.39e+00 prob 0.966
sim. left on base chi-sq 1.18e+01 prob 0.546
The remaining distributions evaluated as tests on the simulator
are runs per game, runs per inning and game length in innings (Table
6).
Runs per game distribution
team 0 1 2 3 4 5 6 7 8 9 >=10
sim 76 160 232 274 288 273 239 197 154 115 258
act 79 162 248 280 289 260 237 166 142 116 85
x-sq 0.2 0.0 1.2 0.1 0.0 0.6 0.0 4.9 0.9 0.0
for league chisq: 23.7 prob: 0.127 (17 DOF)
Runs per inning distribution
0 1 2 3 4 5 6 7 8 9 >=10
sim 13870 3396 1614 789 369 163 71 30 13 5 4
act 14015 3211 1592 778 411 172 92 34 12 2 2
x-sq 1.5 10.1 0.3 0.1 4.7 0.5 6.3 0.6 0.1
for league chisq: 24.2 prob: 0.004 (9 DOF)
Game length in innings distribution
9 10 11 12 13 14 15 16 17 >=18
sim 2065 99 49 25 13 6 3 2 1 1
act 2040 120 34 38 14 4 8 2 0 0
x-sq 0.3 4.3 4.7 6.5 0.1
for league chisq: 15.9 prob: 0.007 (5 DOF)
Table 6 tabulates league totals, not individual team results.
Simulated (sim), actual season (act) and individual chi-square (x-sq)
contributions are given for each of the three distributions. The
chi-square is only tabulated where there are more than 5 runs or
innings, depending on the distribution, in the particular histogram
bin. The total degrees of freedom used in the probability calculation
are also given. The simulated values are the averages of a 1000
season simulation.
Using the simulator to investigate batting order and the effects of
varying team parameters such as the relative value of stolen bases
and runners caught stealing requires sufficient season simulations to
produce a standard error about 5 times smaller than the smallest
change in season wins considered significant. Since the standard
error is the standard deviation divided by the square root of the
number of simulated seasons, it takes 1000 simulated seasons to
produce a standard error of 0.2 wins allowing win differences of 1 to
be considered significant. Ten thousand season simulations yield a
minimum significant difference of 0.3 wins per season.
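The arithmetic behind these figures can be reproduced with a short Python sketch. The standard deviation of 6.3 wins is an assumption taken as typical of the "std" column in Table 4, and the function names are hypothetical.

```python
import math

def standard_error(std_wins, seasons):
    """Standard error of the mean season wins over a number of seasons."""
    return std_wins / math.sqrt(seasons)

def min_significant_difference(std_wins, seasons, factor=5.0):
    """Smallest win difference treated as significant (factor x std. error)."""
    return factor * standard_error(std_wins, seasons)

STD_WINS = 6.3  # assumed: typical single season standard deviation of wins
for seasons in (100, 1000, 10000):
    se = standard_error(STD_WINS, seasons)
    diff = min_significant_difference(STD_WINS, seasons)
    print(f"{seasons:6d} seasons: std. error {se:.2f}, min. difference {diff:.2f}")
```

Because the standard error shrinks only with the square root of the number of seasons, each tenfold increase in simulated seasons reduces the minimum significant difference by a factor of about 3.2.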
The simulator was designed and coded for time performance. Typically,
a 14 team 162 game season simulation requires 217000 random number
evaluations and takes approximately 0.6 seconds on a Power Mac
6100/66 system. Even with this relatively quick simulation time, many
hours of computation are needed to provide repeatable results for
changes in strategy that might produce a 1 run per season difference.
The implementation language is C++. The simulator software compiles
and executes on the Macintosh as well as GNU and Silicon Graphics,
Inc. UNIX systems.