A simple runs per game (RPG) formula.

Carl Morris
Harvard University, Dept Statistics
October 2002 (version v2, revisions to continue)

We prove:

RPG = runs/game = 27 * OBA/(1-OBA) - 9 * LOB,

with LOB = L1 * P(B > 0) + L2 * P(B > 1) + L3 * P(B > 2) = left on base per inning, B = baserunners.

The runs per game (RPG) formula here determines the average number of runs scored by a team of identical players in a 9 inning game. As written above, RPG is 9 times the difference between the expected number of batters that reach base in an inning and the expected number of an inning's baserunners who are left on base (LOB). The Markovian assumptions required for this formula include identical and constant batting ability for all 9 players, independence, and, to keep the formula simple, no baserunning outs or advances on outs.

RPG is a mathematically exact and provable expression derived from the structure of baseball. It therefore applies to Major League Baseball (MLB), to Little League baseball, and even to softball games. It applies to batting against a great pitcher when runs are scarce, and it applies to games at Coors Field, where runs abound. It will apply on the moon if they ever play baseball there.

Calculating a baseball team's expected runs is not a new idea. Markovian models for baseball have been used for that at least since Lindsey (1959). Matrix calculations within this theory were used by Cover and Keilers (1977) to compute their Offensive Earned Run Average (OERA). OERA is the expected number of runs that would be scored by a team of 9 batters who all have the batting statistics of a specified player. "Earned" is part of OERA because bases on error are omitted. Others, e.g. M. Pankin and J. Sagarin, have published statistics similar to the OERA by using Markov models and Markov theory, but their procedures and formulas have not been made public. Some analysts have used simulation methods to estimate the average number of runs.
Various explicit approximations to the runs Major League (ML) teams score in average play are available, apparently obtained via data analytic methods, some via linear regression. The best known of these may be Bill James' Runs Created Average (RCA).

Analyses of Markov models can be made with less restrictive baserunning assumptions than considered here, but they require a fast computer either to enable large sample simulations or to solve 24 linear equations for 24 unknowns. No elementary formula is possible for general analyses. The RPG formula here is a Markov solution for simplified assumptions chosen to permit explicit and straightforward formulas. It is proved here via elementary probability methods that can be understood by talented high school math students. Besides the Markov assumptions that all batters have the same ability and that batting events are independent, we assume for the RPG calculation that baserunners don't make outs, or steal, or advance on outs or on wild pitches, and aren't eliminated by double plays, etc. The need for these assumptions will be clear in the proofs that follow.

Among its various uses, the RPG formula here provides an explicit way to calculate Cover and Keilers' OERA. In order to distinguish that application of RPG from the RPG formula itself, I will denote it as RPG9 when used to make the Cover-Keilers calculation of expected runs when one player's statistics correspond to batters in all 9 positions.

I first presented this RPG formula in a 1980 talk at an American Statistical Association meeting in Houston and have reintroduced it occasionally since, most recently and most specially at a 1998 Stanford conference honoring Professor Tom Cover (of Cover-Keilers). However, this is the first time it has been made widely available. The urgency for publishing it now is to use it to evaluate the amazing batting performance of Barry Bonds in 2002.
RPG9 for Barry Bonds in 2002 was the highest ever, an average of 22.53 runs per game. This is noticeably better than his RPG9 of 17.15 achieved with the help of his 73 homers in 2001. It is the all-time record, eclipsing past bests of 18.91 by Ted Williams (1941) and 18.47 by Babe Ruth (1923). As much as Bonds' performance in 2002 has been praised, the baseball world seems not to have realized that from the perspective of 9 identical batters, Bonds' 2002 performance is the best ever. Sportswriter Alan Schwarz helped change that with his articles on the ESPN website on Sept 25, 2002, and in the New York Times (Oct 6, 2002).

Of course great players can't actually play with 8 others who perform at their extraordinary levels -- not even in All-Star games (partly because they then must face All-Star pitchers). When on base averages (OBA) are unusually high, walks and short hits increase in value and home runs become less important (because high-scoring innings depend relatively more on getting many men on base and less on long hits). Bonds' 2002 OBA of .582 broke Ted Williams' all-time record, and he reached base with walks more than with hits. Of course pitchers wouldn't walk Bonds as often if another Bonds waited on deck, but then they would have to give Bonds many more pitches to hit. No one knows, if pitchers could not pitch around Bonds, whether each of the 9 Bonds would exceed 100 homers per 162 game season and bat over .400. We do know, though, that Bonds averaged 87 homers per 486 outs (486 is 1/9 of a team's annual 27*162 outs) in 2002, and 110 homers per 486 outs in 2001. For comparison, Alex Rodriguez, the 2002 ML leader with 57 homers, averaged 63 homers per 486 outs, and Jim Thome placed second to Bonds in 2002 with 74 homers per 486 outs.

The number of runs that a team of 9 identical players would score is not a perfect measure of batting prowess, but perhaps it isn't too imperfect.
An RPG9 calculation shows that a team with a .582 OBA (Bonds' 2002 OBA) whose on base successes consist entirely of walks (that occur at random times) would average 16.2 runs/game. An RPG9 of 16.2 would rank among a small number (a dozen or so?) of the greatest player/season RPG9 values ever. A hypothetical player who walks 58.2% of the time (and makes outs otherwise) would be valuable to a team, but less so when batting 1/9 of the time with 8 average batters. Walks always have been insufficiently appreciated in baseball. As shown here by another RPG calculation, called "RPG(1/9)", a "58.2% walker" would add about .43 runs per game to an average team. That .43 is about what today's biggest offensive stars produce for their teams.

The RPG9 calculation of the Cover-Keilers idea is only one interesting use of the RPG formula. We will deal with other RPG uses elsewhere, but let us now prove the RPG theorem.

Notation and results:

Let w, s, d, t, and h be the conditional probabilities, given that a batter reaches base successfully, of reaching base via walks (walks include HP = hit by pitch when HP is available), singles, doubles, triples, and homers. Then w + s + d + t + h = 1. Otherwise the batter is out and we let z = probability of an out. Thus z = 1 - OBA, and the on base average is

OBA = (hits + walks)/(AB + walks + SF), where SF = sacrifice flies.

Example. Bonds in 2002 had 403 AB (official at bats), 149 hits, 31 doubles, 2 triples, 46 home runs, 198 walks (his record-breaking 68 intentional walks are included with his ordinary walks), 9 HP, and 2 SF. Thus Bonds had PA = 403 + 198 + 9 + 2 = 612 plate appearances. We include SFs in PA mainly because MLB does. Bonds' OBA = (149 + 198 + 9)/PA = 356/612 = .582 broke the all-time record. Since he made 256 outs = AB - Hits + SF = 403 - 149 + 2, Bonds' "on base odds" was OBO = on base per out = 356/256 = 1.391. This means Bonds reached base 37.55 times per 27 outs (the ML average is about 13.5 per 27 outs).
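The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the calculation described above; the variable names are illustrative and are not part of the original paper.

```python
# Bonds' 2002 batting line, as quoted in the example above.
ab, hits, walks, hp, sf = 403, 149, 198, 9, 2

on_base = hits + walks + hp      # times on base (HP counted with walks): 356
pa = ab + walks + hp + sf        # plate appearances, SF included: 612
outs = ab - hits + sf            # batting outs: 256

oba = on_base / pa               # on base average
obo = on_base / outs             # on base odds = OBA / (1 - OBA)
per_27_outs = 27 * obo           # times on base per 27 outs

print(round(oba, 3), round(obo, 3), round(per_27_outs, 2))
# prints 0.582 1.391 37.55
```

Note that OBO computed from counts, 356/256, agrees with OBA/(1 - OBA), because the 612 plate appearances split exactly into 356 times on base and 256 outs.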
Because at most 27 runners could be left on base in 9 innings, Bonds' RPG9 must exceed 10.5 (= 37.5 - 27) runs per game. Then w = (198 + 9)/356 = .581, s = (149 - 31 - 2 - 46)/356 = .197, d = 31/356 = .087, t = 2/356 = .006, and h = 46/356 = .129. These 5 values add to 1.000. The probability of a Bonds out was z = 1 - OBA = .418.

Let B = the (random) number of baserunners in an inning. The probabilities that 0, 1, and 2 runners reach base in an inning are denoted p0, p1, and p2. Assuming independence, B follows a Negative Binomial distribution (the sum of 3 independent Geometric distributions), denoted B ~ NB(3, OBA). The mean (average number of runners per inning) is well-known to be:

E(B) = 3 * OBA / (1 - OBA) = 3 * OBO.

Then,

p0 = P(B=0) = z^3,
p1 = P(B=1) = 3 * OBA * z^3,
p2 = P(B=2) = 6 * OBA^2 * z^3.

We show this for p2, the hardest case, and leave p0 and p1 to the reader. The probability that the first 2 batters reach base and the last 3 make outs is OBA^2 * z^3 (independence justifies the multiplications). The factor of 6 is needed in p2 because there are 6 orderings in which exactly 2 of the first 4 batters can reach base (the 5th and final batter cannot reach base when computing p2).

For Bonds in 2002 we calculate p0 = .073, p1 = .128, and p2 = .149. Then P(B > 2) = 1 - .073 - .128 - .149 = .650 is the probability that at least 3 batters reach base in an inning. On average, E(B) = 3*OBO = 3*1.391 = 4.172 batters reach base per inning. That's 37.55 times on base per 27 outs for Bonds.

Calculating the expected number of runners left on base, LOB, requires 3 values: L1, L2, and L3. Lj is the probability that j (= 1, 2, 3) baserunners do NOT score in an inning in which exactly j runners reach base. We see easily that

L1 = 1 - h,

since one baserunner cannot score unless the hit is a homer (no advances are allowed on outs).

To derive L2 and L3, we first must separate singles into short, medium, and long singles and doubles into ordinary and long doubles.
The probabilities of these 3 types of singles are denoted s1, s2, and s3. Then s = s1 + s2 + s3, with short singles (s1) advancing baserunners one base (e.g. infield singles), with medium singles (s2) scoring a runner from second but moving runners on first only to second base, and with long singles (s3) advancing all baserunners 2 bases. Doubles are separated into ordinary and long doubles, d = d1 + d2, with long (base-clearing) doubles having probability d2. These subdivisions rarely are available so we use s1 = s2 = s3 = s/3, d1 = 2*d/3, and d2 = d/3, values that are close to Major League averages. Cover and Keilers (1977) adopted the more dramatic convention that all singles and all doubles are "long", i.e. s3 = s and d2 = d. However, the theory allows for any appropriate choice.

Now we can derive the probability L2 that two batters reach base in an inning without scoring, i.e. L2 = P(scoring 0 runs | B = 2). (The vertical bar "|" reads "given" in probability notation.) Then:

L2 = (w + s) * (w + s + d1) + d * (w + s1) + t * w
   = L1 * w + (w + s) * (s + d1) + d * s1.

L2 takes this form because the first successful batter can reach 1st base (probability = w + s) if the second successful batter does not advance him past 3rd base (w + s + d1), or the first batter can double (d) if the second batter walks or hits a short single (w + s1), or the first batter can triple (t) if the second batter walks (w). Any other combination of 2 successes produces at least one run. The second and slightly shorter expression for L2 can be similarly reasoned, or shown algebraically.

For L3 = P(scoring 0 runs | B = 3), the first 2 batters must not score (probability L2) and then the 3rd successful batter must walk (w) or possibly hit a short single (s1). In the latter case the first 2 batters must only have reached first and second bases, the probability of which is (w + s) * (w + s1 + s2) + d * w.
Thus:

L3 = L2 * w + ((w + s) * (w + s1 + s2) + d * w) * s1.

For Bonds, we calculate L1 = .871, L2 = .710, and L3 = .453. That is, a team of 9 Bonds would leave the bases loaded 45.3% of innings in which at least 3 men reach base.

Next we show that the expected number of runners left on base in an inning is:

LOB = (1 - p0) * L1 + (1 - p0 - p1) * L2 + (1 - p0 - p1 - p2) * L3.

To do this we need the (well-known and easily proved) result that the expected value E(X) of a non-negative integer-valued random variable X can be written

E(X) = P(X > 0) + P(X > 1) + P(X > 2) + P(X > 3) + more terms.

For each value of B = the number of batters who reach base, we apply this identity to LOB. Luckily, LOB only takes 4 values: 0, 1, 2, and 3, so

E(LOB | B) = P(LOB > 0 | B) + P(LOB > 1 | B) + P(LOB > 2 | B).

Of course E(LOB | B = 0) = 0, trivially, and

E(LOB | B = 1) = P(LOB = 1 | B = 1) = L1 = w + s + d + t = 1 - h

by definition of L1. Since LOB > 2 is impossible when B = 2, we have

E(LOB | B = 2) = P(LOB > 0 | B = 2) + P(LOB > 1 | B = 2) = L1 + L2.

Now P(LOB > 0 | B = 2) = L1 because it doesn't matter when B = 2 whether the first runner scores or not, but only that the second fails to score, and that probability is L1. Of course P(LOB > 1 | B = 2) = P(LOB = 2 | B = 2) = L2, by definition. Finally, for any value of b = 3, 4, 5, ...

E(LOB | B = b) = P(LOB > 0 | B = b) + P(LOB > 1 | B = b) + P(LOB > 2 | B = b) = L1 + L2 + L3.

This is reasoned similarly. When b = 3 we have P(LOB > 0 | B = 3) = L1 since it doesn't matter whether the first 2 runners score or not, but only that the third runner does not score. That probability is L1. P(LOB > 1 | B = 3) = P(the last 2 runners do not score) = L2. Of course P(LOB > 2 | B = 3) = P(LOB = 3 | B = 3) = L3 by definition. Next note that the preceding formula also holds for any B > 3 because the first B - 3 runners have to score.
So we only need to calculate E(LOB) for the last 3 runners and that was just done for B = 3.

Since P(B > 0) = 1 - p0, P(B > 1) = 1 - p0 - p1, and P(B > 2) = 1 - p0 - p1 - p2, we get the more general-looking form of the LOB expression, as listed at the head of this article. The results just proved can be summarized as a Theorem.

Runs Per Game Theorem.

LOB = (1 - p0) * L1 + (1 - p0 - p1) * L2 + (1 - p0 - p1 - p2) * L3
    = L1 * P(B > 0) + L2 * P(B > 1) + L3 * P(B > 2),
E(B) = 3 * OBO per inning,
RPG = E(Runs) = N * (E(B) - E(LOB)) if N innings are played.

Of course we have used N = 9 in RPG throughout.

Example. For Bonds 2002, using the values for p0, p1, p2 already calculated, we have P(B > 0) = .927, P(B > 1) = .799, and P(B > 2) = .650. Thus LOB = .871 * .927 + .710 * .799 + .453 * .650 = 1.669 runners per inning. The RPG Theorem says that 9 Bonds in 2002 would have averaged

RPG = 9 * (3*OBO - LOB) = 9 * (3 * 1.391 - 1.669) = 22.53 runs/game.

(Corrections are made here for round-off errors.) Did any team score over 22 runs even once in 2002? Even the all-time ML record of 29 runs in a game doesn't greatly exceed Bonds' RPG9 average.

Discussion and applications:

An easy to use RPG calculator is available at http://rpgcalc.butchwax.com
Readers can do their own calculations there and they can use it to verify the RPG calculations done here based on the data in a table below.

RPG ignores outs and advances on base. Bonds hit into 4 double plays and was caught stealing twice in 2002, both low values. These 6 baserunning outs cannot be accounted for exactly with the RPG9 formula, but accounting for them probably would reduce Bonds' RPG9 by about 5%. (This was estimated by converting 6 walks to 6 outs in the RPG9 calculation.) If Bonds had been thrown out on base (data unknown), that would lower his RPG9 further.
On the other hand, accounting for times when Bonds reached base via errors (unavailable), for his 9 successful steals, and for his advances on outs (unknown) would increase his RPG9.

We next consider typical American League team play, the AL being preferred to the NL for these calculations because having pitchers bat aggravates RPG distortions. The AL in 2001 scored 11013 runs in 20213 innings (8.92 innings/game), an average of 4.904 runs per 9 innings. About 4.4% of all outs were made on base (including double plays), i.e. about 1.2 runners/9 innings. It sometimes helps to remember that in a "typical" AL2001 game with 27 batting outs, a team ends with 37 AB, 10 hits, 2 doubles, 0.2 triples, 1 homer, 4 walks, 0 HP, 0 SF, and 4.65 runs (via an RPG calculation).

Based on data in the table below, RPG9 was 4.601 runs per 27 batting outs in average AL 2001 play. Multiplying RPG9 by 4.904/4.601 = 1.0658 (about 16/15) adjusts the theoretical RPG9 formula to the average runs the AL actually scored per 9 innings. The need for this adjustment is due partly to correcting for bases on errors, for advances on outs, and for outs on base. It also reflects the fact that managers bunch their best hitters together to maximize runs. Even though RPG9 misses some features of real baseball, if the factor (1.0658, say) were constant over many situations then RPG would rank players and teams properly, and if the correct factor actually were 1.0658, then RPG thusly adjusted would predict perfectly after multiplication by that constant.

Bill James' Runs Created Average, simplified to account for the same inputs as RPG9 uses (ignoring steals, double plays, etc), becomes RCA = OBO * SL, where SL = slugging average. We multiply this by 21.5 because, for the AL 2001 totals, 21.5 * RCA = 4.607 gives nearly the same prediction as RPG9 (4.601). Applied to Bonds2002 this gives 21.5*RCA = 23.89, a value similar to Bonds' RPG9 of 22.53.
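The scaled RCA values quoted above can be reproduced from the data table. The sketch below assumes only what the text states: RCA = OBO * SL with a scale factor of 21.5. The function name `scaled_rca` and its argument names are illustrative, not from the original paper.

```python
def scaled_rca(ab, hits, doubles, triples, homers, walks_hp, sf, scale=21.5):
    """Return scale * OBO * SL, the simplified Runs Created Average used in the text."""
    singles = hits - doubles - triples - homers
    total_bases = singles + 2 * doubles + 3 * triples + 4 * homers
    slugging = total_bases / ab                 # SL = total bases per at bat
    outs = ab - hits + sf                       # batting outs
    obo = (hits + walks_hp) / outs              # on base odds
    return scale * obo * slugging


# Bonds 2002 and the AL 2001 totals, with walks and HP added together.
bonds_2002 = scaled_rca(403, 149, 31, 2, 46, 198 + 9, 2)
al_2001 = scaled_rca(78134, 20852, 4200, 440, 2506, 7239 + 921, 685)
print(round(bonds_2002, 2), round(al_2001, 3))
```

Run on these two batting lines, the function should return values close to the 23.89 and 4.607 given in the text.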
Similarly, 21.5*RCA calculates to 19.72, 19.65, and 19.41 for Bonds2001, Ruth1923, and Williams1941, values similar to their RPG9 values of 17.15, 18.47, and 18.91. Bonds broke the RCA record in 2002, not just the RPG9 record. Next to Bonds' 22.53 RPG9, the highest RPG9 values for 2002 ML players (batting in all 9 positions) were those of Jim Thome and Manny Ramirez. Their RPG9s were 11.14 and 11.00 -- just half of Bonds' RPG9. Thome's and Ramirez' 21.5*RCA values were close to their RPG9s: 11.69 and 11.37.

The RPG formula can be used to approximate what would happen if a player bats 1/9 of the time on a team composed of 8 ordinary hitters and 1 special hitter. This will be denoted RPG(1/9). The 8 ordinary players arbitrarily are taken here to bat at the AL 2001 average, as in the table below. RPG(1/9) is obtained by applying the RPG formula to a team that has 8/9 of its plate appearances with the average statistics of the AL 2001 and the other 1/9 of its PAs with the statistics of a particular player. (Elsewhere this will be fine-tuned to account for the batting position of the particular player.) Of course this RPG(1/9) calculation is only approximate because it is calculated as if all 9 players were of equal (average) ability, and they're not. However, the exact calculation would be much more advanced.

Bonds' results may surprise those who think his walks would be much less valuable in a 1/9 context. As noted earlier, a player who walks 58.2% of the time (and makes outs the other 41.8%) adds .432 runs per game to an average team. (Note: Each "DIFF" in the table is the difference between 4.601 runs for an average team and the runs computed via RPG(1/9).) This contribution of .432 extra runs per game exceeds the average value of many of the game's offensive stars. Bonds in 2001 broke Ruth's 1923 record for RPG(1/9) by producing .941 more runs per game than an average player. The baseball world missed that record entirely!
Bonds broke the record again in 2002, adding 1.004 extra runs per game to an average team, and again this extremely important measure of batting success went unnoticed. (Using another "benchmark" besides the AL2001 season would change the 1.004 a bit, but probably not enough to keep Bonds from having the 2 greatest seasons ever for this statistic.) Jim Thome ranked second in 2002 MLB by adding .572 extra runs per game to an average team (see the DIFF column in the table).

Adding one extra run per game to a .500 team, as Bonds did, will produce, on average, an extra 14 wins in a 162 game season. That's about the number of games that San Francisco finished over .500, and just enough to get them into the playoffs. For comparison, Thome's statistics applied over 162 games would add about 8.5 wins to a .500 team, and no one else was that high.

Of course fairer comparisons would be made by position, and fairer comparisons would reflect the number of games played (Bonds missed 18?? games). Then players like Rodriguez would rise, shortstop typically being a weak-hitting position. Fairer statistics also would adjust for the difficulty of hitting in a player's home park, and that would help Bonds and others who played in 3-COM Park. These and other RPG issues will be considered elsewhere.

For now, I end by saying that Bonds' batting performance in 2002, particularly his RPG9 and RPG(1/9) statistics, underscores his wide margin over other players as baseball's best hitter this year. These statistics establish his 2002 performance as among the all-time best seasons, and it is a candidate for the best season ever.
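For readers who want to check the RPG9 and RPG(1/9) figures themselves, the following Python sketch strings together the formulas of the RPG Theorem. It assumes the conventions stated in the text (s1 = s2 = s3 = s/3, d1 = 2*d/3, HP counted with walks, SF counted in plate appearances), and it reads RPG(1/9) as a per-plate-appearance mixture of 1/9 player and 8/9 league average. The function names `rates`, `rpg`, and `rpg_one_ninth` are illustrative choices and are not taken from the RPG calculator cited above.

```python
def rates(ab, hits, doubles, triples, homers, walks_hp, sf):
    """Per-plate-appearance probabilities of (walk, single, double, triple, homer)."""
    pa = ab + walks_hp + sf
    singles = hits - doubles - triples - homers
    return tuple(x / pa for x in (walks_hp, singles, doubles, triples, homers))


def rpg(event_rates, innings=9):
    """Expected runs per game for a lineup whose batters all share event_rates."""
    oba = sum(event_rates)                       # on base average
    z = 1.0 - oba                                # probability of an out
    obo = oba / z                                # on base odds
    # Conditional probabilities given that the batter reaches base.
    w, s, d, t, h = (r / oba for r in event_rates)
    s1 = s2 = s / 3.0                            # short and medium singles
    d1 = 2.0 * d / 3.0                           # ordinary (not base-clearing) doubles
    l1 = 1.0 - h
    l2 = (w + s) * (w + s + d1) + d * (w + s1) + t * w
    l3 = l2 * w + ((w + s) * (w + s1 + s2) + d * w) * s1
    # Negative Binomial probabilities of 0, 1, 2 baserunners in an inning.
    p0 = z ** 3
    p1 = 3 * oba * z ** 3
    p2 = 6 * oba ** 2 * z ** 3
    lob = (1 - p0) * l1 + (1 - p0 - p1) * l2 + (1 - p0 - p1 - p2) * l3
    return innings * (3 * obo - lob)


def rpg_one_ninth(player_rates, average_rates):
    """RPG for a lineup with 1/9 of its PAs from the player, 8/9 from the average."""
    mixed = tuple(p / 9 + 8 * a / 9 for p, a in zip(player_rates, average_rates))
    return rpg(mixed)


bonds_2002 = rates(403, 149, 31, 2, 46, 198 + 9, 2)
al_2001 = rates(78134, 20852, 4200, 440, 2506, 7239 + 921, 685)
walker = rates(418, 0, 0, 0, 0, 582, 0)          # hypothetical 58.2% walker

print(round(rpg(bonds_2002), 2))                 # RPG9 for Bonds 2002
print(round(rpg(al_2001), 3))                    # RPG9 for AL 2001
print(round(rpg(walker), 2))                     # RPG9 for the walker
print(round(rpg_one_ninth(walker, al_2001), 3))  # RPG(1/9) for the walker
```

With the batting lines from the data table, these four calls should come out close to the tabulated 22.53, 4.601, 16.20, and 5.033. Subtracting the AL 2001 value from an RPG(1/9) result gives the DIFF column.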
DATA TABLE

Player    Year     AB   HITS     D     T    HR      W +  HP    SF   RPG9  RPG(1/9)   DIFF
Bonds     2002    403    149    31     2    46    198 +   9     2  22.53     5.605  1.004
Williams  1941    456    185    33     3    37    145 +   3    0*  18.91     5.479   .878
Ruth      1923    522    205    45    13    41    170 +   4    0*  18.47     5.481   .880
Bonds     2001    476    156    32     2    73    177 +   9     2  17.15     5.542   .941
Bonds     2002    403    149    31     2    46    198 +   9     2  22.53     5.605  1.004
Thome     2002    480    146    19     2    52    122 +   5     6  11.14     5.173   .572
Ramirez   2002    436    152    31     0    33     73 +   8     1  11.00     5.146   .545
Giles     2002    497    148    37     5    38    135 +   7     5  10.59     5.104   .503
Rodriguez 2002    624    187    27     2    57     87 +  10     4   8.49     4.995   .394
AL        2001  78134  20852  4200   440  2506   7239 + 921   685   4.60     4.601   .000
ALgame    2001     37     10     2   0.2     1      4 +   0     0   4.65     4.606   .005
walker    **      418      0     0     0     0    582 +   0     0  16.20     5.033   .432

* Note: Sacrifice flies unavailable for Ruth and Williams, so 0 is entered.
** Note: "walker" is a hypothetical player who walks 58.2% of the time and is out the other 41.8% (see text).
DIFF is the difference between RPG(1/9) and 4.601, the AL2001 RPG.

References:

Cover, T. and Keilers (1977)
James, Bill (19??) RCA (p)
Lindsey, George (1959)
Ladany and Machol, out of print
Morris, Carl (2002): This RPG paper and updates at "Sports Articles": http://www.fas.harvard.edu/~stats/CNMorris/CNMorris.html
Neyer, Rob and Barra, Allen, June 1996, ESPN Mag, "SLOB", pp 136-7
RPG calculator. http://rpgcalc.butchwax.com/
Schwarz, Alan (2002) "Managing with Markov", Harvard Magazine, May-June. Also, http://www.harvard-magazine.com/on-line/02mj/text/050221.html
Schwarz, Alan (2002) ESPN website: http://www.espn.go.com/mlb/columns/schwarz_alan/1436689.html
Other references and website(s)