Migrating to new blog

To the extent anyone has ever read this site (unlikely!) and cares about it (less likely!), I am moving to a new blog: http://blababoutball.wordpress.com.


Adventures in Big Data

When I decided to get “serious” about doing some baseball analysis I got myself a serious computer to do the heavy mathematical lifting: a prior-generation dual-processor Mac Pro. Of course I didn’t need it, but I also like to tinker with technology, and upgrading it from a pretty fast 8-core PC to a screaming 12-core machine was a fun project in itself.

Fast forward a few months and I am finally doing the kind of computational task I had in mind for this computer… basically an iterative calculation on a huge dataset, where the calculation is done row by row but each row can be done independently of all others, thus lending itself to parallelization of a form that R lets you do very easily.

I did some very careful tests on small data sets to figure out the “sweet spot” of how many cores I could deploy before RAM became a constraint (16 logical cores, or 8 actual), and came up with a good estimate of how long the computation would take when run on a whole year of data: 4.5 hours.

22 hours later, here I am waiting for it to finish. Still. I check the machine to make sure it’s still running 16 threads (it is), that the CPU is at about 2/3 capacity (it is), and it’s not buffering things to the hard drive (it isn’t).

So I inspect my code to see what I did wrong, and I find… a “<>” where a “=” should be. I’m running it on every year BUT the one I wanted.
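For what it's worth, a cheap guard would have caught this before the 22 hours were spent. Here is a minimal Python sketch of the idea (my actual job was in R and isn't shown here): filter to the target year, confirm that the filter kept only that year, and only then fan the independent rows out across worker processes. The names `select_year`, `score_row`, and `run`, and the `year`/`value` fields, are all made up for illustration.

```python
from concurrent.futures import ProcessPoolExecutor


def select_year(rows, year):
    """Keep only rows from `year` (an equality test, not an inequality)."""
    return [row for row in rows if row["year"] == year]


def score_row(row):
    """Stand-in for the real per-row calculation, which is not shown in the post."""
    return row["value"] ** 2


def run(rows, year, workers=4):
    """Filter to one year, sanity-check the filter, then process rows in parallel."""
    selected = select_year(rows, year)
    years_seen = {row["year"] for row in selected}
    # Cheap guard before the expensive part: the selection must contain
    # the target year and nothing else (an empty selection also fails).
    if years_seen != {year}:
        raise ValueError(f"expected only {year}, got {sorted(years_seen)}")
    # Each row is independent of the others, so a plain parallel map is enough.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_row, selected, chunksize=1000))
```

When calling `run` from a script, put the call under an `if __name__ == "__main__":` guard so the worker processes can start cleanly on every platform.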


OBP By Base-Out State

In response to this comment over at Tango’s blog:

Base Runners   2010-2014             1993-2009             1969-1992             1953-1968
1B 2B 3B       0 out  1 out  2 out   0 out  1 out  2 out   0 out  1 out  2 out   0 out  1 out  2 out
_  _  _        .320   .313   .315    .337   .330   .333    .327   .322   .325    .325   .317   .326
1B _  _        .342   .337   .321    .358   .356   .336    .347   .344   .323    .338   .342   .318
_  2B _        .344   .338   .333    .350   .356   .354    .326   .335   .340    .318   .330   .340
1B 2B _        .329   .328   .309    .354   .342   .325    .337   .328   .311    .336   .332   .310
_  _  3B       .356   .366   .330    .366   .383   .352    .339   .359   .347    .333   .356   .348
1B _  3B       .351   .341   .326    .364   .356   .345    .339   .336   .320    .334   .328   .326
_  2B 3B       .346   .358   .321    .364   .357   .351    .331   .337   .333    .330   .343   .341
1B 2B 3B       .331   .331   .303    .344   .341   .322    .337   .320   .309    .351   .321   .302

Note bases loaded and 2 out: across all four observed eras, this base/out state results in the lowest OBP.
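That claim is easy to check against the table. A small Python sketch, with the numbers transcribed from the table above; the state labels ("123" for bases loaded, "___" for bases empty, and so on) are just shorthand introduced for this example.

```python
ERAS = ["2010-2014", "1993-2009", "1969-1992", "1953-1968"]

# One row per runner state; within each era the columns are 0 out, 1 out, 2 out.
# The first cell is read as .320.
OBP = {
    "___": [.320, .313, .315, .337, .330, .333, .327, .322, .325, .325, .317, .326],
    "1__": [.342, .337, .321, .358, .356, .336, .347, .344, .323, .338, .342, .318],
    "_2_": [.344, .338, .333, .350, .356, .354, .326, .335, .340, .318, .330, .340],
    "12_": [.329, .328, .309, .354, .342, .325, .337, .328, .311, .336, .332, .310],
    "__3": [.356, .366, .330, .366, .383, .352, .339, .359, .347, .333, .356, .348],
    "1_3": [.351, .341, .326, .364, .356, .345, .339, .336, .320, .334, .328, .326],
    "_23": [.346, .358, .321, .364, .357, .351, .331, .337, .333, .330, .343, .341],
    "123": [.331, .331, .303, .344, .341, .322, .337, .320, .309, .351, .321, .302],
}


def lowest_state_by_era():
    """Return {era: ((runner_state, outs), obp)} for the lowest-OBP cell in each era."""
    result = {}
    for i, era in enumerate(ERAS):
        cells = [
            ((state, outs), row[3 * i + outs])
            for state, row in OBP.items()
            for outs in range(3)
        ]
        result[era] = min(cells, key=lambda cell: cell[1])
    return result


for era, ((state, outs), obp) in lowest_state_by_era().items():
    print(era, state, f"{outs} out", f"{obp:.3f}")
```

Every era comes back as bases loaded with 2 out.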

Edit: updated to correct a query error.

Tools: Installing MySQL on OS X

If you’ve ever installed MySQL on Linux, you know how easy it is. If you’ve ever installed MySQL on OS X, you know what a terrible pain-in-the-ass it is. It’s a pain whether you install the latest package directly from mysql.org or install via MacPorts. To actually make it work, there are about a half dozen post-installation steps you need to follow, none of which are documented anywhere reasonable.

Enter the lovely folks at Mac Mini Vault. They have created and published (via GitHub) a script that installs MySQL from beginning to end, along with several other useful scripts.

If you’re a Mac user and you want to use MySQL for sabermetric analysis, save yourself some headaches and use the script.

Situational Fastball Usage

In a comment on my THT article on pitch sequencing, MGL made the following observation:

One other thing that must be controlled for is game situation and that could be significantly affecting the results. For example, when the pitching team is ahead, especially way ahead, later in the game, the pitcher is more likely to throw a fastball on all pitches, more likely to throw a strike, etc. The batting team is more likely to be taking more pitches, etc.

Unsurprisingly given the source, this is absolutely correct!

To test MGL’s assertion, I computed the percentage of fastball variants (FF, FA, FT, FC, FS, SI in PITCHf/x) thrown by every pitcher from 2008-2014, broken out by platoon, inning, count, index (i.e. whether the pitch was the 1st, 2nd, 3rd, etc. of the PA), and run differential (i.e. how many runs ahead or behind their team was at the time). I then used the delta method to compare (a) the percentage of fastballs thrown in the 7th inning or later when the pitcher’s team was ahead by 4 or more runs, to (b) the percentage of fastballs thrown in the 7th inning or later when the run differential was between 1 and -1.
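For readers who want to see the shape of that comparison, here is a rough Python sketch. It assumes "delta method" in the usual sabermetric sense: compute each pitcher's fastball rate in both situations, take the difference, and average those differences weighted by the smaller of the pitcher's two sample sizes. The tuple layout and the toy pitches are invented for illustration, and the real breakdown by platoon, count, and pitch index is left out.

```python
from collections import defaultdict

# Fastball variants as listed above (PITCHf/x pitch-type codes).
FASTBALLS = {"FF", "FA", "FT", "FC", "FS", "SI"}


def situation(inning, run_diff):
    """Label a pitch 'far_ahead', 'close', or None (not part of the comparison).

    run_diff is from the pitching team's point of view (positive = ahead).
    """
    if inning < 7:
        return None
    if run_diff >= 4:
        return "far_ahead"
    if -1 <= run_diff <= 1:
        return "close"
    return None


def delta_fastball_rate(pitches):
    """Weighted mean of per-pitcher (far-ahead rate minus close-game rate)."""
    # counts[pitcher][situation] = [fastballs, total pitches]
    counts = defaultdict(lambda: {"far_ahead": [0, 0], "close": [0, 0]})
    for pitcher, inning, run_diff, pitch_type in pitches:
        label = situation(inning, run_diff)
        if label is None:
            continue
        cell = counts[pitcher][label]
        if pitch_type in FASTBALLS:
            cell[0] += 1
        cell[1] += 1

    weighted_sum = 0.0
    total_weight = 0
    for by_situation in counts.values():
        fb_ahead, n_ahead = by_situation["far_ahead"]
        fb_close, n_close = by_situation["close"]
        weight = min(n_ahead, n_close)  # pitchers seen in only one situation drop out
        if weight == 0:
            continue
        weighted_sum += weight * (fb_ahead / n_ahead - fb_close / n_close)
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None


# Toy data: (pitcher, inning, run_diff, pitch_type)
toy = [
    ("A", 8, 5, "FF"), ("A", 8, 5, "SI"), ("A", 9, 4, "FC"), ("A", 9, 4, "SL"),
    ("A", 7, 0, "FF"), ("A", 7, 1, "CU"),
    ("A", 3, 5, "FF"),                      # too early in the game, ignored
    ("B", 7, 6, "FF"), ("B", 7, 6, "FT"),
    ("B", 8, -1, "FF"), ("B", 8, -1, "CH"),
    ("B", 8, 2, "FF"),                      # neither far ahead nor close, ignored
    ("C", 9, 0, "FF"),                      # no far-ahead pitches, so no pairing
]
print(delta_fastball_rate(toy))  # 0.375
```

The result is a difference in fastball rate, so 0.375 here means the toy pitchers threw fastballs 37.5 percentage points more often when far ahead.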

Here are some selected findings:

  • On the first pitch, pitchers threw fastballs at a 5.2% higher rate when far ahead
  • On the second pitch (ignoring count), pitchers threw fastballs at a 1.5% higher rate when far ahead
  • Irrespective of count or pitch index, pitchers threw fastballs at a 3.6% higher rate overall when far ahead
  • In 1-0 counts, pitchers threw fastballs at a 6.7% higher rate when far ahead
  • In 0-1 counts, pitchers threw fastballs at a 2.8% lower rate when far ahead

In deeper counts the sample sizes quickly get small, so I’ll stop at 1-0 and 0-1. But the results are definitive: pitchers lean more on their fastball late in games when their team is far ahead. The astute reader will note that pitchers threw fastballs less frequently in 0-1 counts when far ahead (this is also the case for 1-1 and 0-2 counts, in smaller samples), but that is not enough to offset the higher fastball rates in other counts.

In fact, the effect shows up whenever the pitcher’s team is far ahead, not just late in games. In innings 4 through 6, pitchers threw fastballs at a 4.1% higher rate on the first pitch, and a 2.3% higher rate overall, when far ahead than in close games.

So MGL’s observation is exactly right: to really do this kind of pitch-level analysis correctly, we need to control for much more than I did in my last article.