Adventures in Big Data

When I decided to get “serious” about doing some baseball analysis I got myself a serious computer to do the heavy mathematical lifting: a prior-generation dual-processor Mac Pro. Of course I didn’t need it, but I also like to tinker with technology, and moving from a pretty fast 8-core PC to a screaming 12-core machine was a fun project in itself.

Fast forward a few months and I am finally doing the kind of computational task I had in mind for this computer: basically an iterative calculation on a huge dataset, where the calculation is done row by row but each row can be computed independently of all the others, which lends itself to a form of parallelization that R makes very easy.
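The row-independent pattern looks something like the sketch below, using the `parallel` package that ships with base R. The data frame, the per-row function, and the core count here are hypothetical stand-ins for illustration, not the actual analysis.

```r
library(parallel)

# Hypothetical dataset: any data frame where each row can be
# processed without reference to the others.
dat <- data.frame(x = 1:1000, y = runif(1000))

# A per-row calculation; each result depends only on that one row.
row_calc <- function(i) {
  dat$x[i] * dat$y[i]
}

# mclapply() forks worker processes (macOS/Linux) and farms the rows
# out across them; mc.cores controls how many run at once.
results <- mclapply(seq_len(nrow(dat)), row_calc, mc.cores = 4)
```

Because the rows share no state, the only tuning knob is `mc.cores`, which is what makes memory, rather than the algorithm, the binding constraint.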

I did some very careful tests on small data sets to figure out the “sweet spot” of how many cores I could deploy before RAM became a constraint (16 logical cores, or 8 actual), and came up with a good estimate of how long the computation would take when run on a whole year of data: 4.5 hours.

22 hours later, here I am waiting for it to finish. Still. I check the machine to make sure it’s still running 16 threads (it is), that the CPU is at about two-thirds capacity (it is), and that it’s not swapping things out to the hard drive (it isn’t).

So I inspect my code to see what I did wrong, and I find… a “<>” where an “=” should be. I’m running it on every year BUT the one I wanted.