R Programming: Divide & Recombine, and RHIPE (Hadoop in R)

Today's topic is Divide and Recombine in R.

Divide and Recombine breaks things into three stages: D(ivide) operations that break the giant dataset into subsets, W(ithin) operations that analyze each subset on its own, and B(etween) operations that combine the per-subset results into a final answer.
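To make the three stages concrete for myself, here's a tiny sketch in plain base R (no Hadoop, no RHIPE). The choices are all my own assumptions for illustration: the built-in mtcars data stands in for the giant dataset, cylinder count is the dividing variable, the within-subset analysis is a simple linear fit, and the recombination is just an average of the coefficients. A real D&R analysis would pick each of these to suit the data.

```r
# D: divide the data into subsets (here, one subset per cylinder count)
subsets <- split(mtcars, mtcars$cyl)

# W: analyze each subset on its own (here, a linear fit of mpg on weight)
fit_one <- function(d) coef(lm(mpg ~ wt, data = d))
within_results <- lapply(subsets, fit_one)

# B: recombine the per-subset results (here, average the coefficients)
coef_matrix <- do.call(rbind, within_results)
recombined <- colMeans(coef_matrix)

print(coef_matrix)
print(recombined)
```

The W step is the part that can run in parallel, since each call to fit_one only ever sees its own subset.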

This means the per-subset work runs in parallel, probably on top of something built for parallel computation like Hadoop. One useful library for this is RHIPE (the R and Hadoop Integrated Programming Environment), a go-between for R and Hadoop: it lets you work with Hadoop by programming entirely in R. Hadoop does the D operations, and R does the W and B operations. Shiny.

When you're splitting large datasets into small ones for D&R, there's a tradeoff. If you make your subsets large, your processors are chugging away on giant data subsets, which takes a long time and defeats the purpose of splitting the data in the first place. But if you make your subsets small, there are a lot of them, and it takes a long time to put the results back together. So you want to land somewhere in the middle: subsets small enough to compute on efficiently, but few enough in number that they can be recombined efficiently as well.
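A quick way to see the tradeoff is to count how many subsets you end up with at different subset sizes. This is only a sketch in base R: divide_rows is a hypothetical helper I made up for illustration, and the row count and sizes are arbitrary. It doesn't measure any actual compute or recombination time, just the number of pieces that would have to be put back together.

```r
# Hypothetical helper: divide n row indices into chunks of at most `size` rows
divide_rows <- function(n, size) {
  idx <- seq_len(n)
  split(idx, ceiling(idx / size))
}

n <- 1000  # pretend dataset with 1000 rows
for (size in c(10, 100, 500)) {
  chunks <- divide_rows(n, size)
  cat(sprintf("subset size %d -> %d subsets to recombine\n",
              size, length(chunks)))
}
```

Small subsets mean many pieces to recombine, and big subsets mean few pieces but more work inside each one, which is the middle ground the tradeoff is asking you to find.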

This stuff matters because we've got people on this campus who crunch datasets that take 3 weeks to process. That's why we have ridiculously awesome Linux clusters available on campus. (For instance, one of them is 5 machines, each with 2x six-core Intel Xeon 2.4GHz processors, 64GB of memory, and 12x 2TB 7.2k RPM SATA disks in 3 RAID 10s, each with its own independent channel out to the processors. Those are the R/Hadoop ones; the 2 R/RHIPE machines have 128GB of memory and 16 cores, and... I don't even know what I would do with that sort of firepower, good grief. And that's just one of the clusters.)

One of the options for this class (which is breaking into multiple tracks based on experience with R) is to play with RHIPE and Hadoop. Prof. Cleveland says that folks who want to dive into RHIPE need to have a bunch of R experience. I have none. I should probably not do this, but... but... I am tempted, because -- I mean, it's a cool new FOSS project! Let me find out what the other tracks are, and then I'll decide.