R is a statistical programming language, and Hadoop is a way to do distributed computations. Put them together and you get RHIPE, a way to wield Hadoop from within R -- effectively, a tool that lets you divide up your big number-crunching calculations into small bits and have them magically go off to multiple computers where Statistics! Magic! happens, then recombine into an answer on the other side. (Unsurprisingly, this technique is called "divide and recombine.")
I've elected to take the project track for my R programming class, because plunging in over my head on statistics (which I'm fuzzy on), Hadoop (I barely know what it is), and R (saw it for the first time last month) seems like a fun way to learn, and I'm continuing to take notes here so people can follow along.
Now, the disclaimer here is that it's going to be tough to replicate our environment. We have a Hadoop cluster set up by the stats department's grad students (thanks, Jeff and Yang!) to play with, so I got to skip most of the steps in the RHIPE installation instructions; you will need to do them yourself if you want to play along. Getting our environment variables configured (installation step #4) took most of today, followed by starting RHIPE and running some tests to verify it was working. I'll walk you through what I did.
Install the prerequisites. Follow the instructions for prerequisites 1-3 on https://www.datadr.org/install.html. (I didn't have to do this, as I already had a working system.) Alternatively, you can download the virtual machine and skip this and the next step, though I'm not sure how that will affect performance.
Set your environment variables. I put the following in my .bashrc; your mileage may vary.
# configuration provided by Jeff Li for RHIPE development
export LD_LIBRARY_PATH=/opt/R-2.15.1/src/src/main/:/opt/protobuf/lib
export HADOOP=/usr/lib/hadoop/
export HADOOP_BIN=$HADOOP/bin
export R_LIBS=$HOME/R_LIBS
If you edit your .bashrc (or other config file), don't forget to run source .bashrc afterwards so that the new environment variables are applied. (Otherwise your current shell, and any R session started from it, won't see them until you open a new shell.)
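A quick way to confirm the variables actually made it through is to check them from R, since that is where RHIPE will need them. This is a minimal sketch in base R; missing_env_vars is a hypothetical helper name made up for illustration, not part of RHIPE, and the variable names are just the ones from the .bashrc snippet above.

```r
# Variables exported in the .bashrc snippet above.
required_vars <- c("LD_LIBRARY_PATH", "HADOOP", "HADOOP_BIN", "R_LIBS")

# Return the names of any variables that are unset or empty.
# (missing_env_vars is a hypothetical helper, not a RHIPE function.)
missing_env_vars <- function(vars) {
  values <- Sys.getenv(vars, unset = "")
  vars[values == ""]
}

missing <- missing_env_vars(required_vars)
if (length(missing) > 0) {
  message("Not set: ", paste(missing, collapse = ", "))
} else {
  message("All RHIPE-related environment variables are set.")
}
```

If anything shows up as not set, the usual suspects are a forgotten source .bashrc or an R session that was started before the variables were exported.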
Install RHIPE. This is easy. As commanded by the install instructions, simply run
wget https://github.com/downloads/saptarshiguha/RHIPE/Rhipe_0.69.tar.gz
R CMD INSTALL Rhipe_0.69.tar.gz
If you're like me and needed to install RHIPE in your home directory because of permissions (I put it in a folder called R_LIBS in my homedir), do...
wget https://github.com/downloads/saptarshiguha/RHIPE/Rhipe_0.69.tar.gz
mkdir -p ~/R_LIBS
R CMD INSTALL -l ~/R_LIBS Rhipe_0.69.tar.gz
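To check that R can actually see a package installed into a home-directory library, something like the following should work from an R prompt. It is a sketch using only base R functions, and it assumes R_LIBS was exported before R was started, as in the .bashrc above.

```r
# Directories R searches for packages. ~/R_LIBS should appear here
# if R_LIBS was exported (and the directory exists) before R started.
print(.libPaths())

# With quiet = TRUE, find.package() returns character(0) instead of
# raising an error when the package cannot be found.
rhipe_path <- find.package("Rhipe", quiet = TRUE)
if (length(rhipe_path) == 0) {
  message("Rhipe is not visible to this R session.")
} else {
  message("Rhipe found at: ", rhipe_path)
}
```

R silently drops library directories that don't exist, which is one reason to create the R_LIBS folder before starting R.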
Start R and start RHIPE. If you don't know how to start R, you should really read a basic R tutorial at this point because the rest of this will happen inside R. Once you've got an R prompt:
library("Rhipe")
rhinit()
And RHIPE is started. You have to do this every time you start R when you're working with RHIPE, so I found it useful to put both lines in my .First function (discussed at the end of this post).
.First <- function() {
  library("Rhipe")
  Sys.sleep(3)
  rhinit()
}
It doesn't quite work; the library("Rhipe") call seems to run fine, but I still need to run rhinit() manually each time. I get the error message Error in Rhipe:::.rhinit(errors, info, path, cleanup, bufsize, buglevel) : could not find function "read.table" even after adding the Sys.sleep(3) pause (thinking that it might be trying to go before the library is fully loaded), but I'll ask and see if there's something I'm missing.
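One possible explanation, offered as an assumption rather than something confirmed on this cluster: R's startup documentation says .First runs before the default packages (utils, stats and friends) are attached, and read.table lives in utils. If that is what's going on, attaching utils explicitly before calling rhinit() may be enough, and the Sys.sleep(3) pause wouldn't be needed. A minimal sketch of that variant:

```r
.First <- function() {
  # Assumption: read.table is not found because utils has not been
  # attached yet when .First runs, so attach it explicitly first.
  library(utils)
  library("Rhipe")
  rhinit()
}
```

Defining the function does nothing by itself; R calls .First() at the start of the next session if it finds it in your .Rprofile or saved workspace.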
Test out RHIPE. If the tests on https://www.datadr.org/doc/installation.html#tests are working, you're in good shape.
Play around with some RHIPE functions. The tests above should have created something at /tmp/x which you can now practice copying, deleting, etc. Use this functions list as a reference, but not everything will be relevant -- so just try out these:
rhls("/tmp")                 # ls for RHIPE, lists things in /tmp
rhcp("/tmp/x", "/tmp/x-cp")  # cp for RHIPE, copies /tmp/x to /tmp/x-cp
rhdel("/tmp/x")              # rm for RHIPE, removes /tmp/x
rhcp() is for things within the HDFS (Hadoop Distributed File System, a Hadoop thing that takes all the distributed places your data's being stored and makes one giant filesystem out of them -- my understanding is that it's sort of like LVM). If you want to copy something from outside the HDFS into the HDFS, you use rhput("/file/from/outside/HDFS", "/place/inside/HDFS/to/copy/to") instead.
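Here's a rough sketch of an rhput() round trip, using only the RHIPE calls already shown above in their two-argument and one-argument forms. The HDFS path /tmp/rhput-demo.csv is made up for illustration, and the RHIPE part is guarded so that it only runs where Rhipe is installed and a cluster is reachable.

```r
# Create a small local file to upload (the contents are arbitrary).
local_file <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")),
          local_file, row.names = FALSE)

# Hypothetical HDFS destination; pick any path you can write to.
hdfs_dest <- "/tmp/rhput-demo.csv"

# Only touch the cluster if Rhipe is actually installed.
if (requireNamespace("Rhipe", quietly = TRUE)) {
  library("Rhipe")
  rhinit()
  rhput(local_file, hdfs_dest)  # local filesystem -> HDFS
  print(rhls("/tmp"))           # the new file should show up in the listing
  rhdel(hdfs_dest)              # clean up afterwards
}
```

The first argument is always a path on the machine running R and the second is a path inside the HDFS, which is the opposite situation from rhcp(), where both paths live in the HDFS.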