
I have a case where I will be running R code on a data that will be downloaded from Hadoop. Then, the output of the R code will be uploaded back to Hadoop as well. Currently, I am doing it manually and I would like to avoid this manual downloading/uploading process.

Is there a way I can do this in R by connecting to hdfs? In other words, in the beginning of the R script, it connects to Hadoop and reads the data, then in the end it uploads the output data to Hadoop again. Are there any packages that can be used? Any changes required in Hadoop server or R?

I forgot to note the important part: R and Hadoop are on different servers.

KTY
  • Can I ask why you want to download data from hdfs? In general with hadoop, the point should be to bring the computation to the data. Not saying there's never a scenario where you would want to do that, just curious as to your use case. – devmacrile Oct 09 '15 at 20:26
  • I am not familiar how I would run R functions on a data in hadoop without reading it in R first. – KTY Oct 09 '15 at 20:29
  • are you able to install R on your Hadoop servers? Downloading the data to your R server seems costly... – Andrew Moll Oct 09 '15 at 21:08
  • No, we are not able to install R on Hadoop server . The size of the data also won't be an issue. This will be done on a regular basis so we would like to just do everything in R if possible. – KTY Oct 09 '15 at 21:16
  • Is this an answer? http://stackoverflow.com/questions/17583846/failed-to-remotely-execute-r-script-which-loads-library-rhdfs – IRTFM Oct 09 '15 at 23:06
  • Why the tag rhadoop? – piccolbo Oct 11 '15 at 16:41
  • Any other tag suggestions? Or any other place where I can ask this question? – KTY Oct 12 '15 at 02:09

2 Answers


Install the rmr2 package; its from.dfs function solves your requirement of getting the data from HDFS, as shown below:

input_hdfs <- from.dfs("path_to_HDFS_file", format = "format_columns")

For storing the results back into HDFS, you can pipe the output through the hadoop CLI: write.table(data_output, file = pipe(paste('hadoop dfs -put -', path_to_output_hdfs_file)), row.names = FALSE, col.names = FALSE, sep = ',', quote = FALSE)

Alternatively, you can use rmr2's to.dfs function to store the results back into HDFS.
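Put together, the pipe approach above looks like the sketch below. The HDFS path and sample data are hypothetical, and the write is guarded so it is a no-op when no hadoop client is on the PATH (which is the asker's situation, with R on a separate server):

```r
# Hypothetical output to upload; replace with your real results
data_output <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Hypothetical HDFS destination path
hdfs_out <- "/user/kty/output.csv"

# "hadoop fs -put -" reads the file body from stdin (older installs spell
# it "hadoop dfs"); paste()'s default separator supplies the needed space
cmd <- paste("hadoop fs -put -", hdfs_out)

# Only attempt the upload if a hadoop client is actually available
if (nzchar(Sys.which("hadoop"))) {
  con <- pipe(cmd, open = "w")
  write.table(data_output, file = con,
              row.names = FALSE, col.names = FALSE, sep = ",", quote = FALSE)
  close(con)
}
```

Note that this route still requires the hadoop client tools on the machine running R, so it fits the single-server case rather than the remote one.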


So... Have you found a solution for this?

Some months ago I stumbled upon the same situation. After fiddling around for a while with the Revolution Analytics packages, I couldn't find a way to make them work when R and Hadoop are on different servers.

I tried using webHDFS, which worked for me at the time. You can find an R package for webhdfs access here

The package is not available on CRAN, so you need to run:

devtools::install_github(c("saurfang/rwebhdfs"))

(yeah... You will need the devtools package)
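If the package's API doesn't fit your setup, the WebHDFS REST endpoints it wraps can also be called directly from base R. Everything in this sketch is an assumption: the namenode host, port, user, and paths are hypothetical, and the network calls are guarded behind a flag so the block does nothing without a live cluster:

```r
# Hypothetical namenode address (50070 is the classic default WebHDFS port)
namenode <- "http://hadoop-nn.example.com:50070"
user     <- "kty"  # hypothetical HDFS user

# Build a WebHDFS URL for a given HDFS path and operation
webhdfs_url <- function(path, op) {
  paste0(namenode, "/webhdfs/v1", path, "?op=", op, "&user.name=", user)
}

open_url   <- webhdfs_url("/user/kty/input.csv", "OPEN")     # read
create_url <- webhdfs_url("/user/kty/output.csv", "CREATE")  # write

run_against_cluster <- FALSE  # flip on when a cluster is reachable
if (run_against_cluster) {
  # Reading: base R's url() connection follows the namenode's redirect
  # to a datanode and streams the file contents
  df <- read.csv(url(open_url))

  # Writing: WebHDFS expects a PUT with op=CREATE to the namenode, which
  # replies with a 307 redirect to a datanode that accepts the file body;
  # curl with -L -T is one way to drive that two-step dance from R
  write.csv(df, "local_out.csv", row.names = FALSE)
  system(paste("curl -s -L -T local_out.csv -X PUT", shQuote(create_url)))
}
```

Because this goes over HTTP, it works when R and Hadoop are on different servers, provided WebHDFS is enabled on the cluster.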

DanP