
In the following example:

  small.ints = to.dfs(1:1000)
  mapreduce(
    input = small.ints, 
    map = function(k, v) cbind(v, v^2))

The input to the mapreduce function is an object named small.ints, which refers to blocks in HDFS.

Now I have a CSV file already stored in HDFS at

"hdfs://172.16.1.58:8020/tmp/test_short.csv"

How do I get an object for it?

As far as I know (which may be wrong), if I want data from a CSV file as input for mapreduce, I first have to generate a table in R that contains all the values in the CSV file. I do have a method like:

data=from.dfs("hdfs://172.16.1.58:8020/tmp/test_short.csv",make.input.format(format="csv",sep=","))
mydata=data$val

This method works for getting mydata, after which I can do object=to.dfs(mydata). The problem is that test_short.csv is huge, on the order of terabytes, and memory can't hold the output of from.dfs!

Actually, I'm wondering: if I pass "hdfs://172.16.1.58:8020/tmp/test_short.csv" to mapreduce as the input directly and call from.dfs() inside the map function, will I be able to get the data blocks?

Any advice is welcome!

Hao Huang

2 Answers


mapreduce(input = path, input.format = make.input.format(...), map ...)

from.dfs is for small data. In most cases you won't use from.dfs in the map function: the map function's arguments already hold a portion of the input data.
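
The call above can be fleshed out as the following minimal sketch. It is not the answerer's exact code: it assumes the rmr2 local backend so that it runs without a cluster, and it uses a small temporary CSV (the hypothetical `csv.path`) as a stand-in for the file in HDFS. On a real cluster you would drop the `rmr.options` line and pass the HDFS path as `input`.

```r
library(rmr2)

# Assumption: run on the local backend so no Hadoop cluster is needed.
rmr.options(backend = "local")

# Hypothetical stand-in for the CSV in HDFS: two numeric columns, no header.
csv.path <- tempfile(fileext = ".csv")
write.table(data.frame(x = 1:10, y = (1:10)^2), csv.path,
            sep = ",", row.names = FALSE, col.names = FALSE)

# Tell mapreduce how to parse the raw file into chunks of rows.
csv.format <- make.input.format(format = "csv", sep = ",")

out <- mapreduce(
  input = csv.path,            # a path, not an object created by to.dfs
  input.format = csv.format,
  # v is a data frame holding one chunk of the file's rows;
  # here each row emits its first column as key and (col2 - col1) as value.
  map = function(k, v) keyval(v[[1]], v[[2]] - v[[1]]))

# Pulling the result into memory is acceptable only because it is tiny.
result <- from.dfs(out)
```

The point of the design is that the file is never loaded into R as a whole: each call to the map function sees only one chunk, so the input size is not bounded by the memory of a single R session.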

piccolbo
  • Oh, I didn't notice I can put the input format in mapreduce() argument! I read your wiki which said to.dfs and from.dfs are only used for small data and testing. Thank you for your help! – Hao Huang Aug 07 '13 at 18:49

You can do something like below:

r.file <- hdfs.file(hdfsFilePath,"r")
from.dfs(
    mapreduce(
         input = as.matrix(hdfs.read.text.file(r.file)),
         input.format = "csv",
         map = ...
))

Please upvote if this helps; I hope someone finds it useful.

Note: For details, refer to the Stack Overflow post "How to input HDFS file into R mapreduce for processing and get the result into HDFS file".

somnathchakrabarti