Can I use readLines in a mapreduce job in RHadoop?

Question

I'm trying to read a text or gz file from HDFS and run a simple mapreduce job (actually only the map step), but I get an error that suggests the readLines call doesn't work. Can I use the readLines function inside a mapreduce job? P.S. There is no problem if I use readLines to parse HDFS files outside of a mapreduce job. Thanks.

counts <- function(path){
  ct.map <- function(., lines) {
    line <- readLines(lines)
    word <- unlist(strsplit(line, pattern = " "))
    keyval(word, 1)
  }

  mapreduce(
    input = path,
    input.format = "text",
    map = ct.map
  )
}
counts("/user/ychen/100.txt")

score 0 · Answer 1 · answered Jul 23 '15 at 03:20


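Not like that: the map function expects data that has already been read from the DFS, so its second argument holds the lines themselves, not a file or connection that readLines could open. You could rewrite your function like this, formatting in the input step: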

counts <- function(path){
  ct.map <- function(.,line) {
    word <- unlist(strsplit(line, split = " "))
    keyval(word, 1)
  }

  mapreduce(
    input = to.dfs(readLines(path)),
    map = function(k,v){ct.map(k,v)},
    reduce = function(k,v){keyval(k,length(v))}
  )
}
output<-from.dfs(counts("/user/ychen/100.txt"))

I also added in a reduce step, to sum the values.

answered Jul 23 '15 at 03:20

jeremycg


I get the same error, because readLines() can't connect to HDFS files. – chelsea Jul 24 '15 at 07:41
Then try input = path directly, if path is a valid HDFS path. – piccolbo Sep 20 '15 at 19:55
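
Putting the answer and the last comment together, a minimal sketch might look like the following. It assumes the rmr2 package is installed and configured for the cluster, and that rmr2's "text" input format passes the map function a character vector of lines as its second argument, so no readLines call is needed. The helper name split_words is made up for illustration; it is not part of rmr2.

```r
# Hypothetical helper (not part of rmr2): split a character vector of
# lines into individual words, dropping empty strings that result from
# repeated spaces.
split_words <- function(lines) {
  words <- unlist(strsplit(lines, split = " ", fixed = TRUE))
  words[nzchar(words)]
}

# Word count that reads straight from an HDFS path.
# Assumption: with input.format = "text" the map function receives the
# lines themselves as its second argument, so there is nothing to open.
counts <- function(path) {
  rmr2::mapreduce(
    input = path,
    input.format = "text",
    map = function(., lines) rmr2::keyval(split_words(lines), 1),
    reduce = function(word, ones) rmr2::keyval(word, sum(ones))
  )
}

# Usage on a cluster (path taken from the question):
# output <- rmr2::from.dfs(counts("/user/ychen/100.txt"))
```

The splitting logic lives in a plain R function so it can be checked on an ordinary character vector before the job is submitted.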
