
I'm trying to read a text or gz file from HDFS and run a simple mapreduce job (actually only the map step), but I get an error that suggests the readLines part doesn't work. Can I use the readLines function inside a mapreduce job? P.S. There is no problem if I just use readLines to parse HDFS files outside of a mapreduce job. Thanks.

counts <- function(path){
    ct.map <- function(., lines) {
        line <- readLines(lines)
        word <- unlist(strsplit(line, pattern = " "))
        keyval(word, 1)
    }

    mapreduce(
        input = path,
        input.format = "text",
        map = ct.map
    )
}
counts("/user/ychen/100.txt")
jeremycg
chelsea

1 Answer


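Not like that. The map function receives data that has already been read from the DFS, not a path or connection, so calling readLines inside it fails. You could rewrite your function like this, reading the file and writing it to the DFS in the input step: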

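library(rmr2)

counts <- function(path) {
  # Map: split each line into words and emit (word, 1) pairs
  ct.map <- function(., line) {
    word <- unlist(strsplit(line, split = " "))
    keyval(word, 1)
  }

  mapreduce(
    # Read the file up front and write its lines to the DFS
    input = to.dfs(readLines(path)),
    map = function(k, v) { ct.map(k, v) },
    # Reduce: count the occurrences of each word
    reduce = function(k, v) { keyval(k, length(v)) }
  )
}
output <- from.dfs(counts("/user/ychen/100.txt"))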

I also added a reduce step to sum the values for each word.
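If it helps to see what the map and reduce steps compute without a cluster, here is a minimal base R sketch of the same word count. It does not use rmr2 at all, and the `lines` vector is a made-up stand-in for the contents of the file, so treat it as an illustration of the logic rather than of how the job actually runs.

```r
# Hypothetical sample input standing in for the lines of the file
lines <- c("a b a", "b c")

# Map step: split each line on spaces and pair every word with a 1
words <- unlist(strsplit(lines, split = " "))
ones <- rep(1, length(words))

# Reduce step: group the 1s by word and count them,
# mirroring what length(v) does per key in the reduce function
word_counts <- tapply(ones, words, length)
print(word_counts)
```

Each element of `word_counts` is named after a word and holds how many times that word appeared, which is the same shape of result the reduce step emits as key/value pairs.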

jeremycg