R Hadoop counting

Question

I'm new to R and I have a problem with MapReduce using rmr2. I have to read a file like this, where each row contains a date followed by some words (A, B, C, ...):

2016-05-10, A, B, C, A, R, E, F, E
2016-05-18, A, B, F, E, E
2016-06-01, A, B, K, T, T, E, G, E, A, N
2016-06-03, A, B, K, T, T, E, F, E, L, T

and I want to obtain output something like:

2016-05: A 3 
2016-05: E 4
2016-05: E 4

I have already solved this with a Java implementation. Now I have to do the same in R, but I still need to figure out how to write my reducer. Also, is there a way to print from inside my mapper and reducer code? When I call print inside the mapper or the reducer, I get an error in RStudio.

Sys.setenv(HADOOP_STREAMING = "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.8.0.jar")
Sys.setenv(HADOOP_HOME = "/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop") 

library(stringr)
library(rmr2)
library(stringi)
customMapper = function(k,v){
  # v is a character vector of input lines, so handle every line, not just the first
  tmp = stri_split_fixed(v, pattern= ",", n = 2)
  # keep only the "YYYY-MM" part of each date
  onlyYearMonth = substr(stri_trim_both(sapply(tmp, `[`, 1)), 1, 7)
  # the remaining comma-separated words of each line, without surrounding spaces
  words = lapply(tmp, function(p) stri_trim_both(unlist(strsplit(p[2], ","))))
  # emit one (year-month, word) pair per word
  keyval(rep(onlyYearMonth, lengths(words)), unlist(words))
}

customReducer = function(k,v) {
    # Are all the values with the same date key k available here?
    elementsWithSameDate = unlist(v)

    # TODO: define something similar to a Java Map to count the elements with the same date
    # myMap

    for(elWithSameDate in elementsWithSameDate) {

      words = unlist(strsplit(elWithSameDate,","))
      for(word in words) {
        compositeNewK = paste(k,":",word)
        # if myMap contains compositeNewK
        #      myMap(compositeNewK, 1 + myMap.getValue(compositeNewK))
        # else
        #      myMap(compositeNewK, 1)

      }
    }

    # here I want to transform myMap into a string containing the 3 words with the most occurrences
    # fromMapToString = convert(myMap)   (not implemented yet, so fromMapToString is undefined)
    keyval(k,fromMapToString)
}


wordcount = function(inputData,outputData=NULL){
  mapreduce(input = inputData,output = outputData,input.format = "text",map = customMapper,reduce = customReducer)
}


hdfs.data = file.path("/user/hduser","folder2")
hdfs.out  = file.path("/user/hduser","output1")

result = wordcount(hdfs.data,hdfs.out)
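
The counting that the reducer is supposed to do can be tried out in plain R before involving Hadoop at all. The following is only a sketch under assumptions: `map_line` and `top_words` are hypothetical helper names (not rmr2 functions), the input is the four sample rows from above, and `split()` stands in for the shuffle phase that groups values by key. It uses `table()` in place of a Java-style Map.

```r
# Hypothetical helpers, not part of rmr2: plain-R versions of the map and
# reduce steps so the logic can be checked without a cluster.

# Map step: turn one input line into (year-month, word) pairs.
map_line <- function(line) {
  fields <- trimws(strsplit(line, ",", fixed = TRUE)[[1]])
  year_month <- substr(fields[1], 1, 7)   # "2016-05-10" -> "2016-05"
  words <- fields[-1]
  data.frame(key = rep(year_month, length(words)), word = words,
             stringsAsFactors = FALSE)
}

# Reduce step: count the words seen for one key and keep the n most frequent.
top_words <- function(words, n = 3) {
  counts <- sort(table(words), decreasing = TRUE)
  head(counts, n)
}

lines <- c("2016-05-10, A, B, C, A, R, E, F, E",
           "2016-05-18, A, B, F, E, E",
           "2016-06-01, A, B, K, T, T, E, G, E, A, N",
           "2016-06-03, A, B, K, T, T, E, F, E, L, T")

pairs <- do.call(rbind, lapply(lines, map_line))

# Grouping the words by key stands in for Hadoop's shuffle phase.
result <- lapply(split(pairs$word, pairs$key), top_words)

# Print one "key: word count" line per retained word.
for (key in names(result)) {
  top <- result[[key]]
  cat(sprintf("%s: %s %d\n", key, names(top), as.integer(top)), sep = "")
}
```

With the sample rows, the two most frequent words for 2016-05 are E (4) and A (3), which matches the desired output above. Inside an rmr2 reducer the same `table()` idea could presumably be applied to `unlist(v)`, with the result pasted into a single string before calling `keyval`.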

Why do you need this `rmr2` library? Hadoop streaming reads from standard in and writes to standard out.... In other words, you can do all this entirely without hadoop. `cat input.txt | mapper.r | sort -k1,1 | reducer.r` (taken from here http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/) — OneCricketeer, Jul 10 '17 at 20:11
Does the mapper work? If you are looking for a HashMap Java equivalent, there are `hashes` in R. https://cran.r-project.org/web/packages/hashmap/README.html — OneCricketeer, Jul 10 '17 at 20:20
Actually, I'm not sure about it, because the print function seems to have problems executing. I don't know how to log from my functions — GIULIO, Jul 10 '17 at 21:01
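
On the logging problem: under Hadoop streaming, the standard output of the mapper and reducer is the data channel, so anything `print` writes there is likely to be mixed into the records and cause errors. A common workaround is to write debug messages to standard error instead, which normally ends up in the task logs. The sketch below assumes a hypothetical helper named `log_debug` (it is not an rmr2 function); it relies only on base R's `cat()` and `stderr()`.

```r
# Hypothetical helper, not an rmr2 function: write debug messages to stderr
# so they do not interfere with the records written to stdout.
log_debug <- function(..., con = stderr()) {
  cat(format(Sys.time(), "%H:%M:%S"), "DEBUG", ..., "\n", file = con)
}

# Example call, as it might appear at the top of a mapper.
log_debug("mapper received", 4, "lines")
```

For step-by-step debugging it may also be worth switching rmr2 to its local backend with `rmr.options(backend = "local")`, so the map and reduce functions run in the current R session on a small sample file rather than on the cluster.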
