I'm having an issue with a CSV dataset in HDFS when performing MapReduce with rmr2.
With only one file the MapReduce job works fine and produces no errors, but when there are two or more files in the same folder the results start to break down, as can be seen below:
From line 16 onwards the error starts and continues until the end of the file.
The MapReduce job used is:
calc = mapreduce(
  input = "hdfs://127.0.0.1:8020/user/cloudera/flumeFinal",
  input.format = make.input.format(format = "csv", sep = ",",
    col.names = col.names, stringsAsFactors = FALSE),
  # Map: emit the second column as the key with a count of 1
  map = function(k, lines) {
    k <- lines[2]
    return(keyval(k, 1))
  },
  # Reduce: sum the counts for each key
  reduce = function(k, lines) {
    keyval(k, sum(lines))
  }
)
Has anyone faced a similar issue who can help with this?
Thanks, Bruno