I am trying to understand the reducer.R code taken from the following website:
http://www.thecloudavenue.com/2013/10/mapreduce-programming-in-r-using-hadoop.html
This code is used for Hadoop Streaming with R.
The code is given below:
#! /usr/bin/env Rscript
# reducer.R - Wordcount program in R
# script for Reducer (R-Hadoop integration)
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitLine <- function(line) {
    val <- unlist(strsplit(line, "\t"))
    list(word = val[1], count = as.integer(val[2]))
}
env <- new.env(hash = TRUE)
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    split <- splitLine(line)
    word <- split$word
    count <- split$count
    if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
    }
    else assign(word, count, envir = env)
}
close(con)
for (w in ls(env, all = TRUE))
    cat(w, "\t", get(w, envir = env), "\n", sep = "")
Could someone explain the purpose of the following new.env call, and of how env is then used in the rest of the code?
env <- new.env(hash = TRUE)
Why is this required? What happens if this is not included in the code?
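To make the question concrete, here is a minimal sketch of how I understand env to be used in the script: as a mutable lookup table keyed by word. The helper add_count, the variable counts, and the sample words are made up for illustration and are not part of reducer.R.

```r
# Minimal sketch: an environment used as a hash table of word -> count.
# add_count() and counts are hypothetical names, not part of reducer.R.
counts <- new.env(hash = TRUE)

add_count <- function(word, count, env) {
  if (exists(word, envir = env, inherits = FALSE)) {
    # Word already seen: add to the stored total.
    assign(word, get(word, envir = env) + count, envir = env)
  } else {
    # First occurrence: store the initial count.
    assign(word, count, envir = env)
  }
  invisible(NULL)
}

# Environments are modified in place, so counts changes without reassignment.
add_count("hadoop", 1L, counts)
add_count("r", 1L, counts)
add_count("hadoop", 2L, counts)

# Print each word with its accumulated total, tab-separated.
for (w in sort(ls(counts, all.names = TRUE))) {
  cat(w, "\t", get(w, envir = counts), "\n", sep = "")
}
```

If this reading is right, the environment only serves to accumulate totals per word regardless of the order in which the lines arrive.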
Update 06/05/2014
I tried writing another version of this code that does not define a new environment. The code is as follows:
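#! /usr/bin/env Rscript
# Reducer without an environment: only merges adjacent lines with the same word.
current_word <- ""
current_count <- 0
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line1 <- gsub("(^ +)|( +$)", "", line)
    # Split the line once and reuse the fields.
    fields <- unlist(strsplit(line1, "[[:space:]]+"))
    word <- fields[[1]]
    count <- as.numeric(fields[[2]])
    if (current_word == word) {
        current_count <- current_count + count
    } else {
        # The word changed: emit the finished total, then start the next one.
        if (current_word != "") {
            # sep = "" avoids the extra spaces cat() would otherwise add around the tab.
            cat(current_word, "\t", current_count, "\n", sep = "")
        }
        current_count <- count
        current_word <- word
    }
}
# Emit the total for the last word, but only if some input was read.
if (current_word != "") {
    cat(current_word, "\t", current_count, "\n", sep = "")
}
close(con)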
This code gives the same output as the one with a new environment defined.
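One caveat I noticed: the second version only merges adjacent lines, so the outputs match only when the input is sorted by word. As far as I understand, Hadoop Streaming sorts the mapper output by key before the reduce phase, so this holds in practice. Below is a rough sketch of the comparison, with both approaches rewritten as functions over a character vector so they can be run without stdin. The function names reduce_env and reduce_adjacent and the sample inputs are hypothetical.

```r
# Hypothetical helpers: both reducers rewritten as functions over a character
# vector of "word<TAB>count" lines, returning the output lines.
reduce_env <- function(lines) {
  env <- new.env(hash = TRUE)
  for (line in lines) {
    val <- unlist(strsplit(line, "\t"))
    word <- val[1]
    count <- as.integer(val[2])
    old <- if (exists(word, envir = env, inherits = FALSE)) get(word, envir = env) else 0L
    assign(word, old + count, envir = env)
  }
  words <- ls(env, all.names = TRUE)
  totals <- vapply(words, function(w) get(w, envir = env), integer(1), USE.NAMES = FALSE)
  sprintf("%s\t%d", words, totals)
}

reduce_adjacent <- function(lines) {
  out <- character(0)
  current_word <- ""
  current_count <- 0
  for (line in lines) {
    parts <- unlist(strsplit(line, "[[:space:]]+"))
    word <- parts[1]
    count <- as.numeric(parts[2])
    if (current_word == word) {
      current_count <- current_count + count
    } else {
      # Word changed: record the finished total and start a new one.
      if (current_word != "") {
        out <- c(out, paste0(current_word, "\t", current_count))
      }
      current_word <- word
      current_count <- count
    }
  }
  # Record the total for the last word, if there was any input.
  if (current_word != "") {
    out <- c(out, paste0(current_word, "\t", current_count))
  }
  out
}

sorted_input   <- c("a\t1", "a\t1", "b\t1")
unsorted_input <- c("a\t1", "b\t1", "a\t1")

# Same result on sorted input; different result when equal words are not adjacent.
print(identical(reduce_env(sorted_input), reduce_adjacent(sorted_input)))
print(identical(reduce_env(unsorted_input), reduce_adjacent(unsorted_input)))
```

So the environment-based version does not depend on the input order, while mine does.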
Question: Does using a new environment provide any advantage from a Hadoop standpoint? Is there a reason to use it in this specific case?
Thank you.