I am trying to understand the reducer.R code taken from the following website.

http://www.thecloudavenue.com/2013/10/mapreduce-programming-in-r-using-hadoop.html

This code is used for Hadoop Streaming with R.

I have given the code below:

    #! /usr/bin/env Rscript
    # reducer.R - Wordcount program in R
    # script for Reducer (R-Hadoop integration)

    trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

    splitLine <- function(line) {
      val <- unlist(strsplit(line, "\t"))
      list(word = val[1], count = as.integer(val[2]))
    }

    env <- new.env(hash = TRUE)
    con <- file("stdin", open = "r")

    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
      line <- trimWhiteSpace(line)
      split <- splitLine(line)
      word <- split$word
      count <- split$count

      if (exists(word, envir = env, inherits = FALSE)) {
        oldcount <- get(word, envir = env)
        assign(word, oldcount + count, envir = env)
      } else {
        assign(word, count, envir = env)
      }
    }
    close(con)

    for (w in ls(env, all = TRUE))
      cat(w, "\t", get(w, envir = env), "\n", sep = "")

Could someone explain the significance of the following new.env call and the subsequent use of env in the code?

    env <- new.env(hash = TRUE)

Why is this required? What happens if this is not included in the code?

Update 06/05/2014

I tried writing another version of this code without defining a new environment:

    #! /usr/bin/env Rscript
    current_word <- ""
    current_count <- 0
    word <- ""

    con <- file("stdin", open = "r")

    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
      line1 <- gsub("(^ +)|( +$)", "", line)
      fields <- unlist(strsplit(line1, "[[:space:]]+"))
      word <- fields[[1]]
      count <- as.numeric(fields[[2]])

      if (current_word == word) {
        current_count <- current_count + count
      } else {
        # Key changed: emit the finished word before starting the next one.
        if (current_word != "") {
          cat(current_word, "\t", current_count, "\n", sep = "")
        }
        current_count <- count
        current_word <- word
      }
    }

    # Emit the last word, if any input was read.
    if (current_word != "") {
      cat(current_word, "\t", current_count, "\n", sep = "")
    }

    close(con)

This code gives the same output as the one with a new environment defined.

Question: Does using a new environment provide any advantages from a Hadoop standpoint? Is there a reason for using it in this specific case?

Thank you.

Ravi

1 Answer

Your question is about environments in R. Here is example code for creating a new environment in R:

    > my.env <- new.env()
    > my.env
    <environment: 0x114a9d940>
    > ls(my.env)
    character(0)
    > assign("a", 999, envir=my.env)
    > my.env$foo = "This is the variable foo."
    > ls(my.env)
    [1] "a"   "foo"

I think this article can help you: http://www.r-bloggers.com/environments-in-r/. You can also run

    ?environment

for more help.

In the code you posted, the author creates a new environment:

    env <- new.env(hash = TRUE)

When assigning a value, the author specifies that environment:

    assign(word, oldcount + count, envir = env)
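
To make that pattern concrete, here is a minimal sketch of an environment used as a hash table keyed by word, the way the reducer uses it. The `add_count` helper and the sample `words` vector are made up for illustration and are not part of the original script:

```r
# Hypothetical helper: add `count` to the running total stored under `word`.
add_count <- function(env, word, count) {
  if (exists(word, envir = env, inherits = FALSE)) {
    assign(word, get(word, envir = env) + count, envir = env)
  } else {
    assign(word, count, envir = env)
  }
  invisible(env)
}

counts <- new.env(hash = TRUE)
words <- c("cat", "dog", "cat", "bird", "cat")  # illustrative, unsorted input
for (w in words) add_count(counts, w, 1L)

# ls() returns the stored names; each total is looked up by name.
for (w in ls(counts)) cat(w, "\t", get(w, envir = counts), "\n", sep = "")
```

Because the totals are looked up by name, the input does not have to arrive grouped by word for the counts to come out right.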

As for the question "What happens if this is not included in the code?", I think you can find the answer in the link I already provided.

The advantages of using a new environment in R are already covered in this link.

The reason is that in this case you are working with a large dataset. When you pass a dataset to a function and modify it there, R makes a copy, and the returned data then has to overwrite the old dataset. If you pass an environment instead, R works on that environment directly, without copying the large dataset.
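
A small sketch of that difference, assuming nothing beyond base R. The helper names `bump_list` and `bump_env` are hypothetical and only serve the illustration:

```r
# Hypothetical helpers for illustration only.
bump_list <- function(x) {
  x$n <- x$n + 1   # modifies a local copy of the list
  x
}
bump_env <- function(e) {
  e$n <- e$n + 1   # modifies the caller's environment in place
  invisible(NULL)
}

lst <- list(n = 0)
ignored <- bump_list(lst)  # the modified copy is returned, lst itself is untouched
print(lst$n)               # 0

env <- new.env()
env$n <- 0
bump_env(env)
print(env$n)               # 1
```

With the list, the caller would have to write `lst <- bump_list(lst)` to see the change. With the environment, the update is visible without reassigning anything.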

rischan
  • Thanks, rischan. The link you provide helped. I have posted a different version of the code without defining a new environment and the code seems to work fine. However, I am still unable to figure out the advantage of using a new environment. – Ravi Jun 05 '14 at 14:35
  • @Ravi I already edit the answer, about the advantage using new env, pls see the link, thank you :) – rischan Jun 05 '14 at 16:43
  • Thanks rischan, The link helps – Ravi Jun 10 '14 at 07:45