
I have a data frame df that looks like this:

A B C 
1 2 3
2 5 6
3 8 9

The line of code below adds a new column and fills it accordingly.

df$Mean.Result1 <- ifelse(df[, "A"] > 0.05 & df[, "B"] > 0.05, "Equal", "")

I am using R with Splunk, and R in Splunk is not able to recognize the above format.

Is it right to do:

df.$Mean.Result1 <- ifelse(df.$A > 0.05 & df$B > 0.05, "Equal", "")

How are the two pieces of code different? Will it affect the speed of computation? My actual dataset has around 500 million rows and 400 columns.

kRazzy R
  • If you have 500 million rows it would be _much, much_ more efficient, both memory-wise and computation-wise, to do `df.$Mean.Result1 <- ifelse(df.$A > 0.05 & df.$B > 0.05, 1L, 0L)` and then `df.$Mean.Result1 <- factor(df.$Mean.Result1, levels=c(1L, 0L), labels=c("Equal", ""))`. You will reduce the size of your table drastically and all operations involving `df.$Mean.Result1` will be _much_ faster. Avoid strings as much as you can; R does not handle them efficiently. – asachet Oct 05 '15 at 17:38
  • `df[, "A"]` is equivalent to `df$A`, not `df.$A`. The two pieces of code are different because one uses the variable `df` and the other uses `df.`. The first and second forms are strictly equivalent in terms of computation cost. – asachet Oct 05 '15 at 17:41
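
To make the two comments above concrete, here is a minimal sketch built on the small data frame from the question. It checks that `df[, "A"]` and `df$A` return the same vector, then applies the integer-flag-plus-factor idea from the first comment. The name `flag` is only an illustration and does not come from the thread.

```r
# The example data frame from the question.
df <- data.frame(A = c(1, 2, 3), B = c(2, 5, 8), C = c(3, 6, 9))

# Both forms extract the same column from the same object.
same_column <- identical(df[, "A"], df$A)
print(same_column)

# Store the result as an integer flag first, then label it as a factor,
# so the column holds integer codes rather than one string per row.
flag <- as.integer(df$A > 0.05 & df$B > 0.05)
df$Mean.Result1 <- factor(flag, levels = c(1L, 0L), labels = c("Equal", ""))

print(df)
```

Every row in this tiny example passes both tests, so the new column contains only "Equal"; the empty label is still kept as the second factor level.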

1 Answer


There has been some discussion about how ifelse is not the best option for code where speed is an important factor. You might instead try:

df$Mean.Result1 <- c("", "Equal")[(df$A > 0.05 & df$B > 0.05)+1]

To see what's going on here, let's break down the command. df$A > 0.05 & df$B > 0.05 returns TRUE for each row where both A and B exceed 0.05, and FALSE otherwise. Adding 1 coerces those logical values to numbers, so (df$A > 0.05 & df$B > 0.05)+1 returns 2 where both A and B exceed 0.05 and 1 otherwise. These are used as indices into the vector c("", "Equal"), so we get "Equal" when both exceed 0.05 and "" otherwise.
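
The same breakdown can be followed step by step on a tiny example. This is only a sketch: the data frame `toy` and the intermediate names `both`, `idx`, and `labels` are made up for illustration, with values chosen so that both outcomes appear.

```r
# Toy data: rows 1 and 2 each fail one of the tests, row 3 passes both.
toy <- data.frame(A = c(0.01, 0.50, 0.90), B = c(0.70, 0.02, 0.90))

# Step 1: element-wise logical test, one TRUE/FALSE per row.
both <- toy$A > 0.05 & toy$B > 0.05

# Step 2: adding 1 coerces FALSE/TRUE to 0/1, giving positions 1/2.
idx <- both + 1

# Step 3: use those positions to pick from the two-element lookup vector.
labels <- c("", "Equal")[idx]

print(both)
print(idx)
print(labels)
```

On this input the lookup gives the same labels as `ifelse(both, "Equal", "")`.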

Here's a comparison on a data frame with 1 million rows:

# Build dataset and functions
set.seed(144)
big.df <- data.frame(A = runif(1000000), B = runif(1000000))
OP <- function(df) {
  df$Mean.Result1 <- ifelse(df$A > 0.05 & df$B > 0.05, "Equal", "")
  df
}
josilber <- function(df) {
  df$Mean.Result1 <- c("", "Equal")[(df$A > 0.05 & df$B > 0.05)+1]
  df
}
all.equal(OP(big.df), josilber(big.df))
# [1] TRUE

# Benchmark
library(microbenchmark)
microbenchmark(OP(big.df), josilber(big.df))
# Unit: milliseconds
#              expr      min        lq      mean    median        uq      max neval
#        OP(big.df) 299.6265 311.56167 352.26841 318.51825 348.09461 540.0971   100
#  josilber(big.df)  40.4256  48.66967  60.72864  53.18471  59.72079 267.3886   100

The approach with vector indexing is about 6x faster in median runtime.

josliber
  • thanks for the quick and crisp explanation, it is really really helpful. – kRazzy R Oct 05 '15 at 17:58
  • @josilber, very clever! – Jacob H Oct 05 '15 at 20:58
  • Very nice. @kRazzyR If memory is an issue, consider casting to factor; you will cut the size of the object by half. – asachet Oct 07 '15 at 12:03
  • @josilber, could you be kind enough to explain the line of code `df$Mean.Result1 <- c("", "Equal")[(df$A > 0.05 & df$B > 0.05)+1]` – kRazzy R Oct 13 '15 at 22:48
  • @kRazzyR I've added a paragraph of explanation – josliber Oct 14 '15 at 02:46
  • @josilber thanks. Based on what you have explained I have tried this `set.seed(144) big.df <- data.frame(A = runif(1000000), B = runif(1000000)) OP <- function(df) { df$Mean.Result1 <- ifelse(df$A > 0.05 & df$B > 0.05, "Equal", "") df } abc <- function(df) { if(df$A>0.05 && df$B>0.05) { df$Mean.Result1<-"Equal" }else { df$Mean.Result1<-"" } df } all.equal(OP(big.df), abc(big.df)) # Benchmark library(microbenchmark) microbenchmark(OP(big.df), abc(big.df)) ` – kRazzy R Oct 14 '15 at 21:57
  • I do get microbenchmark results: *Unit: milliseconds expr min lq mean median uq max neval cld OP(big.df) 441.378891 471.13532 538.02689 481.7621 495.85562 3191.401 100 b abc(big.df) 9.891185 12.59788 42.31095 14.8395 15.16461 2738.666 100 a*. But after the line `all.equal . . .` R outputs *"Component “Mean.Result1”: 902737 string mismatches"*. What does it mean, and how do I resolve it? – kRazzy R Oct 14 '15 at 22:01
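
The mismatch message in the last comment most likely comes from `if (...)` with `&&` being a scalar test rather than an element-wise one. In R versions of that era, `&&` looked only at the first element of each operand, so `abc` made a single decision and recycled it to every row; recent R releases raise an error instead. The sketch below reproduces the effect on a made-up three-row data frame, writing the `[1]` explicitly so that it runs on any version. The column names `Scalar` and `Vectorized` are illustrative only.

```r
# Toy data: the first row fails the condition, the other two pass.
df <- data.frame(A = c(0.01, 0.50, 0.90), B = c(0.70, 0.80, 0.90))

# Scalar test on the first row only, which is what if/&& effectively did.
scalar_result <- if (df$A[1] > 0.05 && df$B[1] > 0.05) "Equal" else ""

# Assigning a single value recycles it to every row.
df$Scalar <- scalar_result

# Element-wise version: one test per row.
df$Vectorized <- c("", "Equal")[(df$A > 0.05 & df$B > 0.05) + 1]

print(df)

# Count the rows where the two columns disagree.
mismatches <- sum(df$Scalar != df$Vectorized)
print(mismatches)
```

With uniform random A and B, about 0.95 × 0.95 ≈ 90% of rows pass both tests, so a count of 902737 mismatches out of 1 million is consistent with the first row failing and every row being labelled "". Keeping the element-wise `&` together with `ifelse` or the indexing approach from the answer avoids the problem; the scalar version is faster in the benchmark only because it does far less work.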