Optimizing ifelse on a large data frame

Question

I have a data frame df that looks like this:

The line of code below adds a new column and fills it according to the values in columns A and B.

df$Mean.Result1 <- ifelse(df[, "A"] > 0.05 & df[, "B"] > 0.05, "Equal", "")

I am using R with Splunk, and R in Splunk is not able to recognize the above format.

Is it right to do:

df.$Mean.Result1 <- ifelse(df.$A > 0.05 & df$B > 0.05, "Equal", "")

How are the two pieces of code different? Will it affect the speed of computation? My actual dataset has around 500 million rows and 400 columns.

If you have 500 million rows, it would be _much, much_ more efficient, both memory-wise and computation-wise, to do `df.$Mean.Result1 <- ifelse(df.$A > 0.05 & df.$B > 0.05, 1L, 0L)` and then `df.$Mean.Result1 <- factor(df.$Mean.Result1, levels=c(1L, 0L), labels=c("Equal", ""))`. You will reduce the size of your table drastically, and all operations involving `df.$Mean.Result1` will be _much_ faster. Avoid strings as much as you can; R does not handle them efficiently. – asachet, Oct 05 '15 at 17:38
`df[, "A"]` is equivalent to `df$A`, not `df.$A`. The two pieces of code are different because one uses the variable `df` and the other uses `df.`. Choosing between `df[, "A"]` and `df$A` makes no difference to the computation cost. – asachet, Oct 05 '15 at 17:41
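
To make the factor suggestion in the comments above concrete, here is a minimal sketch on a much smaller, made-up sample. The vectors `A` and `B`, the size `n`, and the names `res_chr` and `res_fct` are assumptions for illustration only. On a typical 64-bit R build the factor should come out at roughly half the size of the character vector, but the exact figures depend on platform and R version.

```r
# Small stand-in for the 500 million rows in the question (made-up data)
n <- 1e5
set.seed(1)
A <- runif(n)
B <- runif(n)

# Character result, as in the question
res_chr <- ifelse(A > 0.05 & B > 0.05, "Equal", "")

# Integer flag converted to a factor, as suggested in the comment
flag <- ifelse(A > 0.05 & B > 0.05, 1L, 0L)
res_fct <- factor(flag, levels = c(1L, 0L), labels = c("Equal", ""))

# Compare memory footprints of the two representations
print(object.size(res_chr))
print(object.size(res_fct))
```

Both vectors carry the same labels; only the storage differs (integer codes plus a two-entry level table instead of one string reference per row).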

score 3 · Accepted Answer · edited May 23 '17 at 11:45

There has been some discussion about how ifelse is not the best option for code where speed is an important factor. You might instead try:

df$Mean.Result1 <- c("", "Equal")[(df$A > 0.05 & df$B > 0.05)+1]

To see what's going on here, let's break down the command. df$A > 0.05 & df$B > 0.05 returns TRUE if both A and B exceed 0.05, and FALSE otherwise. Adding 1 coerces the logical values to numbers, so (df$A > 0.05 & df$B > 0.05)+1 returns 2 if both A and B exceed 0.05 and 1 otherwise. These are used as indices into the vector c("", "Equal"), so we get "Equal" when both exceed 0.05 and "" otherwise.
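
As a small illustration of the steps just described, here is a sketch on a three-row toy data frame. The name `toy` and its values are made up for this example and are not taken from the question.

```r
# Hypothetical toy data, invented purely for illustration
toy <- data.frame(A = c(0.10, 0.01, 0.50), B = c(0.20, 0.30, 0.02))

# Logical vector: TRUE only where both columns exceed 0.05
both <- toy$A > 0.05 & toy$B > 0.05

# Adding 1 coerces FALSE/TRUE to 1/2, valid positions in a length-2 vector
idx <- both + 1

# Look up the label for each row by position
labels <- c("", "Equal")[idx]
print(data.frame(toy, both, idx, labels))
```

Only the first row has both values above 0.05, so it is the only one that picks up position 2 and the label "Equal".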

Here's a comparison on a data frame with 1 million rows:

# Build dataset and functions
set.seed(144)
big.df <- data.frame(A = runif(1000000), B = runif(1000000))
OP <- function(df) {
  df$Mean.Result1 <- ifelse(df$A > 0.05 & df$B > 0.05, "Equal", "")
  df
}
josilber <- function(df) {
  df$Mean.Result1 <- c("", "Equal")[(df$A > 0.05 & df$B > 0.05)+1]
  df
}
all.equal(OP(big.df), josilber(big.df))
# [1] TRUE

# Benchmark
library(microbenchmark)
microbenchmark(OP(big.df), josilber(big.df))
# Unit: milliseconds
#              expr      min        lq      mean    median        uq      max neval
#        OP(big.df) 299.6265 311.56167 352.26841 318.51825 348.09461 540.0971   100
#  josilber(big.df)  40.4256  48.66967  60.72864  53.18471  59.72079 267.3886   100

The approach with vector indexing is about 6x faster in median runtime.

answered Oct 05 '15 at 17:51 by josliber (edited May 23 '17 at 11:45 by Community)

thanks for the quick and crisp explanation, it is really really helpful. – kRazzy R Oct 05 '15 at 17:58
@josilber, very clever! – Jacob H Oct 05 '15 at 20:58
Very nice. @kRazzyR If memory is an issue, consider casting to factor; you will cut the size of the object by half. – asachet Oct 07 '15 at 12:03
@josilber, could you be kind enough to explain the line of code `df$Mean.Result1 <- c("", "Equal")[(df$A > 0.05 & df$B > 0.05)+1]`? – kRazzy R Oct 13 '15 at 22:48
@kRazzyR I've added a paragraph of explanation – josliber Oct 14 '15 at 02:46
@josilber thanks. Based on what you have explained, I have tried this:

set.seed(144)
big.df <- data.frame(A = runif(1000000), B = runif(1000000))
OP <- function(df) {
  df$Mean.Result1 <- ifelse(df$A > 0.05 & df$B > 0.05, "Equal", "")
  df
}
abc <- function(df) {
  if(df$A>0.05 && df$B>0.05) {
    df$Mean.Result1<-"Equal"
  }else {
    df$Mean.Result1<-""
  }
  df
}
all.equal(OP(big.df), abc(big.df))
# Benchmark
library(microbenchmark)
microbenchmark(OP(big.df), abc(big.df))

– kRazzy R Oct 14 '15 at 21:57
Though I get the results of the microbenchmark as:

# Unit: milliseconds
#         expr        min        lq      mean   median        uq      max neval cld
#   OP(big.df) 441.378891 471.13532 538.02689 481.7621 495.85562 3191.401   100   b
#  abc(big.df)   9.891185  12.59788  42.31095  14.8395  15.16461 2738.666   100  a

After the line `all.equal . . .` R outputs "Component “Mean.Result1”: 902737 string mismatches". What does it mean, and how do I resolve it? – kRazzy R Oct 14 '15 at 22:01
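
The thread does not answer this last comment, so the following is only one likely reading. `if` with `&&` is not vectorized: in the R versions current at the time it looked only at the first element of each vector (recent R versions raise an error instead), so `abc` writes one value into every row. That would explain both why it runs so fast and why it disagrees with `ifelse`. A count of 902737 mismatches is consistent with `abc` filling every row with "" while roughly 0.95 × 0.95 ≈ 90% of the rows should be "Equal". The sketch below reproduces the effect on a made-up three-row data frame (`toy` and its values are assumptions), testing the first row explicitly so that it runs on any R version.

```r
# Hypothetical toy data; the first row fails the condition on purpose
toy <- data.frame(A = c(0.01, 0.10, 0.50), B = c(0.20, 0.30, 0.40))

# if() needs a single TRUE/FALSE, so only the first row is tested here
cond_first <- toy$A[1] > 0.05 && toy$B[1] > 0.05

# Assigning a length-one value recycles it over every row
toy$scalar <- if (cond_first) "Equal" else ""

# The vectorized form evaluates the condition row by row
toy$vectorized <- ifelse(toy$A > 0.05 & toy$B > 0.05, "Equal", "")

# Count the rows where the two columns disagree
print(sum(toy$scalar != toy$vectorized))
```

Under this reading, the fix is to keep a vectorized form, either `ifelse` with `&` or the indexing approach from the answer.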
