2

I'm trying to figure out the best way how to use multiple variable filters in R.

Usually have up to 100 variables (in one condition) and need to filter cases where ANY of these variables satisfies the same condition (e.g. VARx=170). The names and numbers of variables often differ and are entered as a string to be evaluated. This is a bottleneck of my whole computation.

Example (filter Varx=37):

id <- c(1:100000) 
x1 <- sample(1:100, 100000, replace=T)
x2 <- sample(1:100, 100000, replace=T)
x3 <- sample(1:100, 100000, replace=T)
x4 <- sample(1:100, 100000, replace=T)
x5 <- sample(1:100, 100000, replace=T)
x6 <- sample(1:100, 100000, replace=T)
x7 <- sample(1:100, 100000, replace=T)
x8 <- sample(1:100, 100000, replace=T)
x9 <- sample(1:100, 100000, replace=T)
x10 <- sample(1:100, 100000, replace=T)

df<-data.frame(id,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10)
dt<-data.table(df)

pm<-proc.time()
vys<-((x1==37) | (x2==37) | (x3==37) | (x4==37) | (x5==37) | (x6==37) | (x7==37) | (x8==37) | (x9==37) | (x10==37))
proc.time() - pm

pm<-proc.time()
vys<-((rowSums(subset(df,select=c(x1:x10))==37)>0))
proc.time() - pm

The first statement needs less time but is more difficult to prepare and longer. The second slower, yet more concise. I have tried to incorporate data.table in my computation but without success (i.e. without getting better computation times).

Do I miss a better way how to do this filtering?

(Changing the data structure or coding of the variables might be, of course, a solution. Still I would like to examine this kind of multiple filtering).

Martin
  • 165
  • 1
  • 9

3 Answers3

2

We could use Reduce with lapply

vys1 <-  dt[, Reduce('|', lapply(.SD, '==', 37)), .SDcols= x1:x10]
identical(as.vector(vys), vys1)
#[1] TRUE

Based on the same sort of benchmarks used

pm<-proc.time()
vys<-((x1==37) | (x2==37) | (x3==37) | (x4==37) | (x5==37) | (x6==37) | (x7==37) | (x8==37) | (x9==37) | (x10==37))
proc.time() - pm
#  user  system elapsed 
#  0.05    0.13    0.93 

pm<-proc.time()
vys1 <-  dt[, Reduce('|', lapply(.SD, '==', 37)), .SDcols= x1:x10]
proc.time() - pm
#   user  system elapsed 
#  0.05    0.03    0.08 
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Is there a quick way to use different values for different variables? For example, if I want to filter as follows: ``x1 == 37 & x2 == 99`` – johnny May 23 '21 at 01:50
  • 1
    @johnny For that you can use `Map` i..e `Reduce('|', Map('==', .SD, c(37, 99)))` assuming `.SD` have only two columns x1, x2 – akrun May 23 '21 at 01:52
  • 1
    Wow, `Map` works exactly as I had hoped. Thank you so much! – johnny May 23 '21 at 01:59
1

I think your second statement in base R is OK, just try it with [ instead of subset:

rowSums(df[sprintf("x%d", 1:10)]==37) > 0

Benchmarks:

library(microbenchmark)
microbenchmark( times = 20, 
  subset = {((rowSums(subset(df,select=c(x1:x10))==37)>0))},
  dt_reduce = {dt[, Reduce('|', lapply(.SD, '==', 37)), .SDcols= x1:x10]},
  base_r = {rowSums(df[sprintf("x%d", 1:10)]==37) > 0}
)

# Unit: milliseconds
#       expr      min       lq     mean   median        uq       max neval
#     subset 82.74922 88.63819 99.69935 91.18369 110.24876 134.06550    20
#  dt_reduce 25.78002 28.62765 32.73945 28.89021  29.12712  71.25822    20
#     base_r 21.52504 24.27624 27.03380 25.83219  26.24400  65.38550    20
bergant
  • 7,122
  • 1
  • 20
  • 24
  • The lapply one is a bit slower than the base_r, it, however, enables to change '==' for '%in%' and use with a range of values as well. – Martin Nov 24 '15 at 15:11
0

You're looking for a function that works on every row of your dataframe. That's what "apply" is doing. It's equally fast as the solution of others, but easy to handle:

system.time(
((x1==37) | (x2==37) | (x3==37) | (x4==37) | (x5==37) | (x6==37) | (x7==37) | (x8==37) | (x9==37) | (x10==37))
 )
# user  system elapsed 
# 0.02    0.00    0.02 

 system.time(
    apply(df, 1 , function(x) any(x[2:11]==37))
 )
# user  system elapsed 
# 0.59    0.00    0.61 

Although you don't ask for changing data structure, I recommend have a look at tidy data. With a rearranged version of your dataframe you can do filterings efficient and easy to handle:

library(tidyr)
df2 = gather(df, key, value, -id)

system.time(
    select(filter(df, value==37), id)
)
#   user  system elapsed 
#   0.02    0.00    0.02
MarkusN
  • 3,051
  • 1
  • 18
  • 26