3

I am working with the R programming language.

I have the following dataset:

v <- c(1,2,3,4,5,6,7,8,9,10)

var_1 <- as.factor(sample(v, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))

var_2 <- as.factor(sample(v, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))

var_3 <- as.factor(sample(v, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))

var_4 <- as.factor(sample(v, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))

var_5 <- as.factor(sample(v, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))

my_data = data.frame(var_1, var_2, var_3, var_4, var_5)

I also have another dataset of "conditions" that will be used for querying this data frame:

conditions = data.frame(cond_1 = c("1,3,4", "4,5,6"), cond_2 = c("5,6", "7,8,9"))

My Question: I tried to run the following command to select rows from "my_data" based on the first row of "conditions" - but this returns an empty result:

my_data[my_data$var_1 %in% unlist(conditions[1,1]) &
            my_data$var_2 %in% unlist(conditions[1,2]), ]

[1] var_1 var_2 var_3 var_4 var_5
<0 rows> (or 0-length row.names)

I tried to look more into this by "inspecting" these conditions:

class(conditions[1,1])
[1] "character"

This makes me think that the "unlist()" command is not working because the conditions themselves are a "character" instead of a "list".

Is there an equivalent command that can be used here that plays the same role as the "unlist()" command so that the above statement can be run?

In general, I am trying to produce the same results as I would have gotten from this code - but keeping the format I was using above:

my_data[my_data$var_1 %in% c("1", "3", "4") &
            my_data$var_2 %in% c("5", "6"), ]

Thanks!

Reference: Selecting Rows of Data Based on Multiple Conditions

jay.sf
  • 60,139
  • 8
  • 53
  • 110
stats_noob
  • 5,401
  • 4
  • 27
  • 83

3 Answers3

3

Up front, "1,3,4" != 1. It seems you should look to split the strings using strsplit(., ",").

expected <- my_data[my_data$var_1 %in% c("1", "3", "4") & my_data$var_2 %in% c("5", "6"), ]
head(expected)
#     var_1 var_2 var_3 var_4 var_5
# 18      3     6     2     2     9
# 129     3     5     3     2     8
# 133     4     5     6     5     8
# 186     1     6     6    10    10
# 204     4     6     4     2     6
# 207     1     5     3     2     9

out <- my_data[do.call(`&`, 
  Map(`%in%`,
      lapply(my_data[,1:2], as.character), 
      lapply(conditions, function(z) strsplit(z, ",")[[1]]))),]
head(out)
#     var_1 var_2 var_3 var_4 var_5
# 18      3     6     2     2     9
# 129     3     5     3     2     8
# 133     4     5     6     5     8
# 186     1     6     6    10    10
# 204     4     6     4     2     6
# 207     1     5     3     2     9

Edit: update for new conditions: change do.call to Reduce:

conditions = data.frame(cond_1 = c("1,3,4", "4,5,6"), cond_2 = c("5,6", "7,8,9"), cond_3 = c("4,6", "9"))
out <- my_data[Reduce(`&`,
  Map(`%in%`,
      lapply(my_data[,1:3], as.character),
      lapply(conditions, function(z) strsplit(z, ",")[[1]]))),]
head(out)
#     var_1 var_2 var_3 var_4 var_5
# 133     4     5     6     5     8
# 186     1     6     6    10    10
# 204     4     6     4     2     6
# 232     1     5     6     5     8
# 332     3     6     6     5    10
# 338     1     5     6     3     6
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thank you so much for your answer! I am trying to see if your answer works if the number of columns in the "conditions" data frame was more than 2. For example: – stats_noob Apr 10 '22 at 05:33
  • conditions = data.frame(cond_1 = c("1,3,4", "4,5,6"), cond_2 = c("5,6", "7,8,9"), cond_3 = c("4,6", "9")) – stats_noob Apr 10 '22 at 05:33
  • out <- my_data[do.call(`&`, Map(`%in%`, lapply(my_data[,1:3], as.character), lapply(conditions, function(z) strsplit(z, ",")[[1]]))),] – stats_noob Apr 10 '22 at 05:33
  • Error in .Primitive("&")(var_1 = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, : 3 arguments passed to '&' which requires 2 – stats_noob Apr 10 '22 at 05:33
  • Can the code that you provided work for multiple conditions stored in multiple columns? Thank you so much! – stats_noob Apr 10 '22 at 05:34
  • 1
    See my edit. Use `Reduce` instead. – r2evans Apr 10 '22 at 05:36
  • @ r2evans: this works now! I even added a 4 condition column and tried your code with 4 conditions (lapply(my_data[,1:4]) ... and everything is working! I wonder why the earlier code doesn't generalize as well compared to the edited code - do you have any idea why this might be happening? Thank you so much for all your help! – stats_noob Apr 10 '22 at 05:39
  • Well, first and foremost, `"1,3,4"` is not the same as what you have in your columns. It needs to be split into individual components to be compared to anything else. – r2evans Apr 10 '22 at 06:08
1

Too long for a comment, but since I answered your previous question, I noticed an important difference in how your conditions data frames look like. Perhaps that is the reason for your confusion.

I assume, you want/need lists in the cells but in this version you have strings that need to be split as @r2evans showed you in a great answer.

To create the conditions data frame that resembles more that one of your previous question you may use list2DF (also see this related answer).

conditions1 <- list2DF(list(cond_1=list(c(1, 3, 4), c(4, 5, 6)), 
                            cond_2=list(c(5, 6), c(7, 8, 9))))
conditions1
#    cond_1  cond_2
# 1 1, 3, 4    5, 6
# 2 4, 5, 6 7, 8, 9

where:

str(conditions1)
# 'data.frame': 2 obs. of  2 variables:
#   $ cond_1:List of 2
# ..$ : num  1 3 4
# ..$ : num  4 5 6
# $ cond_2:List of 2
# ..$ : num  5 6
# ..$ : num  7 8 9

Your conditions looks similar

conditions
#   cond_1 cond_2
# 1  1,3,4    5,6
# 2  4,5,6  7,8,9

but:

str(conditions)
# 'data.frame': 2 obs. of  2 variables:
# $ cond_1: chr  "1,3,4" "4,5,6"
# $ cond_2: chr  "5,6" "7,8,9"
jay.sf
  • 60,139
  • 8
  • 53
  • 110
1

Use the data generated in the Note at the end. It uses set.seed to make it reproducible and also uses a reduced number of rows as the point can be illustrated just as well with fewer.

1) sqldf Generate the condition string cond which for this example is "var_1 in (1,3,4) and var_2 in (5,6)" and then insert it into an SQL statement which we run.

library(sqldf)

nc <- ncol(conditions)    
s <- sprintf("%s in (%s)", names(my_data)[1:nc], conditions[1, ])
cond <- paste(s, collapse = " and ")

fn$sqldf("select * from my_data where $cond")
##   var_1 var_2 var_3 var_4 var_5
## 1     1     6     8     6     1
## 2     4     6    10     8     7

# check

my_data[my_data$var_1 %in% c("1", "3", "4") & my_data$var_2 %in% c("5", "6"), ]
##    var_1 var_2 var_3 var_4 var_5
## 11     1     6     8     6     1
## 17     4     6    10     8     7

2) subset Generate the appropriate condition cond string which in this case equals "var_1 %in% c(1,3,4) & var_2 %in% c(5,6)" and then parse and run subset on it using do.call.

nc <- ncol(conditions)
s <- sprintf("%s %%in%% c(%s)", names(my_data)[1:nc], conditions[1, ])
cond <- paste(s, collapse = " & ")

do.call("subset", list(my_data, parse(text = cond)))

##    var_1 var_2 var_3 var_4 var_5
## 11     1     6     8     6     1
## 17     4     6    10     8     7

Note

set.seed(123)

conditions <- data.frame(cond_1 = c("1,3,4", "4,5,6"), 
                         cond_2 = c("5,6", "7,8,9"))

n <- 10   # sample from 1:n
nr <- 25  # no of rows
nc <- 5   # no of columns

prob <- rep(0.1, n)
nms <- paste0("var_", 1:nc)  # column names

L <- Map(function(x) as.factor(sample(n, nr, TRUE, prob)), nms)
my_data <- as.data.frame(L)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341