Subset in list of tuples (multiple columns) with %in% clause

Question

I would like to subset data.table using list of tuples (multiple columns) from another data.table, but not sure how.

From subsetting using single column

DT1[col1 %in% DT2(col_1)]

what I tried was

DT1[c(col1, col2) %in% DT2(col_1, col_2)]

albeit not successful. The error is

i evaluates to a logical vector length 91369852 but there are 45684926
rows. Recycling of logical i is no longer allowed as it hides more
bugs than is worth the rare convenience. Explicitly use
rep(...,length=.N) if you really need to recycle.

Any ideas? If %in% is not the correct method, how would you solve this problem?

Could you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? — Emmanuel-Lin, Jul 03 '18 at 09:18

Emmanuel-Lin · Accepted Answer · 2018-07-03T17:07:18.447

1

What you are doing is making 2 boolean for each line, so you have this error and not performing what you which. So indeed %in% is not the way to do it.

You should make it into a double condition with a and:

I make a reproducible example:

DT1 = as.data.table(data.frame(col1 = c(1,2,3,2,5,1,3,3,1,2), 
                               col2 = c(3,4,5,4,3,4,5,3,4,5), 
                               col3 = c(1,2,3,4,5,6,7,8,9,10))) 

DT2 = as.data.table(data.frame(col1 = c(1,2,1,2,3,4,3,2,4,3), 
                               col2 = c(3,4,5,3,6,4,5,4,3,4), 
                               col3=c(11,12,13,14,15,16,17,18,19,20)))

EDIT: based on remark I correct my answer (This was more trick than what I thought).

I create a filter function which will help me check if there are any matches in DT2

filter <- function(x){
  any(x[1] == DT2[["col1"]] & x[2] == DT2[["col2"]])
}

I apply this function on each row of DT1

indexes = apply(DT1, 1, filter)

I filter

> DT1[indexes, ]
   col1 col2 col3
1:    1    3    1
2:    2    4    2
3:    3    5    3
4:    2    4    4
5:    3    5    7

edited Jul 03 '18 at 17:07

answered Jul 03 '18 at 09:27

Emmanuel-Lin

1,848
1
16
31

Hi Emmanuel. Thank you for the tips. I just realized that this approach will only return first match ! Try this `DT1 = as.data.table(data.frame(col1 = c(1,2,3,2,5,1,3,3,1,2), col2 = c(3,4,5,4,3,4,5,3,4,5), col3 = c(1,2,3,4,5,6,7,8,9,10))) DT2 = as.data.table(data.frame(col1 = c(1,2,1,2,3,4,3,2,4,3), col2 = c(3,4,5,3,6,4,5,4,3,4), col3=c(11,12,13,14,15,16,17,18,19,20))) DT1[col1 == DT2$col1 & col2 == DT2$col2]` , rows (2,4) and (3,5) should appear more than once – Kenny Jul 03 '18 at 15:58
You are right, I edited my answer. This is why providing at the begging reproducible exmaple helps to answer corretly. – Emmanuel-Lin Jul 03 '18 at 17:07
Thanks Emmanuel. The original example is huge and so cubersome that it would not allow us to detect this error. I think exceptions like this can only be reproduced if you are really looking for it. – Kenny Jul 04 '18 at 08:48
By the way, this method is not sustainable as it is, for my data table of 41 columns, 45million rows and 17GB in size. Any suggestion for optimizing the performance ? – Kenny Jul 05 '18 at 16:42

Subset in list of tuples (multiple columns) with %in% clause

1 Answers1