Subset data.table by another data.table without merging all columns

Question

With data.table, I'd like to subset a table according to another one.

DT1 <- data.table( A = c(1:5,1:2), B = c(1,1,1,2,2,2,3), C = c(1:3))
DT2 <- data.table( A = c(1, 2, 3), B = c(1,3,1), D = c(4:6))

The output should contain only columns of the first table.

With the merge function I get columns of both table

merge(DT1, DT2, by = c("A", "B"))

       A B C D
    1: 1 1 1 4
    2: 2 3 1 5
    3: 3 1 3 6

Since my datasets have much more columns, I'm looking for a way to get the result without the merge function, but with the data.table syntax [i, j, by]. However, with subsetting the first table it doesn't merge correctly.

DT1[A %in% DT2[,A] & B %in% DT2[, B]]

       A B C
    1: 1 1 1
    2: 2 1 2    < wrong
    3: 3 1 3
    4: 2 3 1

DT1[A == DT2[,A] & B == DT2[, B]]
       A B C
    1: 1 1 1
    2: 3 1 3
    Warning messages:
    1: In A == DT2[, A] :
      longer object length is not a multiple of shorter object length
    2: In B == DT2[, B] :
      longer object length is not a multiple of shorter object length

Is there a way to get the correct rows like with merge, but using the data.table Syntax [ ]?

score 2 · Answer 1 · answered May 04 '17 at 06:07

2

We do a join and then subset

nm1 <- names(DT1)
DT1[DT2, on = .(A, B)][, ..nm1]
#   A B C
#1: 1 1 1
#2: 2 3 1
#3: 3 1 3

Also, this is a case of semi_join from dplyr

dplyr::semi_join(DT1, DT2, by = c('A', 'B'))
#  A B C
#1 1 1 1
#2 2 3 1
#3 3 1 3

answered May 04 '17 at 06:07

akrun

874,273
37
540
662

1

Or `DT1[DT1[DT2, on=.(A,B), which=TRUE, nomatch=0]]` – Frank May 04 '17 at 06:10
1

@Frank Never thought about the `which` route. Great option. If you want to post as an answer, go ahead – akrun May 04 '17 at 06:11
smart way to join without keys. The on= argument is new for me. Thanks! How would it look like if the 2nd table has different column names like `DT3 <- data.table( E = c(1, 2, 3), F = c(1,3,1), D = c(4:6))` – visu-l May 04 '17 at 06:40
1

@visu-l In that case, you would od `DT1[DT3, on =.(A=E, B=F)]` – akrun May 04 '17 at 06:43
could you please explain the second chain `[, ..nm1]`. I found no documentation about ".." – visu-l May 04 '17 at 06:46
@visu-l It is a new feature in data.table. Usually, we use `DT[, nm1, with = FALSE]` – akrun May 04 '17 at 06:51

score 2 · Accepted Answer · answered May 04 '17 at 06:16

Following @akrun's answer, you can identify the rows in the join and use them to subset the table:

w = sort(DT1[DT2, on=.(A,B), which=TRUE, nomatch=0])
DT1[w]

#    A B C
# 1: 1 1 1
# 2: 3 1 3
# 3: 2 3 1

or more compactly

DT1[sort(DT1[DT2, on=.(A,B), which=TRUE, nomatch=0])]

If you want to keep rows in the order from DT2, don't sort; and if you want unmatched rows included, skip nomatch=0.

score 2 · Answer 3 · answered May 04 '17 at 07:00

 setkey(DT1, A,B,C)
 DT1[DT2]
# not quite right
   A B C
1: 1 1 4
2: 2 3 5
3: 3 1 6
# so join just on the two shared columns
DT1[ DT2[,list(A,B)]  ]
   A B C
1: 1 1 1
2: 2 3 1
3: 3 1 3

Look at DT2[,list(A,B)] and see that naming the desired columns (without quotes) in a list is the usual method of getting subsets of columns.

Subset data.table by another data.table without merging all columns

3 Answers3