44

How do I perform a semi-join with data.table? A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the rows of X to match the rows of Y. For example, the following code performs an inner join:

x <- data.table(x = 1:2, y = c("a", "b"))
setkey(x, x)
y <- data.table(x = c(1, 1), z = 10:11)

x[y]
#   x y  z
# 1: 1 a 10
# 2: 1 a 11

A semi-join would return just x[1]

Jim G.
  • 15,141
  • 22
  • 103
  • 166
hadley
  • 102,019
  • 32
  • 183
  • 245

8 Answers8

23

More possibilities :

w = unique(x[y,which=TRUE])  # the row numbers in x which have a match from y
x[w]

If there are duplicate key values in x, then that needs :

w = unique(x[y,which=TRUE,allow.cartesian=TRUE])
x[w]

Or, the other way around :

setkey(y,x)
w = !is.na(y[x,which=TRUE,mult="first"])
x[w]

If nrow(x) << nrow(y) then the y[x] approach should be faster.
If nrow(x) >> nrow(y) then the x[y] approach should be faster.

But the anti anti join appeals too :-)

Arun
  • 116,683
  • 26
  • 284
  • 387
Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • 2
    Cool! Now I understand what `allow.cartesian=TRUE` is for. – Josh O'Brien Sep 24 '13 at 01:22
  • I knew that guru will stop by! Since you officially liked the anti anti join approach, how about adding `x[!!y]` to the syntax? :) – Victor K. Sep 24 '13 at 01:49
  • @VictorK. :) `x[!!y]` is an error currently isn't it, so ok. Please file as a feature request. – Matt Dowle Sep 24 '13 at 07:01
  • 1
    For what it's worth, the anti-join operation was implemented as `x[!y]` and not the syntax above. – Jthorpe May 10 '16 at 23:58
  • This solution causes R to crash for me with moderately large data.tables. (x ~ 3 million rows, and y ~ 2,000 rows). Really wish semi-join operator was implemented as part of the data.table core... – Jthorpe Jan 03 '19 at 19:16
  • @Jthorpe Have you searched the issue list to see if it's on the list properly. If you can provide a reproducible example for your crash then it would really help raise the priority. I suspect that something else is wrong though in your query though. We need to see the output/warnings/verbose to help. – Matt Dowle Jan 04 '19 at 00:00
  • double anti join: `tmp <- x[!y]; x[!tmp]`? – s_baldur Jan 22 '19 at 08:53
14

One solution I can think of is:

tmp <- x[!y]
x[!tmp]

In data.table, you can have another data table as an i expression (i.e., the first expression in the data.table.[ call), and that will perform a join, e.g.:

x <- data.table(x = 1:10, y = letters[1:10])
setkey(x, x)
y <- data.table(x = c(1,3,5,1), z = 1:4)

> x[y]
   x y z
1: 1 a 1
2: 3 c 2
3: 5 e 3
4: 1 a 4

The ! before the i expression is an extension of the syntax above that performs a 'not-join', as described on p. 11 of data.table documentation. So the first assignments evaluates to a subset of x that doesn't have any rows where the key (column x) is present in y:

> x[!y]
    x y
1:  2 b
2:  4 d
3:  6 f
4:  7 g
5:  8 h
6:  9 i
7: 10 j

It is similar to setdiff in this regard. And therefore the second statement returns all the rows in x where the key is present in y.

The ! feature was added in data.table 1.8.4 with the following note in NEWS:

o   A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384i.
        DT[-DT["a", which=TRUE, nomatch=0]]   # old not-join idiom, still works
        DT[!"a"]                              # same result, now preferred.
        DT[!J(6),...]                         # !J == not-join
        DT[!2:3,...]                          # ! on all types of i
        DT[colA!=6L | colB!=23L,...]          # multiple vector scanning approach (slow)
        DT[!J(6L,23L)]                        # same result, faster binary search
    '!' has been used rather than '-' :
        * to match the 'not-join'/'not-where' nomenclature
        * with '-', DT[-0] would return DT rather than DT[0] and not be backwards
          compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
          base R) and after this new feature.
        * to leave DT[+J...] and DT[-J...] available for future use

For some reason, the following doesn't work x[!(x[!y])] - probably data.table is too smart about parsing the argument.

P.S. As Josh O'Brien pointed in another answer, a one-line would be x[!eval(x[!y])].

Victor K.
  • 4,054
  • 3
  • 25
  • 38
10

I'm confused with all the not-joins above, isn't what you want simply:

unique(x[y, .SD])
#   x y
#1: 1 a

If x can have duplicate keys, then you can unique y instead:

## Creating an example data.table 'a' three-times-repeated first row 
x <- data.table(x = c(1,1,1,2), y = c("a", "a", "a", "b"))
setkey(x, x)
y <- data.table(x = c(1, 1), z = 10:11)
setkey(y, x)

x[eval(unique(y, by = key(y))), .SD] # data.table >= 1.9.8 requires by=key(y)
#    x y
# 1: 1 a
# 2: 1 a
# 3: 1 a
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
eddi
  • 49,088
  • 6
  • 104
  • 155
3

Update. Based on all the discussion here, I would do something like this, which should be fast and work in the most general case:

x[eval(unique(y[, key(x), with = FALSE]))]

Here is another, more direct solution:

unique(x[eval(y$x)])

It's more direct and runs faster - here is the comparison in run times with my previous solution:

# Generate some large data
N <- 1000000 * 26
x <- data.table(x = 1:N, y = letters, z = rnorm(N))
setkey(x, x)
y <- data.table(x = sample(N, N/10, replace = TRUE),  z = sample(letters, N/10, replace = TRUE))
setkey(y, x)

system.time(r1 <- x[!eval(x[!y])])
   user  system elapsed 
  7.772   1.217  11.998 

system.time(r2 <- unique(x[eval(y$x)]))
   user  system elapsed 
  0.540   0.142   0.723 

In a more general case, you can do something like

x[eval(y[, key(x), with = FALSE])]
Victor K.
  • 4,054
  • 3
  • 25
  • 38
3

I tried to write a method that doesn't use any names, which are downright confusing in the OP's example.

sJ <- function(x,y){
    ycols <- 1:min(ncol(y),length(key(x)))
    yjoin <- unique(y[, ..ycols])
    yjoin
}

x[eval(sJ(x,y))]

For Victor's simpler example, this gives the desired output:

   x y
1: 1 a
2: 3 c
3: 5 e

This is a ~30% slower than Victor's way.

EDIT: And Victor's approach, taking unique before joining, is quite a bit faster:

N <- 1e5*26
x <- data.table(x = 1:N, y = letters, z = rnorm(N))
setkey(x, x)
y <- data.table(x = sample(N, N/10, replace = TRUE),  z = sample(letters, N/10, replace = TRUE))
setkey(y, x)
require(microbenchmark)
microbenchmark(
    sJ=x[eval(sJ(x,y))],
    dolla=unique(x[eval(y$x)]),
    brack=x[eval(unique(y[['x']]))]
)
Unit: milliseconds
  expr       min        lq    median        uq      max neval
 #    sJ 120.22700 125.04900 126.50704 132.35326 217.6566   100
 # dolla 105.05373 108.33804 109.16249 118.17613 285.9814   100
 # brack  53.95656  61.32669  61.88227  65.21571 235.8048   100

I'm guessing the [[ vs $ doesn't help the speed, but didn't check.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Frank
  • 66,179
  • 8
  • 96
  • 180
2

This thread is so old. But I noticed that the solution can be easily derived from the definition of semi-join given in the original post:

"A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the rows of X to match the rows of Y"

library(data.table)
dt1 <-  data.table(ProdId = 1:4,
                   Product = c("Bread", "Cheese", "Pizza", "Butter"))
dt2 <-  data.table(ProdId = c(1, 1, 3, 4, 5),
                   Company = c("A", "B", "C", "D", "E"))

# semi-join
unique(merge(dt1, dt2, on="ProdId")[, names(dt1), with=F])
   ProdId Product
1:      1   Bread
2:      3   Pizza
3:      4  Butter

I've simply applied the syntax of inner-join, followed by filtering columns from first table only, with unique() to remove rows of first table which were repeated to match rows of second table.

Edit: The above approach will match dplyr::semi_join() output only if we have unique rows in the first table. If we need to output all the rows including duplicates from first table, then we may use fsetdiff() method shown below.

Another one line data.table solution:

fsetdiff(dt1, dt1[!dt2, on="ProdId"])
   ProdId Product
1:      1   Bread
2:      3   Pizza
3:      4  Butter

I've just removed from first table the anti-join of first and second. Seems simpler to me. If the first table has duplicate rows, we will need:

fsetdiff(dt1, dt1[!dt2, on="ProdId"], all=T)

The fsetdiff() result with ,all=T matches the output from dplyr:

dplyr::semi_join(dt1, dt2, by="ProdId")
  ProdId Product
1      1   Bread
2      3   Pizza
3      4  Butter

Using another set of data taken from one of previous posts:

x <- data.table(x = c(1,1,1,2), y = c("a", "a", "a", "b"))
y <- data.table(x = c(1, 1), z = 10:11)

With dplyr:

dplyr::semi_join(x, y, by="x")
  x y
1 1 a
2 1 a
3 1 a

With data.table:

fsetdiff(x, x[!y, on="x"], all=T)
   x y
1: 1 a
2: 1 a
3: 1 a

Without ,all=T, the duplicate rows are removed:

fsetdiff(x, x[!y, on="x"])
   x y
1: 1 a
San
  • 518
  • 5
  • 14
  • Your first approach runs into the same problems as eddi's first answer when `x` has identical (key) rows. – BenBarnes Aug 15 '17 at 09:47
  • I've noticed that it's effectively the same expression. Can you point out cases where the `fsetdiff()` approach will fail? – San Aug 15 '17 at 10:06
  • Seems like the `fsetdiff` approach also gets rid of duplicate rows in x just like the first approach. Try using the data in eddi's second approach. The result should be 3 rows, but `fsetdiff` only returns 1. – BenBarnes Aug 15 '17 at 10:27
  • `fsetdiff(dt1, dt1[!dt2, on="ProdId"], all=T)` will get all the rows. I've updated my post to include both situations. – San Aug 15 '17 at 11:07
  • The fsetdiff with all=TRUE is a nice variation on Victor's answer, but I still don't find it satisfactory, syntax-wise. There are a couple open requests, fyi: #915 (also linked under Matt's answer) and my https://github.com/Rdatatable/data.table/issues/2158 (that I wrote because I forgot about the first one...) – Frank Aug 15 '17 at 18:15
  • @Frank: I agree. `data.table` library should have a **direct** syntax for semi-join instead of users having to "derive" it by using other `data.table` functions. – San Aug 16 '17 at 02:42
0

The package dplyr supports the following four join types:

inner_join, left_join, semi_join, anti_join

So for the semi-join try the following code

library("dplyr")

table1 <- data.table(x = 1:2, y = c("a", "b"))
table2 <- data.table(x = c(1, 1), z = 10:11)

semi_join(table1, table2)

The output is as expected:

# Joining by: "x"
# Source: local data table [1 x 2]
# 
#       x     y
#   (int) (chr)
# 1     1     a
rawr
  • 20,481
  • 4
  • 44
  • 78
Brian Bole
  • 35
  • 5
  • 3
    Yes, I asked the question so I could figure out how to make `semi_join()` work for data tables ;) – hadley Jan 27 '15 at 01:03
0

Try the following:

 w <- y[,unique(x)]
 x[x %in% w]

Output will be:

   x y
1: 1 a
PT2018
  • 31
  • 5