subset a data.frame based on unique identifiers in another data.frame

Question

Lets say we have to data.frames:

x <- data.frame(date=c(1,2,3,1,3),  id=c("a", "a", "a", "b", "b"), sum=50:54)
y <- data.frame(date=c(1,2,1,3), id=c("a", "a", "b", "b"))

x
  date id sum
1    1  a  50
2    2  a  51
3    3  a  52
4    1  b  53
5    3  b  54
y

  date id
1    1  a
2    2  a
3    1  b
4    3  b

Now, i want to find the row in x that has dates that is not in y, within the same id. In y we have 1, 2 and 3 in id a and in y we only have 1 and 2 in id a. How do i identify (and, preferably remove from x) row number 3 in x?

EDIT: I found a (very ugly and slow) solution, but there has to be a better and faster one? Currently im running it on two large data.frames, and first time it took more than one hour. I need to run it multiple times, so any help would be appreciated.

z <- data.frame()

for (f in 1:length(unique(x$id))) {  #Run the iteration for all the unique id's in x
  id <- unique(x$id)[f]   #find the name of the id in this iteriation
  a <- x[x$id==id,]       #subset x 
  b <- y[y$id==id,]       #subset y
x.new <- a[a$date%in%unique(b$date),] #find the dates that are in x and also in y
z <- rbind(z, x.new)       #bind the two data.frames together
}

score 2 · Accepted Answer · edited May 23 '17 at 12:09

It seems you want an inner join. You are conceptualizing the problem as "find rows in X that are not in Y, then remove them from X," - this is more commonly stated as "keep only rows in X that are also in Y."

There are many ways to do this, it is the default setting for base::merge

merge(x, y, all = F)
#   date id sum
# 1    1  a  50
# 2    1  b  53
# 3    2  a  51
# 4    3  b  54

There are many other options detailed at the R-FAQ How to join (merge) data frames (inner, outer, left, right)?

If you do need to identify the removed rows for some other purpose, dplyr::anti_join is one way. anti_join(x, y) will return the rows in x that are not in y.

library(dplyr)
anti_join(x, y)
# Joining, by = c("date", "id")
#   date id sum
# 1    3  a  52

If speed is an issue, the data.table solution method as in this answer will be fastest. This answer does some fairly comprehensive benchmarking. However, your code is making enough inefficient steps (growing a data frame inside a loop, recomputing the same unique values, sometimes unnecessarily) that my guess is that even base::merge will be several orders of magnitude faster.

subset a data.frame based on unique identifiers in another data.frame

1 Answers1