Minus operation of data frames

Question

I have 2 data frames df1 and df2.

df1 <- data.frame(c1=c("a","b","c","d"),c2=c(1,2,3,4) )
df2 <- data.frame(c1=c("c","d","e","f"),c2=c(3,4,5,6) )

> df1
  c1 c2
1  a  1
2  b  2
3  c  3
4  d  4

> df2
  c1 c2
1  c  3
2  d  4
3  e  5
4  f  6

I need to perform set operation of these 2 data frames. I used merge(df1,df2,all=TRUE) and merge(df1,df2,all=FALSE) method to get the union and intersection of these data frames and got the required output. What is the function to get the minus of these data frames,that is all the positions existing on one data frame but not the other? I need the following output.

 c1 c2
1  a  1
2  b  2

Do you want to get lines in df1 that are not in df2 **and** lines in df2 that are not in df1 ? — juba, Apr 22 '13 at 09:39
@juba, I believe this is more of `setdiff` but for `data.frame`s — Arun, Apr 22 '13 at 09:43
Yes, that's what I thought, but the result given is not a `setdiff`. That's why I ask the question :) — juba, Apr 22 '13 at 09:44
sorry, don't follow. `c,3` and `d,4` are present in both. So, `setdiff(df1, df2)` should return those rows not in `df2`, which is `a,1 and b,2`. This seems to be a `setdiff` operation to me (if implemented for data.frame) — Arun, Apr 22 '13 at 09:48
@juba i want another data frame that contains data in df1 that are not in df2 — Dinoop Nair, Apr 22 '13 at 09:48
@Arun I think that `setdiff` should also return `e,5` and `f,6` ? — juba, Apr 22 '13 at 09:50
@juba, that depends on how you do: `setdiff(df1, df2)` should return OP's input. `setdiff(df2, df1)` should return what you say. It's a set operation. So, it should give x entries not in y (so order matters). — Arun, Apr 22 '13 at 09:51
@Arun, you're right, my bad, I thought that `setdiff` didn't depend on the order of its arguments... — juba, Apr 22 '13 at 09:53

Arun · Accepted Answer · 2016-06-20T19:51:33.140

32

I remember coming across this exact issue quite a few months back. Managed to sift through my Evernote one-liners.

Note: This is not my solution. Credit goes to whoever wrote it (whom I can't seem to find at the moment).

If you don't worry about rownames then you can do:

df1[!duplicated(rbind(df2, df1))[-seq_len(nrow(df2))], ]
#   c1 c2
# 1  a  1
# 2  b  2

Edit: A data.table solution:

dt1 <- data.table(df1, key="c1")
dt2 <- data.table(df2)
dt1[!dt2]

or better one-liner (from v1.9.6+):

setDT(df1)[!df2, on="c1"]

This returns all rows in df1 where df2$c1 doesn't have a match with df1$c1.

edited Jun 20 '16 at 19:51

answered Apr 22 '13 at 09:58

Arun

116,683
26
284
387

2

This `data.table` solution is much faster than doing a `setdiff` or `%in%` lookup, thanks! – daroczig May 13 '16 at 21:20

score 10 · Answer 2 · answered Apr 22 '13 at 09:39

10

I prefer sqldf package:

require(sqldf)
sqldf("select * from df1 except select * from df2")

##   c1 c2
## 1  a  1
## 2  b  2

answered Apr 22 '13 at 09:39

Nishanth

6,932
5
26
38

score 4 · Answer 3 · answered Apr 22 '13 at 09:31

You can create identifier columnas then subset:

e.g.

df1 <- data.frame(c1=c("a","b","c","d"),c2=c(1,2,3,4), indf1 = rep("Y",4) )
df2 <- data.frame(c1=c("c","d","e","f"),c2=c(3,4,5,6),indf2 = rep("Y",4) )
merge(df1,df2)
#  c1 c2 indf1 indf2
#1  c  3     Y     Y
#2  d  4     Y     Y

bigdf <- merge(df1,df2,all=TRUE)
#  c1 c2 indf1 indf2
#1  a  1     Y  <NA>
#2  b  2     Y  <NA>
#3  c  3     Y     Y
#4  d  4     Y     Y
#5  e  5  <NA>     Y
#6  f  6  <NA>     Y

Then subset how you wish:

 bigdf[is.na(bigdf$indf1) ,]
#  c1 c2 indf1 indf2
#5  e  5  <NA>     Y
#6  f  6  <NA>     Y

 bigdf[is.na(bigdf$indf2) ,]  #<- output you requested those not in df2
#  c1 c2 indf1 indf2
#1  a  1     Y  <NA>
#2  b  2     Y  <NA>

this is not possible..bcoz the given data is only a sample dataframes.the actual data frames contains huge amount of rows. so the object size may become very large by this method. — Dinoop Nair, Apr 22 '13 at 09:42
@DinoopNair Then you may do a merge with `all.x=TRUE` and subset on `indf2` ? — juba, Apr 22 '13 at 09:56

score 4 · Answer 4 · answered Aug 06 '18 at 09:00

4

I think the simplest solution is with dplyr (tidyverse).

require(tidyverse)
anti_join(df1, df2)

answered Aug 06 '18 at 09:00

JohannesNE

1,343
9
14

score 2 · Answer 5 · edited May 23 '17 at 12:24

One issue with https://stackoverflow.com/a/16144262/2055486 is it assumes neither data frame already has duplicated rows. The following function removes that limitation and also works with arbitrary user defined columns in x or y.

The implementation uses a similar idea to the implementation of duplicated.data.frame in concatenating the columns together with a separator. duplicated.data.frame uses "\r", which can cause collisions if the entries have embedded "\r" characters. This uses the ASCII record separator "\30" which will have a much lower chance of appearing in input data.

setdiff.data.frame <- function(x, y,
    by = intersect(names(x), names(y)),
    by.x = by, by.y = by) {
  stopifnot(
    is.data.frame(x),
    is.data.frame(y),
    length(by.x) == length(by.y))

  !do.call(paste, c(x[by.x], sep = "\30")) %in% do.call(paste, c(y[by.y], sep = "\30"))
}

# Example usage
# remove all 4 or 6 cylinder 4 gear cars or 8 cylinder 3 gear rows
to_remove <- data.frame(cyl = c(4, 6, 8), gear = c(4, 4, 3))
mtcars[setdiff.data.frame(mtcars, to_remove), ]
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

# with differing column names
to_remove2 <- data.frame(a = c(4, 6, 8), b = c(4, 4, 3))
mtcars[setdiff.data.frame(mtcars, to_remove2, by.x = c("cyl", "gear"), by.y = c("a", "b")), ]
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

score 1 · Answer 6 · answered Apr 22 '13 at 09:34

1

If you're not planning on using any of the actual data in d2, then you don't need merge at all:

df1[!(df1$c1 %in% df2$c1), ]

answered Apr 22 '13 at 09:34

Hong Ooi

56,353
13
134
187

1

i need to check the values in both columns. – Dinoop Nair Apr 22 '13 at 09:52

score 1 · Answer 7 · answered Apr 22 '13 at 10:55

1

You can check the values in both columns and subset like this (just adding another solution):

na.omit( df1[ sapply( 1:ncol(df1) , function(x) ! df1[,x] %in% df2[,x] ) , ] )
#  c1 c2
#1  a  1
#2  b  2

answered Apr 22 '13 at 10:55

Simon O'Hanlon

58,647
14
142
184

score 0 · Answer 8 · answered Jan 31 '21 at 18:19

Another way with base R (incl. keeping rownames and checking all rows) would be the double usage of duplicated():

df1 <- data.frame(c1=c("a","b","c","d"),c2=c(1,2,3,4) )
df2 <- data.frame(c1=c("c","d","e","f"),c2=c(3,4,5,6) )

all <- rbind(df1, df2)
all[!duplicated(all) & !duplicated(all, fromLast = TRUE), ]

score 0 · Answer 9 · answered Feb 06 '21 at 02:39

Your can use dplyr::setdiff easily

dplyr::setdiff(iris, iris[iris$Sepal.Length >6,])

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1           5.1         3.5          1.4         0.2     setosa
2           4.9         3.0          1.4         0.2     setosa
3           4.7         3.2          1.3         0.2     setosa
4           4.6         3.1          1.5         0.2     setosa
5           5.0         3.6          1.4         0.2     setosa
6           5.4         3.9          1.7         0.4     setosa
7           4.6         3.4          1.4         0.3     setosa
8           5.0         3.4          1.5         0.2     setosa
9           4.4         2.9          1.4         0.2     setosa
10          4.9         3.1          1.5         0.1     setosa
11          5.4         3.7          1.5         0.2     setosa
12          4.8         3.4          1.6         0.2     setosa
13          4.8         3.0          1.4         0.1     setosa
14          4.3         3.0          1.1         0.1     setosa
15          5.8         4.0          1.2         0.2     setosa
16          5.7         4.4          1.5         0.4     setosa
17          5.4         3.9          1.3         0.4     setosa
18          5.1         3.5          1.4         0.3     setosa
19          5.7         3.8          1.7         0.3     setosa
20          5.1         3.8          1.5         0.3     setosa
21          5.4         3.4          1.7         0.2     setosa
22          5.1         3.7          1.5         0.4     setosa
23          4.6         3.6          1.0         0.2     setosa
24          5.1         3.3          1.7         0.5     setosa
25          4.8         3.4          1.9         0.2     setosa
26          5.0         3.0          1.6         0.2     setosa
27          5.0         3.4          1.6         0.4     setosa
28          5.2         3.5          1.5         0.2     setosa
29          5.2         3.4          1.4         0.2     setosa
30          4.7         3.2          1.6         0.2     setosa
31          4.8         3.1          1.6         0.2     setosa
32          5.4         3.4          1.5         0.4     setosa
33          5.2         4.1          1.5         0.1     setosa
34          5.5         4.2          1.4         0.2     setosa
35          4.9         3.1          1.5         0.2     setosa
36          5.0         3.2          1.2         0.2     setosa
37          5.5         3.5          1.3         0.2     setosa
38          4.9         3.6          1.4         0.1     setosa
39          4.4         3.0          1.3         0.2     setosa
40          5.1         3.4          1.5         0.2     setosa
41          5.0         3.5          1.3         0.3     setosa
42          4.5         2.3          1.3         0.3     setosa
43          4.4         3.2          1.3         0.2     setosa
44          5.0         3.5          1.6         0.6     setosa
45          5.1         3.8          1.9         0.4     setosa
46          4.8         3.0          1.4         0.3     setosa
47          5.1         3.8          1.6         0.2     setosa
48          4.6         3.2          1.4         0.2     setosa
49          5.3         3.7          1.5         0.2     setosa
50          5.0         3.3          1.4         0.2     setosa
51          5.5         2.3          4.0         1.3 versicolor
52          5.7         2.8          4.5         1.3 versicolor
53          4.9         2.4          3.3         1.0 versicolor
54          5.2         2.7          3.9         1.4 versicolor
55          5.0         2.0          3.5         1.0 versicolor
56          5.9         3.0          4.2         1.5 versicolor
57          6.0         2.2          4.0         1.0 versicolor
58          5.6         2.9          3.6         1.3 versicolor
59          5.6         3.0          4.5         1.5 versicolor
60          5.8         2.7          4.1         1.0 versicolor
61          5.6         2.5          3.9         1.1 versicolor
62          5.9         3.2          4.8         1.8 versicolor
63          6.0         2.9          4.5         1.5 versicolor
64          5.7         2.6          3.5         1.0 versicolor
65          5.5         2.4          3.8         1.1 versicolor
66          5.5         2.4          3.7         1.0 versicolor
67          5.8         2.7          3.9         1.2 versicolor
68          6.0         2.7          5.1         1.6 versicolor
69          5.4         3.0          4.5         1.5 versicolor
70          6.0         3.4          4.5         1.6 versicolor
71          5.6         3.0          4.1         1.3 versicolor
72          5.5         2.5          4.0         1.3 versicolor
73          5.5         2.6          4.4         1.2 versicolor
74          5.8         2.6          4.0         1.2 versicolor
75          5.0         2.3          3.3         1.0 versicolor
76          5.6         2.7          4.2         1.3 versicolor
77          5.7         3.0          4.2         1.2 versicolor
78          5.7         2.9          4.2         1.3 versicolor
79          5.1         2.5          3.0         1.1 versicolor
80          5.7         2.8          4.1         1.3 versicolor
81          5.8         2.7          5.1         1.9  virginica
82          4.9         2.5          4.5         1.7  virginica
83          5.7         2.5          5.0         2.0  virginica
84          5.8         2.8          5.1         2.4  virginica
85          6.0         2.2          5.0         1.5  virginica
86          5.6         2.8          4.9         2.0  virginica
87          6.0         3.0          4.8         1.8  virginica
88          5.9         3.0          5.1         1.8  virginica

Minus operation of data frames

9 Answers9

Linked

Related