Select rows from one data.frame that are not present in a second data.frame

Question

I have two data.frames:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

I want to find the rows a1 have that a2 doesn't.

Is there a built in function for this type of operation?

(p.s: I did write a solution for it, I am simply curious if someone already made a more crafted code)

Here is my solution:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

rows.in.a1.that.are.not.in.a2  <- function(a1,a2)
{
    a1.vec <- apply(a1, 1, paste, collapse = "")
    a2.vec <- apply(a2, 1, paste, collapse = "")
    a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
    return(a1.without.a2.rows)
}
rows.in.a1.that.are.not.in.a2(a1,a2)

score 183 · Answer 1 · edited Feb 05 '21 at 17:10

183

sqldf provides a nice solution

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

require(sqldf)

a1NotIna2 <- sqldf('SELECT * FROM a1 EXCEPT SELECT * FROM a2')

And the rows which are in both data frames:

a1Ina2 <- sqldf('SELECT * FROM a1 INTERSECT SELECT * FROM a2')

The new version of dplyr has a function, anti_join, for exactly these kinds of comparisons

require(dplyr) 
anti_join(a1,a2)

And semi_join to filter rows in a1 that are also in a2

semi_join(a1,a2)

edited Feb 05 '21 at 17:10

anpami

760
5
17

answered Jan 29 '13 at 08:53

Rickard

3,600
2
19
22

27

Thanks for `anti_join` and `semi_join`! – drastega Sep 08 '15 at 18:14
is there a reason why anti_join would return a null DF, as would sqldf, but the functions identical(a1,a2) and all.equal() would contradict that? – 3pitt Oct 10 '17 at 18:38
Just wanted to add here that anti_join and semi_join would not work in some cases like mine. I was getting "Error: Columns must be 1d atomic vectors or lists" for my data frame. Maybe I could process my data so that these functions work. Sqldf worked right out of the gate! – Akshay Gaur Nov 27 '17 at 15:53
@AkshayGaur it should just be a data format or data cleaning problem; sqldf is just sql everything is pre-processed to be like nromal DB such that we could just run sql on the data. – stucash Dec 19 '19 at 23:15

leerssej · Answer 2 · 2017-04-19T22:25:47.353

108

In dplyr:

setdiff(a1,a2)

Basically, setdiff(bigFrame, smallFrame) gets you the extra records in the first table.

In the SQLverse this is called a

For good descriptions of all join options and set subjects, this is one of the best summaries I've seen put together to date: http://www.vertabelo.com/blog/technical-articles/sql-joins

But back to this question - here are the results for the setdiff() code when using the OP's data:

> a1
  a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e

> a2
  a b
1 1 a
2 2 b
3 3 c

> setdiff(a1,a2)
  a b
1 4 d
2 5 e

Or even anti_join(a1,a2) will get you the same results.
For more info: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

edited Apr 19 '17 at 22:25

answered Nov 10 '16 at 03:14

leerssej

14,260
6
48
57

2

Since the OP asks for items in `a1` that are not in `a2`, don't you want to use something like `semi_join(a1, a2, by = c('a','b'))` ? In the answer by "Rickard", I see that `semi_join` was suggested. – steveb Dec 14 '16 at 08:29
Sure! Another great choice, too; particularly if you have dataframes with only a join key and differing column names. – leerssej Apr 19 '17 at 22:58
setdiff is from lubridate::setdiff and not from library(dplyr) – mtelesha Sep 25 '18 at 17:19
1

@mtelesha - Hmm, the docs and source code for _dplyr_ show it being there: (https://dplyr.tidyverse.org/reference/setops.html , https://github.com/tidyverse/dplyr/blob/master/R/sets.). Additionally, when the dplyr library is loaded it even reports masking the base `setdiff()` function that works on two vectors: https://stat.ethz.ch/R-manual/R-devel/library/base/html/sets.html. Maybe you have loaded the _lubridate_ library after _dplyr_ and it is suggesting it as the source in the tabcomplete listing? – leerssej Sep 25 '18 at 19:21
1

There is a conflict between lubridate and dplyr, see https://github.com/tidyverse/lubridate/issues/693 – slhck May 20 '19 at 14:30
Also consider `dplyr::intersection` – bmc May 03 '21 at 16:29

score 96 · Accepted Answer · edited Jul 14 '16 at 16:59

This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's package compare:

library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1,a2,allowAll=TRUE)
comparison$tM
#  a b
#1 1 a
#2 2 b
#3 3 c

The function compare gives you a lot of flexibility in terms of what kind of comparisons are allowed (e.g. changing order of elements of each vector, changing order and names of variables, shortening variables, changing case of strings). From this, you should be able to figure out what was missing from one or the other. For example (this is not very elegant):

difference <-
   data.frame(lapply(1:ncol(a1),function(i)setdiff(a1[,i],comparison$tM[,i])))
colnames(difference) <- colnames(a1)
difference
#  a b
#1 4 d
#2 5 e

I find this function confusing. I thought it would work for me, but it seems to only work as shown above if one set contains identically matching rows of the other set. Consider this case: `a2 <- data.frame(a = c(1:3, 1), b = c(letters[1:3], "c"))`. Leave `a1` the same. Now try the comparison. It's not clear to me even in reading the options what the proper way is to list only common elements. — Hendy, Aug 08 '13 at 02:59

score 45 · Answer 4 · answered Jul 03 '10 at 17:50

45

It is certainly not efficient for this particular purpose, but what I often do in these situations is to insert indicator variables in each data.frame and then merge:

a1$included_a1 <- TRUE
a2$included_a2 <- TRUE
res <- merge(a1, a2, all=TRUE)

missing values in included_a1 will note which rows are missing in a1. similarly for a2.

One problem with your solution is that the column orders must match. Another problem is that it is easy to imagine situations where the rows are coded as the same when in fact are different. The advantage of using merge is that you get for free all error checking that is necessary for a good solution.

answered Jul 03 '10 at 17:50

Eduardo Leoni

8,991
6
42
49

So... in looking for a missing value, you create another missing value... How do you find the missing value(s) in `included_a1`? :-/ – Louis Maddox Jul 12 '14 at 15:15
1

use is.na() and subset, or dplyr::filter – Eduardo Leoni Sep 18 '15 at 16:53
3

Thank you for teaching a way without installing a new library! – Rodrigo Oct 08 '18 at 16:12

score 31 · Answer 5 · answered Mar 08 '16 at 16:08

31

I wrote a package (https://github.com/alexsanjoseph/compareDF) since I had the same issue.

  > df1 <- data.frame(a = 1:5, b=letters[1:5], row = 1:5)
  > df2 <- data.frame(a = 1:3, b=letters[1:3], row = 1:3)
  > df_compare = compare_df(df1, df2, "row")

  > df_compare$comparison_df
    row chng_type a b
  1   4         + 4 d
  2   5         + 5 e

A more complicated example:

library(compareDF)
df1 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                         "Hornet 4 Drive", "Duster 360", "Merc 240D"),
                 id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Mer"),
                 hp = c(110, 110, 181, 110, 245, 62),
                 cyl = c(6, 6, 4, 6, 8, 4),
                 qsec = c(16.46, 17.02, 33.00, 19.44, 15.84, 20.00))

df2 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                         "Hornet 4 Drive", " Hornet Sportabout", "Valiant"),
                 id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Val"),
                 hp = c(110, 110, 93, 110, 175, 105),
                 cyl = c(6, 6, 4, 6, 8, 6),
                 qsec = c(16.46, 17.02, 18.61, 19.44, 17.02, 20.22))

> df_compare$comparison_df
    grp chng_type                id1 id2  hp cyl  qsec
  1   1         -  Hornet Sportabout Dus 175   8 17.02
  2   2         +         Datsun 710 Dat 181   4 33.00
  3   2         -         Datsun 710 Dat  93   4 18.61
  4   3         +         Duster 360 Dus 245   8 15.84
  5   7         +          Merc 240D Mer  62   4 20.00
  6   8         -            Valiant Val 105   6 20.22

The package also has an html_output command for quick checking

df_compare$html_output

answered Mar 08 '16 at 16:08

Alex Joseph

4,719
2
21
25

your compareDF is exactly what I need , and have done a good job with small sets.However:1) Not working with a set 50Million rows with 3 column (say) it says out of memory with 32 GB RAM. 2) I also see HTML takes some time to write, can the same output be sent to TEXT file? – Deep Feb 13 '19 at 06:44
1) Yeah 50 million rows is A LOT OF data, just to hold in memory ;). I'm aware that it is not great with large datasets, so you might have to do some sort of chunking. 2) you can give the argument - limit_html = 0, to avoid it printing to a HTML. The same output is in compare_output$comparison_df which you can write to a CSV/TEXT fule using native R functions. – Alex Joseph Feb 13 '19 at 11:37
Thanks for your reply @Alex Joseph , I will give it a try and let you know how it goes. – Deep Feb 21 '19 at 10:26
Hi @Alex Joseph, thanks for the input the text format did work but found an issue , raised it under: https://stackoverflow.com/questions/54880218/compare-df-in-r-truncating-trailing-zero – Deep Feb 26 '19 at 07:19
It can't handle different numbers of columns. I got an error `The two data frames have different columns!` – PM0087 Jun 04 '20 at 08:39
@PeyM87 - If the columns are different it's very easily visible from the names(df) right? What is the behaviour you're expecting? If you can create a in issue reprex on the github, I can take a look at it. – Alex Joseph Jun 05 '20 at 12:14
@AlexJoseph: I have dataframe1 with X number of columns. After some time, new data comes in and I have dataframe2 with Y number of colums, in which some columns are always common. I thought this compare would for instance SUM the common columns and Add columns if there are any new. – PM0087 Jun 09 '20 at 07:51
If you have columns changing, doing a setdiff(names(df1), names(df2)) is probably the best approach – Alex Joseph Jun 10 '20 at 08:42
Works great, thanks so much, any easy method of identifying which dataset the change is in? For instance, the + is from df2 in the second row above. – Lowpar Oct 24 '22 at 14:26
@Lowpar - You can try the change_markers, option for this – Alex Joseph Oct 26 '22 at 03:38
Thanks @AlexJoseph, I might be missing something, but I like the markers, my challenge is, if I find a difference, then how do I know which dataset it came from? So if I have dataset a and dataset b, and I see there is a +, any way of knowing automatically is the + in dataset a or dataset b? Super thankful for your amazing package! You are a boss!! – Lowpar Oct 26 '22 at 11:22
1

No worries @Lowpar - The parameters for the function are `df_new` and `df_old`. If there is a `+` marker, it means that the `df_new` data.frame has it and `-` marker means that `df_old` data.frame has it changed. In the above example, Datsun 710, has a value of 181 HP in the new data frame and 93 HP in the old data.frame – Alex Joseph Oct 26 '22 at 20:11
Makes complete sense now. I was thinking + meant that a value was included, while - meant a value was missing. But seems very intuitive now. – Lowpar Nov 07 '22 at 10:27

Salim B · Answer 6 · 2021-06-22T13:02:46.283

You could use the daff package (which wraps the daff.js library using the V8 package):

library(daff)

diff_data(data_ref = a2,
          data = a1)

produces the following difference object:

Daff Comparison: ‘a2’ vs. ‘a1’ 
  First 6 and last 6 patch lines:
   @@   a   b
1 ... ... ...
2       3   c
3 +++   4   d
4 +++   5   e
5 ... ... ...
6 ... ... ...
7       3   c
8 +++   4   d
9 +++   5   e

The tabular diff format is described here and should be pretty self-explanatory. The lines with +++ in the first column @@ are the ones which are new in a1 and not present in a2.

The difference object can be used to patch_data(), to store the difference for documentation purposes using write_diff() or to visualize the difference using render_diff():

render_diff(
    diff_data(data_ref = a2,
              data = a1)
)

generates a neat HTML output:

score 11 · Answer 7 · answered Sep 27 '16 at 21:32

11

Using diffobj package:

library(diffobj)

diffPrint(a1, a2)
diffObj(a1, a2)

answered Sep 27 '16 at 21:32

zx8754

52,746
12
114
209

score 10 · Answer 8 · edited May 24 '18 at 19:46

I adapted the merge function to get this functionality. On larger dataframes it uses less memory than the full merge solution. And I can play with the names of the key columns.

Another solution is to use the library prob.

#  Derived from src/library/base/R/merge.R
#  Part of the R package, http://www.R-project.org
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  A copy of the GNU General Public License is available at
#  http://www.r-project.org/Licenses/

XinY <-
    function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
             notin = FALSE, incomparables = NULL,
             ...)
{
    fix.by <- function(by, df)
    {
        ## fix up 'by' to be a valid set of cols by number: 0 is row.names
        if(is.null(by)) by <- numeric(0L)
        by <- as.vector(by)
        nc <- ncol(df)
        if(is.character(by))
            by <- match(by, c("row.names", names(df))) - 1L
        else if(is.numeric(by)) {
            if(any(by < 0L) || any(by > nc))
                stop("'by' must match numbers of columns")
        } else if(is.logical(by)) {
            if(length(by) != nc) stop("'by' must match number of columns")
            by <- seq_along(by)[by]
        } else stop("'by' must specify column(s) as numbers, names or logical")
        if(any(is.na(by))) stop("'by' must specify valid column(s)")
        unique(by)
    }

    nx <- nrow(x <- as.data.frame(x)); ny <- nrow(y <- as.data.frame(y))
    by.x <- fix.by(by.x, x)
    by.y <- fix.by(by.y, y)
    if((l.b <- length(by.x)) != length(by.y))
        stop("'by.x' and 'by.y' specify different numbers of columns")
    if(l.b == 0L) {
        ## was: stop("no columns to match on")
        ## returns x
        x
    }
    else {
        if(any(by.x == 0L)) {
            x <- cbind(Row.names = I(row.names(x)), x)
            by.x <- by.x + 1L
        }
        if(any(by.y == 0L)) {
            y <- cbind(Row.names = I(row.names(y)), y)
            by.y <- by.y + 1L
        }
        ## create keys from 'by' columns:
        if(l.b == 1L) {                  # (be faster)
            bx <- x[, by.x]; if(is.factor(bx)) bx <- as.character(bx)
            by <- y[, by.y]; if(is.factor(by)) by <- as.character(by)
        } else {
            ## Do these together for consistency in as.character.
            ## Use same set of names.
            bx <- x[, by.x, drop=FALSE]; by <- y[, by.y, drop=FALSE]
            names(bx) <- names(by) <- paste("V", seq_len(ncol(bx)), sep="")
            bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
            bx <- bz[seq_len(nx)]
            by <- bz[nx + seq_len(ny)]
        }
        comm <- match(bx, by, 0L)
        if (notin) {
            res <- x[comm == 0,]
        } else {
            res <- x[comm > 0,]
        }
    }
    ## avoid a copy
    ## row.names(res) <- NULL
    attr(res, "row.names") <- .set_row_names(nrow(res))
    res
}


XnotinY <-
    function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
             notin = TRUE, incomparables = NULL,
             ...)
{
    XinY(x,y,by,by.x,by.y,notin,incomparables)
}

score 8 · Answer 9 · edited Sep 02 '18 at 06:53

Your example data does not have any duplicates, but your solution handle them automatically. This means that potentially some of the answers won't match to results of your function in case of duplicates.
Here is my solution which address duplicates the same way as yours. It also scales great!

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
rows.in.a1.that.are.not.in.a2  <- function(a1,a2)
{
    a1.vec <- apply(a1, 1, paste, collapse = "")
    a2.vec <- apply(a2, 1, paste, collapse = "")
    a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
    return(a1.without.a2.rows)
}

library(data.table)
setDT(a1)
setDT(a2)

# no duplicates - as in example code
r <- fsetdiff(a1, a2)
all.equal(r, rows.in.a1.that.are.not.in.a2(a1,a2))
#[1] TRUE

# handling duplicates - make some duplicates
a1 <- rbind(a1, a1, a1)
a2 <- rbind(a2, a2, a2)
r <- fsetdiff(a1, a2, all = TRUE)
all.equal(r, rows.in.a1.that.are.not.in.a2(a1,a2))
#[1] TRUE

It needs data.table 1.9.8+

score 3 · Answer 10 · edited Jun 23 '16 at 19:56

3

Maybe it is too simplistic, but I used this solution and I find it very useful when I have a primary key that I can use to compare data sets. Hope it can help.

a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
different.names <- (!a1$a %in% a2$a)
not.in.a2 <- a1[different.names,]

edited Jun 23 '16 at 19:56

user2100721

3,557
2
20
29

answered May 10 '16 at 13:24

Kenia Sousa

31
3

How is this different from what OP already tried? You've used the exact same code like Tal to compare a single column instead of the whole row (which was the requirement) – David Arenburg Sep 05 '16 at 08:29

score 2 · Answer 11 · edited Sep 02 '18 at 06:54

2

Using subset:

missing<-subset(a1, !(a %in% a2$a))

edited Sep 02 '18 at 06:54

David Arenburg

91,361
17
137
196

answered Jul 30 '18 at 03:01

Emily

470
5
16

This answer works for the OP's scenario. What about the more general case when the variable "a" does match between the two data.frames("a1" and "a2"), but the variable "b" does not? – Bryan F Sep 12 '18 at 20:44

score 1 · Answer 12 · answered Apr 14 '16 at 13:26

Yet another solution based on match_df in plyr. Here's plyr's match_df:

match_df <- function (x, y, on = NULL) 
{
    if (is.null(on)) {
        on <- intersect(names(x), names(y))
        message("Matching on: ", paste(on, collapse = ", "))
    }
    keys <- join.keys(x, y, on)
    x[keys$x %in% keys$y, , drop = FALSE]
}

We can modify it to negate:

library(plyr)
negate_match_df <- function (x, y, on = NULL) 
{
    if (is.null(on)) {
        on <- intersect(names(x), names(y))
        message("Matching on: ", paste(on, collapse = ", "))
    }
    keys <- join.keys(x, y, on)
    x[!(keys$x %in% keys$y), , drop = FALSE]
}

Then:

diff <- negate_match_df(a1,a2)

score 1 · Answer 13 · answered Apr 07 '20 at 17:49

The following code uses both data.table and fastmatch for increased speed.

library("data.table")
library("fastmatch")

a1 <- setDT(data.frame(a = 1:5, b=letters[1:5]))
a2 <- setDT(data.frame(a = 1:3, b=letters[1:3]))

compare_rows <- a1$a %fin% a2$a
# the %fin% function comes from the `fastmatch` package

added_rows <- a1[which(compare_rows == FALSE)]

added_rows

#    a b
# 1: 4 d
# 2: 5 e

score 0 · Answer 14 · answered Nov 19 '20 at 10:17

Really fast comparison, to get count of differences. Using specific column name.

colname = "CreatedDate" # specify column name
index <- match(colname, names(source_df)) # get index name for column name
sel <- source_df[, index] == target_df[, index] # get differences, gives you dataframe with TRUE and FALSE values
table(sel)["FALSE"] # count of differences
table(sel)["TRUE"] # count of matches

For complete dataframe, do not provide column or index name

sel <- source_df[, ] == target_df[, ] # gives you dataframe with TRUE and FALSE values
table(sel)["FALSE"] # count of differences
table(sel)["TRUE"] # count of matches

Select rows from one data.frame that are not present in a second data.frame

14 Answers14

Linked

Related