Detect differences between two lists of data.frames

Question

Background/Underlying problem:

I have a data extraction job from a production database that generates a dataset used as reference for further analytics work . Between every build of the dataset I want to be able to detect if anything has changed in the dataset. The dataset is stored as a list of dataframes.

Question:

The following code is a simplified version of my dataset

account_1 <- data.frame(id = c(1,2), name=c("Andy", "Ben"))
account_2 <- data.frame(id = c(1,2,3), name=c("Andy", "Ben2","Ceasar"))
bill <- data.frame(id=c(101,102,103), account_id = c(1,2,3), amount=c(100,100,100))

db_1 = list(account=account_1, bill=bill)
db_2 = list(account=account_2, bill=bill)

I want to be able to detect the delta between db_1 and db_2, which is in the account data frame. Pseudo code:

delta(db_1, db_2)

Should return

account
id name
2 Ben2
3 Ceasar

bill
<<NULL>>

The response is read by human, not a machine, so there a no tight formal requirements on the response.

Other items to consider:

relatively small data sets, no need to bother about memory/speed
I like tidyverse code
If we can have git-style comments around new records (Ceasar) and modified (Ben) that is great

Edit: Data frame comparison can be done by

anti_joins (thanks Linus) OR
the response in Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2 (thanks Vivek)

BUT what is the best way of looping through the list of data.frames

have a look at the dplyr anti_join it returns the difference between two data.frames. But it would also return Ben2 in your example. — Linus, Dec 23 '17 at 10:20
Possibly you may get https://stackoverflow.com/questions/3171426/compare-two-data-frames-to-find-the-rows-in-data-frame-1-that-are-not-present-in — Vivek Harikrishnan, Dec 23 '17 at 10:27
You might want to try the [diffobj package](https://cran.r-project.org/web/packages/diffobj/vignettes/diffobj.html), which I find useful. — p0bs, Dec 23 '17 at 12:38

Nate · Accepted Answer · 2017-12-23T12:30:25.733

the base function mapply() with let you iterate through multiple objects (lists or vectors):

mapply(function(x, y) dplyr::anti_join(x, y),
       db_2, db_1) # order matters for the result, since it returns rows in x that don't have a match in y

$account
  id   name
1  2   Ben2
2  3 Ceasar

$bill
[1] id         account_id amount    
<0 rows> (or 0-length row.names)

if you want to use tidyverse functions then use purrr::map2():

purrr::map2(db_2, db_1,  ~ anti_join(.x, .y))

there is also pmap() function if you have more than two lists to compare.

Detect differences between two lists of data.frames

1 Answers1