1

Background/Underlying problem:

I have a data extraction job from a production database that generates a dataset used as reference for further analytics work . Between every build of the dataset I want to be able to detect if anything has changed in the dataset. The dataset is stored as a list of dataframes.

Question:

The following code is a simplified version of my dataset

account_1 <- data.frame(id = c(1,2), name=c("Andy", "Ben"))
account_2 <- data.frame(id = c(1,2,3), name=c("Andy", "Ben2","Ceasar"))
bill <- data.frame(id=c(101,102,103), account_id = c(1,2,3), amount=c(100,100,100))

db_1 = list(account=account_1, bill=bill)
db_2 = list(account=account_2, bill=bill)

I want to be able to detect the delta between db_1 and db_2, which is in the account data frame. Pseudo code:

delta(db_1, db_2)

Should return

account
id name
2 Ben2
3 Ceasar

bill
<<NULL>>

The response is read by human, not a machine, so there a no tight formal requirements on the response.

Other items to consider:

  • relatively small data sets, no need to bother about memory/speed
  • I like tidyverse code
  • If we can have git-style comments around new records (Ceasar) and modified (Ben) that is great

Edit: Data frame comparison can be done by

BUT what is the best way of looping through the list of data.frames

rgustavs
  • 97
  • 7
  • have a look at the dplyr anti_join it returns the difference between two data.frames. But it would also return Ben2 in your example. – Linus Dec 23 '17 at 10:20
  • Possibly you may get https://stackoverflow.com/questions/3171426/compare-two-data-frames-to-find-the-rows-in-data-frame-1-that-are-not-present-in – Vivek Harikrishnan Dec 23 '17 at 10:27
  • You might want to try the [diffobj package](https://cran.r-project.org/web/packages/diffobj/vignettes/diffobj.html), which I find useful. – p0bs Dec 23 '17 at 12:38
  • P0bs - this package was new to me. Very helpful for. Thanks – rgustavs Dec 23 '17 at 17:10

1 Answers1

2

the base function mapply() with let you iterate through multiple objects (lists or vectors):

mapply(function(x, y) dplyr::anti_join(x, y),
       db_2, db_1) # order matters for the result, since it returns rows in x that don't have a match in y

$account
  id   name
1  2   Ben2
2  3 Ceasar

$bill
[1] id         account_id amount    
<0 rows> (or 0-length row.names)

if you want to use tidyverse functions then use purrr::map2():

purrr::map2(db_2, db_1,  ~ anti_join(.x, .y))

there is also pmap() function if you have more than two lists to compare.

Nate
  • 10,361
  • 3
  • 33
  • 40