Background/Underlying problem:
I have a data extraction job from a production database that generates a dataset used as reference for further analytics work . Between every build of the dataset I want to be able to detect if anything has changed in the dataset. The dataset is stored as a list of dataframes.
Question:
The following code is a simplified version of my dataset
account_1 <- data.frame(id = c(1,2), name=c("Andy", "Ben"))
account_2 <- data.frame(id = c(1,2,3), name=c("Andy", "Ben2","Ceasar"))
bill <- data.frame(id=c(101,102,103), account_id = c(1,2,3), amount=c(100,100,100))
db_1 = list(account=account_1, bill=bill)
db_2 = list(account=account_2, bill=bill)
I want to be able to detect the delta between db_1 and db_2, which is in the account data frame. Pseudo code:
delta(db_1, db_2)
Should return
account
id name
2 Ben2
3 Ceasar
bill
<<NULL>>
The response is read by human, not a machine, so there a no tight formal requirements on the response.
Other items to consider:
- relatively small data sets, no need to bother about memory/speed
- I like tidyverse code
- If we can have git-style comments around new records (Ceasar) and modified (Ben) that is great
Edit: Data frame comparison can be done by
- anti_joins (thanks Linus) OR
- the response in Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2 (thanks Vivek)
BUT what is the best way of looping through the list of data.frames