3

I am working with different data sets in R. Each data set will have upto 16 columns and 1000s of records. I am trying to find a way to compare two data sets at a time so that I can find the records that were deleted/Updated/Added. I will be using the ID column and color column to identify the variances. Below is a small set example (have not included all the columns):

df1 <- data.frame(ID = letters[1:5], color = c("blue", "white", "red", "green", "blue"))

df2 <- data.frame(ID = c("a","c","d","d"), color = c("blue", "yellow", "green", "blue"))

ID will be the common factor between the datasets.

I need to compare the datasets to get three different set of values:

New Records: records that appear in df1 but not in df2. So I should get:

ID  Color
b   white
c   red
e   blue

Deleted records: records that DO NOT appear in df1 but do appear in df2

   ID    Color
    c     yellow
    d     blue

Updated records This is the most important one I need. Basically anything with the same ID that has a different color:

   ID  df1color  df2color
    c   red       yellow

I have tried using the joins from dplyr package....but was not successful. Is there a way I can do this in R.

jalsa
  • 101
  • 2
  • 2
  • 7
  • Possible duplicate of http://stackoverflow.com/questions/20225110/comparing-two-dataframes-and-getting-the-differences – Stedy Sep 27 '16 at 20:48
  • @Stedy The link is python solution. – zx8754 Sep 27 '16 at 20:54
  • kind of...but it still does not talk about how to find the records that have been updated – jalsa Sep 27 '16 at 20:57
  • 2
    Oy! here is a better link: http://stackoverflow.com/questions/3171426/compare-two-data-frames-to-find-the-rows-in-data-frame-1-that-are-not-present-in – Stedy Sep 27 '16 at 21:14

2 Answers2

1

One problem you're probably dealing with is that data.frame() is mixing up character and factor variables behind the scenes. Check out the str() of your data frames. Instead, better to use tibble(), which you can get from dplyr or tibble packages.

Then, from link @Stedy posted, you can use dplyr's anti_join() to handle first two problems. The final one can be done by applyinginner_join() to a data frame of new records, and then filter() to find changes. See example below:

library(dplyr)

df1 <- tibble(ID = c(letters[1:5]), color = c("blue", "white", "red", "green", "blue"))
df2 <- tibble(ID = c("a","c","d","d"), color = c("blue", "yellow", "green", "blue"))

# New Records
anti_join(df1, df2)
#> # A tibble: 3 x 2
#>      ID color
#>   <chr> <chr>
#> 1     e  blue
#> 2     c   red
#> 3     b white

# Deleted records (simply swap arguments around)
anti_join(df2, df1)
#> # A tibble: 2 x 2
#>      ID  color
#>   <chr>  <chr>
#> 1     d   blue
#> 2     c yellow

# Updated records
new_records <- anti_join(df1, df2)
inner_join(new_records, df2, by = "ID", suffix = c(".df1", ".df2")) %>%
  filter(color.df1 != color.df2)
#> # A tibble: 1 × 3
#>      ID color.df1 color.df2
#>   <chr>     <chr>     <chr>
#> 1     c       red    yellow
Simon Jackson
  • 3,134
  • 15
  • 24
  • 1
    Hi Simon - thanks for your help. and thank you for introducing me to tibble. It is so much better!! Your query for new and deletes work, however the update is not exactly what I was looking for. For updates, the result should only have ID 'c'. Since 'd' had two colors in df2 but one color in df1--it will be considered as a 'delete' vs an update. Can you help? – jalsa Sep 28 '16 at 19:48
  • Thanks @jalsa. Yeah, tibbles are great! I've just updated the post to handle your request. I opted to join/filter using new records in df1. This takes an extra step, but I think it's necessary. Hope it helps. – Simon Jackson Sep 28 '16 at 21:17
  • Hi Simon - thank you again...It is working with this example. I will try with my real dataset. thanks again. – jalsa Sep 30 '16 at 19:26
0

I think there might be some problems in your question. For example, the IDs of df2 include a, c, d, and d all of which are ID values in df1. So shouldn't the deleted records matrix be empty?

Regardless, I put together a script that might be what you're after. Please let me know if it isn't and Ill try again.

df1 <- data.frame(ID = letters[1:5], color = c("blue", "white", "red", "green", "blue"))

df2 <- data.frame(ID = c("a","c","d","d"), color = c("blue", "yellow", "green", "blue"))

df1=as.matrix(df1)
df2=as.matrix(df2)

##########################
## find the new records ## 
##########################

## define a new record matrix
n.r = matrix(NA,nrow=nrow(df1),ncol=nrow(df2))

## loop over the rows in the new matrix
i=1
while(i<=nrow(n.r)) {
    n <- df1[i,1]==df2[,1]
    n.r[i,] <- n

    i=i+1
}

## these are your new records
df1[-(which(n.r == TRUE, arr.ind=TRUE)[,1]),]

##############################    
## find the deleted records ##
##############################

## define a deleted records matrix
d.r = matrix(NA,ncol=nrow(df1),nrow=nrow(df2))

## loop over the rows in the deleted matrix
i=1
while(i<=nrow(d.r)) {
    d <- df2[i,1]==df1[,1]
    d.r[i,] <- d

    i=i+1
}

## these are your deleted records
df2[-(which(d.r == TRUE, arr.ind=TRUE)[,1]),]

##############################
## find the updated records ##
##############################

## define the same matrix
s.m <- which(n.r==TRUE,arr.ind=TRUE)
## consider the ith row of the same matrix (s.m[i,])
## s.m shows that df1[s.m[i,1],1] == df2[s.m[i,2],1] 

## now define a updated record matrix
u.r <- rep(NA,nrow(s.m))

i=1
while(i<=nrow(s.m)) {
    u.r[i] <- df1[s.m[i,1],2] == df2[s.m[i,2],2]

    i=i+1
}

## these are your updated records
cbind(df1[s.m[which(u.r == FALSE),1],],df2[s.m[which(u.r == FALSE),2],2])

please notice the ID of 'd' shows up TWICE in df2 and only one of them is updated (or different) in df1. This may need to be modified to fit your needs/goals.

Alex Witsil
  • 855
  • 8
  • 22