17

I'm working with R from a SAS/SQL background, and am trying to write code to take two tables, compare them, and provide a list of the discrepancies. This code would be used repeatedly for many different sets of tables, so I need to avoid hardcoding.

I'm working with Identifying specific differences between two data sets in R , but it doesn't get me all the way there.

Example Data, using the combination of LastName/FirstName (which is unique) as a key --

Dataset One --

Last_Name  First_Name  Street_Address   ZIP     VisitCount
Doe        John        1234 Main St     12345   20
Doe        Jane        4321 Tower St    54321   10
Don        Bob         771  North Ave   23232   5
Smith      Mike        732 South Blvd.  77777   3        

Dataset Two --

Last_Name  First_Name  Street_Address   ZIP     VisitCount
Doe        John        1234 Main St     12345   20
Doe        Jane        4111 Tower St    32132   17
Donn       Bob         771  North Ave   11111   5

   Desired Output --

   LastName FirstName VarName         TableOne        TableTwo
   Doe      Jane      StreetAddress   4321 Tower St   4111 Tower St 
   Doe      Jane      Zip             23232           32132
   Doe      Jane      VisitCount      5               17

Note that this output ignores records where I don't have the same ID in both tables (for instance, because Bob's last name is "Don" in one table, and "Donn" in another table, we ignore that record entirely).

I've explored doing this by applying the melt function on both datasets, and then comparing them, but the size data I'm working with indicates that wouldn't be practical. In SAS, I used Proc Compare for this kind of work, but I haven't found an exact equivalent in R.

zx8754
  • 52,746
  • 12
  • 114
  • 209
Netbrian
  • 621
  • 8
  • 16
  • What's the size of your data? Maybe it is an interesting information to include in your question. :) – Bonifacio2 Jan 20 '15 at 23:22
  • You don't want John Doe included in the result? His name is repeated in both. Or you just want the differences if they are not exact duplicates? – Rich Scriven Jan 20 '15 at 23:24
  • The data size varies a lot, but it usually tops out at about 1 million records or so. – Netbrian Jan 20 '15 at 23:24
  • No, John Doe wouldn't be in the result, because there weren't any mismatches in that record. – Netbrian Jan 20 '15 at 23:24
  • I would recommend `rbind`ing the two together, then remove the exact dupes with `duplicated`, then find the `setdiff` based on first and last name – Rich Scriven Jan 20 '15 at 23:27
  • Richard Scriven -- Could you give a bit more detail/sample code? I'm still new at this, and am not entirely clear on how that would work. – Netbrian Jan 20 '15 at 23:30

3 Answers3

12

Here is a solution based on data.table:

library(data.table)

# Convert into data.table, melt
setDT(d1)
d1 <- d1[, list(VarName = names(.SD), TableOne = unlist(.SD, use.names = F)),by=c('Last_Name','First_Name')]

setDT(d2)
d2 <- d2[, list(VarName = names(.SD), TableTwo = unlist(.SD, use.names = F)),by=c('Last_Name','First_Name')]

# Set keys for merging
setkey(d1,Last_Name,First_Name,VarName)

# Merge, remove duplicates
d1[d2,nomatch=0][TableOne!=TableTwo]

#     Last_Name First_Name        VarName      TableOne      TableTwo
#     1:       Doe       Jane Street_Address 4321 Tower St 4111 Tower St
#     2:       Doe       Jane            ZIP         54321         32132
#     3:       Doe       Jane     VisitCount            10            17

where input data sets are:

# Input Data Sets
d1 <- structure(list(Last_Name = c("Doe", "Doe", "Don", "Smith"), First_Name = c("John", 
"Jane", "Bob", "Mike"), Street_Address = c("1234 Main St", "4321 Tower St", 
"771  North Ave", "732 South Blvd."), ZIP = c(12345L, 54321L, 
23232L, 77777L), VisitCount = c(20L, 10L, 5L, 3L)), .Names = c("Last_Name", 
"First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -4L))                                                                                                               

d2 <- structure(list(Last_Name = c("Doe", "Doe", "Donn"), First_Name = c("John", 
"Jane", "Bob"), Street_Address = c("1234 Main St", "4111 Tower St", 
"771  North Ave"), ZIP = c(12345L, 32132L, 11111L), VisitCount = c(20L, 
17L, 5L)), .Names = c("Last_Name", "First_Name", "Street_Address", 
"ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -3L))
Marat Talipov
  • 13,064
  • 5
  • 34
  • 53
  • How about `melt(dt1, id.vars=c("..."))`? – BrodieG Jan 20 '15 at 23:41
  • From this: http://stackoverflow.com/a/18450519/1898580, I got the idea that `unlist` might be a faster solution – Marat Talipov Jan 20 '15 at 23:43
  • Haven't tested it, but that was likely using the slow `reshape`, not the **[new one added](https://github.com/Rdatatable/data.table#changes-in-v192-on-cran-27-feb-2014)** to `data.table` – BrodieG Jan 20 '15 at 23:46
  • I appreciate both answers, and will continue experimenting. I'm still a bit nervous about how the solution will scale though -- feels like converting to key-value BEFORE comparison can cause large datasets to explode. – Netbrian Jan 22 '15 at 17:34
  • I appreciate both answers, and will continue experimenting. I'm still a bit nervous about how the solution will scale though -- feels like converting to key-value BEFORE comparison can cause large datasets to explode. – Netbrian Jan 22 '15 at 17:34
  • I'm really curious about the performance aspect of this problem. Please let us know if you have a chance to benchmark data.table/dplyr solutions – Marat Talipov Jan 22 '15 at 17:42
7

dplyr and tidyr work well here. First, a slightly reduced dataset:

dat1 <- data.frame(Last_Name = c('Doe', 'Doe', 'Don', 'Smith'),
                   First_Name = c('John', 'Jane', 'Bob', 'Mike'),
                   ZIP = c(12345, 54321, 23232, 77777),
                   VisitCount = c(20, 10, 5, 3),
                   stringsAsFactors = FALSE)
dat2 <- data.frame(Last_Name = c('Doe', 'Doe', 'Donn'),
                   First_Name = c('John', 'Jane', 'Bob'),
                   ZIP = c(12345, 32132, 11111),
                   VisitCount = c(20, 17, 5),
                   stringsAsFactors = FALSE)

(Sorry, I didn't want to type it all in. If it's important, please provide a reproducible example with well-defined data structures.)

Additionally, it looks like your "desired output" is a little off with Jane Doe's ZIP and VisitCount.

Your thought to melt them works well:

library(dplyr)
library(tidyr)
dat1g <- gather(dat1, key, value, -Last_Name, -First_Name)
dat2g <- gather(dat2, key, value, -Last_Name, -First_Name)
head(dat1g)
##   Last_Name First_Name        key value
## 1       Doe       John        ZIP 12345
## 2       Doe       Jane        ZIP 54321
## 3       Don        Bob        ZIP 23232
## 4     Smith       Mike        ZIP 77777
## 5       Doe       John VisitCount    20
## 6       Doe       Jane VisitCount    10

From here, it's deceptively simple:

dat1g %>%
    inner_join(dat2g, by = c('Last_Name', 'First_Name', 'key')) %>%
    filter(value.x != value.y)
##   Last_Name First_Name        key value.x value.y
## 1       Doe       Jane        ZIP   54321   32132
## 2       Doe       Jane VisitCount      10      17
Community
  • 1
  • 1
r2evans
  • 141,215
  • 6
  • 77
  • 149
3

The dataCompareR package aims to solve this exact problem. The vignette for the package includes some simple examples, and I've used this package to solve the original problem below.

Disclaimer: I was involved with creating this package.

library(dataCompareR)

d1 <- structure(list(Last_Name = c("Doe", "Doe", "Don", "Smith"), First_Name = c("John", "Jane", "Bob", "Mike"), Street_Address = c("1234 Main St", "4321 Tower St", "771  North Ave", "732 South Blvd."), ZIP = c(12345L, 54321L, 23232L, 77777L), VisitCount = c(20L, 10L, 5L, 3L)), .Names = c("Last_Name", "First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -4L))                                                                                                               

d2 <- structure(list(Last_Name = c("Doe", "Doe", "Donn"), First_Name = c("John", "Jane", "Bob"), Street_Address = c("1234 Main St", "4111 Tower St", "771  North Ave"), ZIP = c(12345L, 32132L, 11111L), VisitCount = c(20L, 17L, 5L)), .Names = c("Last_Name", "First_Name", "Street_Address", "ZIP", "VisitCount"), class = "data.frame", row.names = c(NA, -3L))

compd1d2 <- rCompare(d1, d2, keys = c("First_Name", "Last_Name"))

print(compd1d2)

All columns were compared, 3 row(s) were dropped from comparison
There are  3 mismatched variables:
First and last 5 observations for the  3 mismatched variables
FIRST_NAME LAST_NAME        valueA        valueB       variable     typeA  typeB diffAB
1       Jane       Doe 4321 Tower St 4111 Tower St STREET_ADDRESS character character       
2       Jane       Doe            10            17     VISITCOUNT   integer   integer     -7
3       Jane       Doe         54321         32132            ZIP   integer   integer  22189

To get a more detailed and pretty summary, the user can run

summary(compd1d2)

The use of FIRST_NAME and LAST_NAME as the 'join' between the two tables is controlled by the keys = argument to the rCompare function. In this case any rows that do not match on these two variables are dropped from the comparison, but you can get a more detailed output on the comparison performed by using summary

RNE
  • 41
  • 4
  • This is a very useful package! I was looking for something to mimic the bash command "diff", but without having to call out to my OS. This fits the bill! – Keegan Smith Oct 18 '17 at 15:08