Comparing two dataframes in R with different number of rows, seeking to find differences in cells

Question

So I have two dataframes each from a different year. The second year df has additional rows as well as updated values in some cells. My goal here is to get a new dataframe which shows just the things that have been added or changed, everything else can be 0,NA or removed.

Take a look at df 1 (y1):

project_ID  sequence  item         q1    q2    q3   q4
NA          NA        NA           NA    NA    NA   NA
NA          207       period       201h  202h  203h 204h     
NA          222       prepayment   1202  202.3 99   2922
2455        271       prepayment_2 1000  1000  1000 1000
2929        780       UPS          50    51    52   53
NA          NA        NA           NA    NA    NA   NA

So now a year goes by and I have this new data set (y2), note the differing rows and certain changed values.

project_ID  sequence  item         q1    q2    q3   q4
NA          NA        NA           NA    NA    NA   NA
NA          207       period       201h  202h  203h 204h     
NA          222       prepayment   1202  202.3 99   2922
2455        271       prepayment_2 999   999   1002 1000
3002        299       payment      500   500   500  500
2929        780       UPS          50    51    52   53 
NA          NA        NA           NA    NA    NA   NA

So I attempted to use the compare() function in library(compare) however this does not have the functionality I am looking for, to my knowledge.

cmp<- compare(df1,df2)
cmp$tM

but this doesn't really help me especially since the rows are different. Furthermore it would only tell me which are different without calculating the differences.

So what I would like to see is a new dataframe which looks like this:

project_ID  sequence  item         q1    q2    q3   q4



2455        271       prepayment_2 -1    -1     2   
3002        299       payment      500   500   500  500

Now this is the best way I can think of doing it but at it's core I just need a new df with just what has changed and the difference of the values that have changed; spacing is not that important and if it would be easier to lay it out differently I'm all ears.

EDIT: Here are the two df's for R.

y1<- structure(list(project_ID = c("NA", "NA", "NA", "2455", "2929", 
"NA"), sequence = c("NA", "207", "222", "271", "780", "NA"), 
    item = c("NA", "period", "prepayment", "prepayment_2", "UPS", 
    "NA"), q1 = c("NA", "201h", "1202", "1000", "50", "NA"), 
    q2 = c("NA", "202h", "202.3", "1000", "51", "NA"), q3 = c("NA", 
    "203h", "99", "1000", "52", "NA"), q4 = c("NA", "204h", "2922", 
    "1000", "53", "NA")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))


y2 <- structure(list(project_ID = c("NA", "NA", "NA", "2455", "3002", 
"2929", "NA"), sequence = c("NA", "207", "222", "271", "299", 
"780", "NA"), item = c("NA", "period", "prepayment", "prepayment_2", 
"payment", "UPS", "NA"), q1 = c("NA", "201h", "1202", "999", 
"500", "50", "NA"), q2 = c("NA", "202h", "202.3", "999", "500", 
"51", "NA"), q3 = c("NA", "203h", "99", "1002", "500", "52", "NA"
), q4 = c("NA", "204h", "2922", "1000", "500", "53", "NA")), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))

Have a look at the `*_join` family of functions from `dplyr` which are also available in data.table and base R. Also try to make a reprex so people can better help. — boshek, Jul 29 '19 at 21:10
Possible duplicate of [Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2](https://stackoverflow.com/questions/3171426/compare-two-data-frames-to-find-the-rows-in-data-frame-1-that-are-not-present-in) — Shree, Jul 29 '19 at 21:24
Your frames have strings for values, how do you propose to find numeric differences when you don't have numbers? — r2evans, Jul 29 '19 at 22:22

r2evans · Accepted Answer · 2019-07-31T15:18:42.520

As has been suggested, the *_join family of functions are useful, along with a smattering of reshaping from wide to long and back to wide.

Note: I'm assuming that anything that looks like it could be numeric is numeric, transforming "201h" to 201. (If this is not true, please update your sample data.)

library(dplyr)
library(tidyr)
full_join(
  gather(y1, q, val1, -project_ID, -sequence, -item) %>% mutate(in1 = TRUE),
  gather(y2, q, val2, -project_ID, -sequence, -item) %>% mutate(in2 = TRUE),
  by = c("project_ID", "sequence", "item", "q")
) %>%
  # mutate_at(vars(val1, val2), ~ as.numeric(gsub("[^.[:digit:]]", "", .))) %>%
  mutate_at(vars(val1, val2), ~ suppressWarnings(as.numeric(.))) %>%
  mutate(
    # valdiff = val2 - val1
    valdiff = case_when(
      is.na(val1) ~ val2,
      is.na(val2) ~ val1,
      TRUE ~ val2 - val1
    )
  ) %>%
  select(-val1, -val2) %>%
  distinct() %>%
  spread(q, valdiff)
# # A tibble: 6 x 9
#   project_ID sequence item         in1   in2      q1    q2    q3    q4
#   <chr>      <chr>    <chr>        <lgl> <lgl> <dbl> <dbl> <dbl> <dbl>
# 1 2455       271      prepayment_2 TRUE  TRUE     -1    -1  -898     0
# 2 2929       780      UPS          TRUE  TRUE      0     0     0     0
# 3 3002       299      payment      NA    TRUE    500   500   500   500
# 4 NA         207      period       TRUE  TRUE     NA    NA    NA    NA
# 5 NA         222      prepayment   TRUE  TRUE      0     0     0     0
# 6 NA         NA       NA           TRUE  TRUE     NA    NA    NA    NA

(I assume the difference in my output from your expected output is due to copy/paste issues in your data, perhaps the 102 in y2 is intended to be 1002?)

yeah this is awesome and the jist of what I was looking for. So the 201h, only one row will ever be like that and never changing between files. Therefore it may be ignored entirely every time. But that is how it is. — Johnny Thomas, Jul 30 '19 at 03:33
also, major issue. The row entitled "payment" found in y2 is not found in the code you created showing all differences. being that this table exists in y2 but not y1, it's not exactly right but this so far is near perfect! — Johnny Thomas, Jul 30 '19 at 04:02
Issue 1: fixed with the use of `full_join`. Issue 2: since your data has all strings (no numbers), it was just inference without guidance. I've adjusted my code to not try to fix the `h` thing. — r2evans, Jul 30 '19 at 17:56
this however still doesn't work because we do not have the value of the new row; we should have 500's in the new "payment" row which isn't happening :/ — Johnny Thomas, Jul 31 '19 at 14:06
R is very intentional when it comes to math where an `NA` is involved: `NA` often means *"it could be anything"*, due to any number of factors; it can also mean *"a value does not exist in that location"*, perhaps because it was not gather or it was removed. Either way, math operations with an `NA` default to `NA` as a result, forcing you to override and say *"I know why this is not available and why we should instead do something else instead"*. See my edit. — r2evans, Jul 31 '19 at 15:20

Comparing two dataframes in R with different number of rows, seeking to find differences in cells

1 Answers1