Create a new dataframe according to the contrast between two similar df

Question

I have a dataframe made like this:

After several steps (not important which one) i obtained this df:

i want to obtain a new dataframe made by only the rows which didn't change during the steps; the result would be this one:

 X  Y  Z  T
 1  2  4  2
 7  5  NA 3

How could I do?

@zx8754 this might not be a duplicate as the join might not be a correct answer, depending on the structure of the data. — ira, Sep 11 '17 at 11:28
@ira based on example data provided, and expected output, it is a simple `merge` by all columns, pointed out above by thelatemail. — zx8754, Sep 11 '17 at 11:31
@zx8754 the example might be simplified. This line `i want to obtain a new dataframe made by only the rows which didn't change during the steps; ` does not appear to request a result of a join. — ira, Sep 11 '17 at 12:24
@zx8754 no, merge doesn't give me the result that I expect. Akrun gave me the solution — Silvia, Sep 11 '17 at 15:52
I think it could not be classified as a duplicate of a "merge" question. This is only my humble vision :) — Silvia, Sep 11 '17 at 15:55

score 4 · Accepted Answer · answered Sep 11 '17 at 10:19

One option with base R would be to paste the rows of each dataset together and compare (==) to create a logical vector which we use for subsetting the new dataset

dfO[do.call(paste, dfO) == do.call(paste, df),]
#   X Y  Z T
#1 1 2  4 2
#3 7 5 NA 3

where 'dfO' is the old dataset and 'df' is the new

score 4 · Answer 2 · answered Sep 11 '17 at 10:21

You can use dplyr's intersect function:

library(dplyr)
intersect(d1, d2)
#  X Y  Z T
#1 1 2  4 2
#2 7 5 NA 3

This is a data.frame-equivalent of base R's intersect function.

In case you're working with data.tables, that package also provides such a function:

library(data.table)
setDT(d1)
setDT(d2)
fintersect(d1, d2)
#   X Y  Z T
#1: 1 2  4 2
#2: 7 5 NA 3

score 3 · Answer 3 · answered Sep 11 '17 at 10:25

Another dplyr solution: semi_join.

dt1 %>% semi_join(dt2, by = colnames(.))
  X Y  Z T
1 1 2  4 2
2 7 5 NA 3

Data

dt1 <- read.table(text = "X Y  Z T
  1 2  4 2
  3 2  1 4
  7 5 NA 3",
                  header = TRUE, stringsAsFactors = FALSE)

dt2 <- read.table(text = "  X Y Z T
  1 2 4 2
                  3 2 NA 4
                  7 5 NA 3",
                  header = TRUE, stringsAsFactors = FALSE)

score 3 · Answer 4 · answered Sep 11 '17 at 11:21

3

I am afraid that neither semi join, nor intersect or merge are the correct answers. merge and intersect will not handle duplicate rows properly. semi join will change order of the rows.

From this perspective, I think the only correct one so far is akrun's.

You could also do something like:

df1[rowSums(((df1 == df2) | (is.na(df1) & is.na(df2))), na.rm = T) == ncol(df1),]

But I think akrun's way is more elegant and likely to perform better in terms of speed.

answered Sep 11 '17 at 11:21

ira

2,542
2
22
36

That being said, all of the answers might be ok if you don't have duplicate rows or if you do not care about order of the rows... – ira Sep 11 '17 at 11:24

Create a new dataframe according to the contrast between two similar df

4 Answers4