9

I have a dataframe made like this:

  X Y  Z T
  1 2  4 2
  3 2  1 4
  7 5 NA 3

After several steps (not important which one) i obtained this df:

  X Y Z T
  1 2 4 2
  3 2 NA 4
  7 5 NA 3

i want to obtain a new dataframe made by only the rows which didn't change during the steps; the result would be this one:

 X  Y  Z  T
 1  2  4  2
 7  5  NA 3

How could I do?

zx8754
  • 52,746
  • 12
  • 114
  • 209
Silvia
  • 405
  • 4
  • 17
  • 2
    Just a `merge` isn't it - `merge(dt1,dt2)` ? – thelatemail Sep 11 '17 at 10:32
  • @zx8754 this might not be a duplicate as the join might not be a correct answer, depending on the structure of the data. – ira Sep 11 '17 at 11:28
  • @ira based on example data provided, and expected output, it is a simple `merge` by all columns, pointed out above by thelatemail. – zx8754 Sep 11 '17 at 11:31
  • @zx8754 the example might be simplified. This line `i want to obtain a new dataframe made by only the rows which didn't change during the steps; ` does not appear to request a result of a join. – ira Sep 11 '17 at 12:24
  • @zx8754 no, merge doesn't give me the result that I expect. Akrun gave me the solution – Silvia Sep 11 '17 at 15:52
  • I think it could not be classified as a duplicate of a "merge" question. This is only my humble vision :) – Silvia Sep 11 '17 at 15:55

4 Answers4

4

One option with base R would be to paste the rows of each dataset together and compare (==) to create a logical vector which we use for subsetting the new dataset

dfO[do.call(paste, dfO) == do.call(paste, df),]
#   X Y  Z T
#1 1 2  4 2
#3 7 5 NA 3

where 'dfO' is the old dataset and 'df' is the new

akrun
  • 874,273
  • 37
  • 540
  • 662
4

You can use dplyr's intersect function:

library(dplyr)
intersect(d1, d2)
#  X Y  Z T
#1 1 2  4 2
#2 7 5 NA 3

This is a data.frame-equivalent of base R's intersect function.

In case you're working with data.tables, that package also provides such a function:

library(data.table)
setDT(d1)
setDT(d2)
fintersect(d1, d2)
#   X Y  Z T
#1: 1 2  4 2
#2: 7 5 NA 3
talat
  • 68,970
  • 21
  • 126
  • 157
3

Another dplyr solution: semi_join.

dt1 %>% semi_join(dt2, by = colnames(.))
  X Y  Z T
1 1 2  4 2
2 7 5 NA 3

Data

dt1 <- read.table(text = "X Y  Z T
  1 2  4 2
  3 2  1 4
  7 5 NA 3",
                  header = TRUE, stringsAsFactors = FALSE)

dt2 <- read.table(text = "  X Y Z T
  1 2 4 2
                  3 2 NA 4
                  7 5 NA 3",
                  header = TRUE, stringsAsFactors = FALSE)
www
  • 38,575
  • 12
  • 48
  • 84
3

I am afraid that neither semi join, nor intersect or merge are the correct answers. merge and intersect will not handle duplicate rows properly. semi join will change order of the rows.

From this perspective, I think the only correct one so far is akrun's.

You could also do something like:

df1[rowSums(((df1 == df2) | (is.na(df1) & is.na(df2))), na.rm = T) == ncol(df1),]

But I think akrun's way is more elegant and likely to perform better in terms of speed.

ira
  • 2,542
  • 2
  • 22
  • 36
  • That being said, all of the answers might be ok if you don't have duplicate rows or if you do not care about order of the rows... – ira Sep 11 '17 at 11:24