This seems like a basic question, so I apologize in advance if it is a duplicate. I looked around and didn't find anything.

I have two dataframes full of strings. I'd like to see if they are EXACT duplicates of each other.

If they are not, I'd like to determine which values are different.

Specifically, given this dataframe:

| x | y |
|---|---|
| a | e |
| b | f |
| c | g |
| d | h |

and this dataframe:

| x | y |
|---|---|
| a | l |
| b | m |
| j | g |
| k | h |

I would like to generate this result (a df full of non-matching values):

| x | y |
|---|---|
|   | l |
|   | m |
| j |   |
| k |   |

This question is super close to what I'm thinking, but it wants to find full rows that are the same, not values.

1) I don't think I have any choice other than to iterate over each value, one by one, testing via string matching. I know that `df1 %in% df2` will test for rows, but how do I test each element?

2) After I can test each element, I'd need to construct a dataframe to store the non-matches. I'm not sure how to do it.

It seems like a simple idea, but breaking it down, the implementation actually seems rather complex. Any bumps in the right direction would be greatly appreciated.

My data:

df1 <- data.frame(
  x = c('a', 'b', 'c', 'd'),
  y = c('e', 'f', 'g', 'h')
)


df2 <- data.frame(
  x = c('a', 'b', 'j', 'k'),
  y = c('l', 'm', 'g', 'h')
)
Monica Heddneck

1 Answer

You could do:

df2[mapply(function(x, y) x %in% y, df1, df2)] <- NA
     x    y
1 <NA>    l
2 <NA>    m
3    j <NA>
4    k <NA>

This modifies df2 in place, so work on a copy if you still need the original.

Explanation:
mapply() applies %in% to the first column of df1 and the first column of df2, then to the second pair of columns, and so on for any further columns.
This gives:

> mapply(function(x, y) x %in% y, df1, df2)
         x     y
[1,]  TRUE FALSE
[2,]  TRUE FALSE
[3,] FALSE  TRUE
[4,] FALSE  TRUE

TRUE marks the values that matched; these are the ones we want to change to NA.
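
One caveat worth keeping in mind: `%in%` checks whether each value of a df1 column occurs anywhere in the corresponding df2 column, not whether it sits in the same row. The two agree on the example data, but if row position matters, a position-wise comparison may be closer to what is wanted. The following is only a sketch under the assumption that both data frames have the same dimensions and column order; `same` and `diffs` are names introduced here for illustration, and `as.character()` is used to sidestep differing factor levels.

```r
df1 <- data.frame(x = c('a', 'b', 'c', 'd'), y = c('e', 'f', 'g', 'h'))
df2 <- data.frame(x = c('a', 'b', 'j', 'k'), y = c('l', 'm', 'g', 'h'))

# Compare the columns pairwise, position by position. as.character() lets this
# work whether the columns are factors or plain character vectors.
same <- mapply(function(x, y) as.character(x) == as.character(y), df1, df2)

# Work on a copy (hypothetical name `diffs`) so df2 itself is left untouched.
diffs <- df2
diffs[same] <- NA
diffs
```

Because the replacement happens on `diffs`, df2 keeps its original values.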

Haboryme
  • I think an equivalent but simpler version of this would be to replace `mapply(...)` with `df1 == df2`, which should also evaluate element by element – Chrisss Jan 18 '17 at 20:38
  • Indeed. Using by row loops when this could be easily vectorized seems unnecessary – David Arenburg Jan 18 '17 at 20:47
  • @Chrisss & @David Arenburg: df1==df2 gives me the following error: `Error in Ops.factor(left, right) : level sets of factors are different` ,hence my choice. – Haboryme Jan 18 '17 at 20:52
  • I suppose I could have just added `stringsAsFactors = F` to solve the problem, silly me. – Haboryme Jan 18 '17 at 21:05
  • I used `df1[df1 != df2]` but I lose the rectangularity of the result: the non-matches collapse into a flat character vector, `chr [1:4] "c" "d" "e" "f"`. Any idea how I can return the result in a dataframe format? – Monica Heddneck Jan 18 '17 at 23:49
  • @MonicaHeddneck You have to do something like `df1[df1==df2]<-""` if you want to keep the format, otherwise you will only get a vector. – Haboryme Jan 19 '17 at 06:46
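
Pulling the comment thread together, the vectorized route might look like the sketch below. It assumes the data frames are built with `stringsAsFactors = FALSE` (so `==` compares plain strings) and have identical dimensions; `result` is a name introduced here for illustration. `identical()` answers the "exact duplicates" part of the question, and assigning into a copy keeps the data frame shape instead of collapsing to a vector.

```r
df1 <- data.frame(
  x = c('a', 'b', 'c', 'd'),
  y = c('e', 'f', 'g', 'h'),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  x = c('a', 'b', 'j', 'k'),
  y = c('l', 'm', 'g', 'h'),
  stringsAsFactors = FALSE
)

# TRUE only when the two data frames are exact duplicates of each other.
identical(df1, df2)

# df1 == df2 is a logical matrix of element-wise matches. Blank out the
# matching cells in a copy (hypothetical name `result`) to keep the layout.
result <- df2
result[df1 == df2] <- ""
result
```

Using `NA` instead of `""` in the assignment would mark the matches as missing rather than as empty strings.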