R - difference between 2 sets in data frame

Question

I have 2 factor columns and want to create a third column that tells me what the second one has that the first does not. It's very similar to this post, but I'm having trouble going from a df to using the setdiff() function.
For example:

library(dplyr)
y1 <- c("a.b.","a.","b.c.d.")
y2 <- c("a.b.c.","a.b.","b.c.d.")
df <- data.frame(y1,y2)

In the first row, column y1 has a.b. and column y2 has a.b.c., so I want a third column to return c. (or just c).

> df
      y1     y2  col3
1   a.b.  a.b.c.  c.
2     a.    a.b.  b.
3 b.c.d.  b.c.d.

I think that it should be a combination of strsplit and setdiff, but I can't get it to work.

I've tried converting the factors to character and then applying strsplit() to the result, but the output seems a bit weird to me. It seems to have created a list within a list, which makes it difficult to pass to setdiff().

#convert factor to character
df <- df %>% mutate_if(is.factor, as.character)
lapply(df$y1,function(x)(strsplit(x,split = "[.]")))

> lapply(df$y1,function(x)(strsplit(x,split = "[.]")))
[[1]]
[[1]][[1]]
[1] "a" "b"


[[2]]
[[2]][[1]]
[1] "a"


[[3]]
[[3]][[1]]
[1] "b" "c" "d"
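
A note on the nesting, offered as a minimal sketch rather than part of the original post: strsplit() is already vectorised and returns one list element per input string, so wrapping it in lapply() adds a second list level. Calling it directly on the whole column should give a flat list that can be passed to setdiff() element by element. The names `parts` and `nested` are only illustrative, and `fixed = TRUE` is used here as an alternative to the `"[.]"` pattern.

```r
y1 <- c("a.b.", "a.", "b.c.d.")

# strsplit() on the whole vector: one character vector per input string
parts <- strsplit(y1, ".", fixed = TRUE)

# lapply() + strsplit() wraps each of those vectors in an extra list
nested <- lapply(y1, function(x) strsplit(x, ".", fixed = TRUE))

parts[[1]]        # "a" "b"
nested[[1]][[1]]  # the same vector, one level deeper
```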

What about `df %>% rowwise() %>% mutate(col3 = gsub(y1,"",y2))`? The problem is that it won't work if y1 has extra characters that y2 does not, but it's an idea for a potentially simpler solution. – Sarah, Apr 18 '18 at 01:12
Actually this produces correct results. I need to show what's in y2 that is not in y1, and I think all the other solutions do the same thing. You can post this as an answer instead of a comment. – jmich738, Apr 18 '18 at 01:42
One issue with using `df %>% rowwise() %>% mutate(col3 = gsub(y1,"",y2))` is that it won't work if the order is changed. Consider if `y1` has `a.b` and `y2` has `b.a.c`. – Ronak Shah, Apr 18 '18 at 01:46

Ronak Shah · Accepted Answer · 2018-04-18T02:51:31.567

5

Update

There was an issue when the difference had more than one element: unlist() then produced extra values, so col3 no longer matched the number of rows. To overcome that, we paste all the elements of each difference together into a single string. This also saves us the unlist step.

df$col3 <- mapply(function(x, y) paste0(setdiff(y, x), collapse = ""),
   strsplit(as.character(df$y1), "\\."), strsplit(as.character(df$y2), "\\."))
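
As a self-contained check of the updated approach, here is a small sketch using the multi-element case raised in the comments (`y2` changed to start with `a.b.c.d.`). `stringsAsFactors = FALSE` is an assumption added so the example behaves the same across R versions; everything else follows the code above.

```r
y1 <- c("a.b.", "a.", "b.c.d.")
y2 <- c("a.b.c.d.", "a.b.", "b.c.d.")
df <- data.frame(y1, y2, stringsAsFactors = FALSE)

# Split both columns on ".", take setdiff(y2, y1) pairwise,
# and collapse each difference into a single string
df$col3 <- mapply(function(x, y) paste0(setdiff(y, x), collapse = ""),
                  strsplit(as.character(df$y1), "\\."),
                  strsplit(as.character(df$y2), "\\."))

df$col3
# [1] "cd" "b"  ""
```

Because every difference is collapsed to exactly one string, col3 is a plain character vector with one value per row. Using `collapse = "."` instead would keep a separator between the elements.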

Original Answer

We can use mapply and split both the columns on "." using strsplit and then take the difference between them using setdiff.

df$col3 <- mapply(function(x, y) setdiff(y, x),
       strsplit(as.character(df$y1), "\\."), strsplit(as.character(df$y2), "\\."))

df
#     y1     y2 col3
#1   a.b. a.b.c.    c
#2     a.   a.b.    b
#3 b.c.d. b.c.d.

If we don't want col3 to be a list we can unlist it. However, unlist drops the character(0) values, so the result would be shorter than the data frame. To retain those rows we need an additional check that replaces character(0) with a placeholder. Taken from here.

unlist(lapply(df$col3,function(x) if(identical(x,character(0))) ' ' else x))

#[1] "c" "b" " "
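
To illustrate why the check is needed, a minimal sketch with a hand-built list standing in for col3 (the name `col3` and the choice of NA as placeholder are assumptions for this example, not part of the answer):

```r
# Stand-in for the list column: the third row has no difference
col3 <- list("c", "b", character(0))

# unlist() silently drops the zero-length element
length(unlist(col3))
# [1] 2

# One alternative: mark empty differences as NA before unlisting
col3[lengths(col3) == 0] <- NA_character_
unlist(col3)
# [1] "c" "b" NA
```

This only handles empty differences. If a difference has more than one element, unlist() still returns more values than rows.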

edited Apr 18 '18 at 02:51

answered Apr 18 '18 at 01:13

Ronak Shah


Is there any way to convert `col3` into just a normal column? When I run `str(df)` it returns `col3` as a `List of 3`. – jmich738 Apr 18 '18 at 01:19
@jmich738 added in the main answer. – Ronak Shah Apr 18 '18 at 01:25
I'm trying to apply this to my whole dataset but it seems that the output of `col3` has fewer rows than the original `df`. I'm still not sure where the problem lies. – jmich738 Apr 18 '18 at 02:06
@jmich738 I hope you are doing this in two steps. First do the `mapply` step and then do the `unlist` one. – Ronak Shah Apr 18 '18 at 02:07
It seems to be the `unlist()` that's causing the issue. The `unlist` produces extra rows. What I'm doing is `df$col3 <- unlist(...)`, but on my actual dataset. I'm still trying to figure out how my sample data is different from my actual data. – jmich738 Apr 18 '18 at 02:16
The problem arises when the difference between the two sets is more than 1 character. If you change `y2 <- c("a.b.c.d.","a.b.","b.c.d.")` then `unlist()` will create an extra row. – jmich738 Apr 18 '18 at 02:46
@jmich738 Apologies, I should have thought about that scenario. Anyway, have updated the answer and it should be fine now. It also reduces one step. – Ronak Shah Apr 18 '18 at 02:52
you are on fire my friend! – jmich738 Apr 18 '18 at 02:57

Maurits Evers · Answer 2 · 2018-04-18T03:12:59.777

4

You can also use purrr::map2:

library(dplyr)
library(purrr)
df %>%
    mutate_if(is.factor, as.character) %>%
    mutate(col3 = map2(strsplit(y2, "\\."), strsplit(y1, "\\."), setdiff))
#      y1     y2 col3
#1   a.b. a.b.c.    c
#2     a.   a.b.    b
#3 b.c.d. b.c.d.

Explanation: Convert factors to character vectors, use setdiff on the "."-split columns y2 and y1. Note that col3 is a list.

Update

It appears that unnest drops the zero-length character entries from the list. So to convert col3 from a list to a character vector you can do:

df %>%
    mutate_if(is.factor, as.character) %>%
    mutate(col3 = map2(strsplit(y2, "\\."), strsplit(y1, "\\."), setdiff)) %>%
    rowwise() %>%
    mutate(col3 = paste(col3, collapse = "."))
## A tibble: 3 x 3
#  y1     y2     col3
#  <chr>  <chr>  <chr>
#1 a.b.   a.b.c. c
#2 a.     a.b.   b
#3 b.c.d. b.c.d. ""

The idea here is to string-concatenate col3 entries (if there are multiple); using rowwise() ensures row-wise paste.

For the updated sample data from your comment:

y1 <- c("a.b.","a.","b.c.d.")
y2 <- c("a.b.c.e.","a.b.","b.c.d.")
df <- data.frame(y1,y2)
df %>%
    mutate_if(is.factor, as.character) %>%
    mutate(col3 = map2(strsplit(y2, "\\."), strsplit(y1, "\\."), setdiff)) %>%
    rowwise() %>%
    mutate(col3 = paste(col3, collapse = "."))
## A tibble: 3 x 3
#  y1     y2       col3
#  <chr>  <chr>    <chr>
#1 a.b.   a.b.c.e. c.e
#2 a.     a.b.     b
#3 b.c.d. b.c.d.   ""
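
A possible variation, sketched here as an assumption and not taken from the answer: purrr's `map2_chr()` can do the set difference and the paste in one step, which returns a character vector directly and avoids the `rowwise()` call. `stringsAsFactors = FALSE` is added so the factor conversion is not needed.

```r
library(dplyr)
library(purrr)

df <- data.frame(y1 = c("a.b.", "a.", "b.c.d."),
                 y2 = c("a.b.c.e.", "a.b.", "b.c.d."),
                 stringsAsFactors = FALSE)

# For each row: elements of y2 that are not in y1, joined with "."
res <- df %>%
    mutate(col3 = map2_chr(strsplit(y2, ".", fixed = TRUE),
                           strsplit(y1, ".", fixed = TRUE),
                           ~ paste(setdiff(.x, .y), collapse = ".")))

res$col3
# [1] "c.e" "b"   ""
```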

edited Apr 18 '18 at 03:12

answered Apr 18 '18 at 01:24

Maurits Evers


For some reason when I run this I do not get row 3, the one with no differences. Would you know why that is? – jmich738 Apr 18 '18 at 01:39
@jmich738 - the `unnest()` removes any rows which are null in the list apparently. – thelatemail Apr 18 '18 at 01:42
@thelatemail ok, so if I run it without `unnest()` I get all the rows – jmich738 Apr 18 '18 at 01:46
@jmich738 and @thelatemail You're right! I hadn't realised that `unnest` drops the zero length `character` entries. Please see my updated solution. – Maurits Evers Apr 18 '18 at 01:54
@MauritsEvers very close, but it seems that if the difference is more than 1 character, then the results are weird. If you set `y2 <- c("a.b.c.e.","a.b.","b.c.d.")`, then the output looks like `c("c", "e")`. – jmich738 Apr 18 '18 at 02:04
@jmich738 Ah bugger! You're right again. I've made another edit. The key is to ensure row-wise `paste`ing. – Maurits Evers Apr 18 '18 at 03:04

score 3 · Answer 3 · answered Apr 18 '18 at 02:16

A very simple but not rigorous approach is to remove every occurrence of y1 from y2 by replacing it with "". This won't handle cases where the order of the elements differs, or where y1 has something that y2 lacks instead of the other way around.

df %>% rowwise() %>% mutate(col3 = gsub(y1,"",y2))
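
To make the ordering caveat concrete, here is a small base R sketch that compares the substring removal with the set difference on a row whose elements are reordered. The use of `mapply()` in place of `rowwise()`, the `fixed = TRUE` argument (so that "." is matched literally and not as a regex wildcard) and the variable names are assumptions for this illustration.

```r
y1 <- c("a.b.", "a.b.")
y2 <- c("a.b.c.", "b.a.c.")

# Substring removal: only works when y1 appears verbatim inside y2
by_gsub <- mapply(function(p, s) gsub(p, "", s, fixed = TRUE),
                  y1, y2, USE.NAMES = FALSE)
by_gsub
# [1] "c."     "b.a.c."

# Set difference on the split elements: independent of order
by_setdiff <- mapply(function(x, y) paste(setdiff(y, x), collapse = "."),
                     strsplit(y1, ".", fixed = TRUE),
                     strsplit(y2, ".", fixed = TRUE))
by_setdiff
# [1] "c" "c"
```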
