
Example that works:

df <- data.frame(c0=c(1, 2), c1=c("A,B,C", "D,E,F"), c2=c("B,C", "D,E"))
df
#   c0    c1  c2
# 1  1 A,B,C B,C
# 2  2 D,E,F D,E

# Add a column d with difference between c1 and c2
df %>% mutate(d=setdiff(unlist(strsplit(as.character(c1), ",")), unlist(strsplit(as.character(c2), ","))))

#   c0    c1  c2 d
# 1  1 A,B,C B,C A
# 2  2 D,E,F D,E F

I get what I expected above: d is assigned the difference between these two lists of characters (they are already sorted).

However, if a row differs by more than one character, it no longer works:

df <- data.frame(c0=c(1, 2), c1=c("A,B,C", "D,E,F,G"), c2=c("B,C", "D,E"))
df
#   c0      c1  c2
# 1  1   A,B,C B,C
# 2  2 D,E,F,G D,E

# Add a column d with difference between c1 and c2
df %>% mutate(d=setdiff(unlist(strsplit(as.character(c1), ",")), unlist(strsplit(as.character(c2), ","))))
Error: wrong result size (3), expected 2 or 1

What I wanted to get there is:

  c0      c1  c2   d
1  1   A,B,C B,C   A
2  2 D,E,F,G D,E F,G

I've tried wrapping setdiff in paste(), but that didn't help. Ultimately I want to split the d column out into new rows (I was thinking of tidyr::separate), like this:

  c0      c1  c2 d
1  1   A,B,C B,C A
2  2 D,E,F,G D,E F
3  2 D,E,F,G D,E G

What am I doing wrong with the setdiff above?

Thanks

Tim


1 Answer

You get the error because row 2 produces more than one element, which cannot fit into a single cell. One way around this is to use rowwise() and wrap the result in list() so that it fits, then use unnest() from tidyr to expand the list column:

library(dplyr)
library(tidyr)
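df %>% 
      rowwise() %>%   # evaluate the mutate() expression one row at a time
      mutate(d=list(setdiff(unlist(strsplit(as.character(c1), ",")), 
                            unlist(strsplit(as.character(c2), ","))))) %>% 
      unnest(d)       # one output row per element of the list column d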

# Source: local data frame [3 x 4]

#      c0      c1     c2     d
#   <dbl>  <fctr> <fctr> <chr>
# 1     1   A,B,C    B,C     A
# 2     2 D,E,F,G    D,E     F
# 3     2 D,E,F,G    D,E     G
Psidom
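
To see where the size-3 result in the error message comes from, here is a minimal base-R sketch of what the expression computes when it is evaluated once for the whole column rather than once per row. The vectors c1 and c2 are copied from the question; the names all_c1, all_c2, whole_column and per_row are made up for this illustration.

```r
# Column values from the failing example in the question
c1 <- c("A,B,C", "D,E,F,G")
c2 <- c("B,C", "D,E")

# Without rowwise(), the expression sees the whole column,
# so unlist() flattens the pieces of every row into one vector.
all_c1 <- unlist(strsplit(c1, ","))
all_c2 <- unlist(strsplit(c2, ","))

# One set difference for the entire column: 3 values for 2 rows
whole_column <- setdiff(all_c1, all_c2)
print(whole_column)
print(length(whole_column))

# Row by row, the differences have different lengths (1 and 2),
# which is why each result has to be wrapped in a list
per_row <- Map(setdiff, strsplit(c1, ","), strsplit(c2, ","))
print(lengths(per_row))
```

With the first data set ("D,E,F" in row 2), the same whole-column difference has exactly two elements, which happens to match the number of rows, so mutate accepted it.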
  • I've not come across rowwise and unnest before. To check I understand: rowwise makes subsequent summarise and mutate operations work within each row (when I tried paste as above without the rowwise it joined values from all rows). unnest does what I was proposing `tidyr::separate` for - duplicates rows for each list element in d. I found this post useful for unnest: http://bioinfoblog.it/2015/02/the-most-useful-r-command-unnest-from-tidyr/comment-page-1/. Thanks again @Psidom – Tim Aug 23 '16 at 23:44
  • The paste fails because it is vectorized: without rowwise to constrain the operation to a single row, it treats the whole column as one vector. `separate` is meant to split a column into multiple columns, while `unnest` expands a list column (one whose elements can each hold several values, as in this case) into rows. – Psidom Aug 24 '16 at 01:12
  • Note: see my related question here: http://stackoverflow.com/questions/39883000/dplyr-grouped-cumulative-set-counting-using-group-by-and-rowwise-do – Tim Oct 05 '16 at 23:41
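
The paste route discussed in the comments can be sketched as follows. This is only an illustration, not the method from the answer: it assumes reasonably recent dplyr and tidyr versions, uses tidyr::separate_rows (rather than separate) to split the collapsed string into rows, and the names result and long are made up here.

```r
library(dplyr)
library(tidyr)

df <- data.frame(c0 = c(1, 2),
                 c1 = c("A,B,C", "D,E,F,G"),
                 c2 = c("B,C", "D,E"),
                 stringsAsFactors = FALSE)

# rowwise() makes paste() see one row at a time, so collapse = ","
# joins only that row's difference into a single string
result <- df %>%
  rowwise() %>%
  mutate(d = paste(setdiff(strsplit(c1, ",")[[1]],
                           strsplit(c2, ",")[[1]]),
                   collapse = ",")) %>%
  ungroup()
print(result)

# separate_rows() then splits the collapsed string into one row per value
long <- separate_rows(result, d, sep = ",")
print(long)
```

The intermediate result holds "F,G" in a single cell for row 2, which matches the first table the question asked for; long matches the second.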