2

Hi I would like to split a delimiter into new rows: there are a few similar posts on stackoverflow, however I can't find one that addresses issues with eliminating duplicates. I've tried several ways.

df <- read.table(text= 'sample, GENE1
s1 A,B
s4 B,A,A,C,C'
, header = TRUE, stringsAsFactors = FALSE)

df %>%
  mutate(GENE1b =  unique(strsplit(as.character(GENE1), ",")) )  %>%
  unnest(GENE1b)    

code above will only produce

# A tibble: 7 x 3
  sample. GENE1     GENE1b
  <chr>   <chr>     <chr> 
1 s1      A,B       A     
2 s1      A,B       B     
3 s4      B,A,A,C,C B     
4 s4      B,A,A,C,C A     
5 s4      B,A,A,C,C A     
6 s4      B,A,A,C,C C     
7 s4      B,A,A,C,C C    

which is incorrect. since s4 should only contain new rows for B,A,C and no duplicates. Of course I can always remove the duplicates afterwards but I'm wondering if there is a way to do it in one go. I also tried converting it back with paste ( x, collapse=",") but this also failed.

Ahdee
  • 4,679
  • 4
  • 34
  • 58
  • `strsplit` returns a list; [this similar post](https://stackoverflow.com/q/54078407/5325862) goes over that. Also see [this one](https://stackoverflow.com/q/38668886/5325862) – camille Sep 16 '21 at 17:33

2 Answers2

2

It may be easier to do this with separate_rows i.e. after the expansion, apply distinct on both columns

library(dplyr)
library(tidyr)
df %>% 
    mutate(GENE1b = GENE1) %>%
    separate_rows(GENE1b) %>% 
    distinct(sample., GENE1b, .keep_all = TRUE)

-output

# A tibble: 5 x 3
  sample. GENE1     GENE1b
  <chr>   <chr>     <chr> 
1 s1      A,B       A     
2 s1      A,B       B     
3 s4      B,A,A,C,C B     
4 s4      B,A,A,C,C A     
5 s4      B,A,A,C,C C      

Regarding the OP's issue with unique - unique expects a vector as input whereas strsplit output is a list. Therefore, we may need to loop over the list or use unlist (but this will not work as unlist will get all the list elements to a single vector.

library(dplyr)
library(purrr)
library(tidyr)
df %>%
    mutate(GENE1b = map(strsplit(GENE1, ","), unique)) %>%
    unnest(GENE1b)

-output

# A tibble: 5 x 3
  sample. GENE1     GENE1b
  <chr>   <chr>     <chr> 
1 s1      A,B       A     
2 s1      A,B       B     
3 s4      B,A,A,C,C B     
4 s4      B,A,A,C,C A     
5 s4      B,A,A,C,C C   
akrun
  • 874,273
  • 37
  • 540
  • 662
  • thanks; I was stump on the second step, tried unlist and failed as well so thanks for explaining this. For the first approach with distinct, is there any way to keep all the columns? So say I have 4 additional columns, using mutate as with the 2nd approach would keep the columns. – Ahdee Sep 16 '21 at 17:23
  • @Ahdee updatd. You just need a copy of the column to be splitted and then when you use `distinct` add `.keep_all = TRUE` (if there are other columns) – akrun Sep 16 '21 at 17:25
  • thank you. Its weird because the .keep_all was not in the help doc. Very useful. Just out of curiosity which of the approach do you prefer? – Ahdee Sep 16 '21 at 17:44
  • @Ahdee depends on whether you are looking for speed or elegance. `strsplit` is fast, but unnest may make it slow. similarly `separate_rows` may be slow. – akrun Sep 16 '21 at 17:45
  • 1
    thank you that was educational. Glad to have learn something today. – Ahdee Sep 16 '21 at 17:48
2

Add group_by(sample.) %>% distinct() to your code:

library(dplyr)
library(tidyr)

    df %>%
        mutate(GENE1b =  unique(strsplit(as.character(GENE1), ",")) )  %>%
        unnest(GENE1b) %>% 
        group_by(sample.) %>% 
        distinct()    
  sample. GENE1     GENE1b
  <chr>   <chr>     <chr> 
1 s1      A,B       A     
2 s1      A,B       B     
3 s4      B,A,A,C,C B     
4 s4      B,A,A,C,C A     
5 s4      B,A,A,C,C C   
TarJae
  • 72,363
  • 6
  • 19
  • 66