Is there a way to eliminate duplicate strings inside a column value? (Please check the details below, I really don't know how to put it concisely)

Question

Here is my dataframe

  mutations   Pos Dataset      percentage_occurance newCol
  <chr>     <dbl> <chr>                       <dbl> <chr>
1 P323L       323 jan_jun_2021              99.2    P323L, D614G, P323L, D614G
2 D614G       614 jan_jun_2021              99.9    P323L, D614G, P323L, D614G
3 D279N       279 jan_jun_2021               6.30   D279N, S194L, N440K, R52I, S235F, L126F, E261stop, S97I, S2P,…

I basically want to remove the duplicates inside each value of the column newCol and get a dataframe with the same columns. So the output should look like this:

  mutations   Pos Dataset      percentage_occurance newCol
  <chr>     <dbl> <chr>                       <dbl> <chr>
1 P323L       323 jan_jun_2021              99.2    P323L, D614G
2 D614G       614 jan_jun_2021              99.9    P323L, D614G

The way I have been trying to do this is:


head(data7) %>% mutate(newCol = str_split(newCol, ", "))

But it turns newCol into a list column instead of a character column:

# A tibble: 6 × 5
# Groups:   mutations, Dataset, Pos [6]
  mutations   Pos Dataset      percentage_occurance newCol       
  <chr>     <dbl> <chr>                       <dbl> <list>       
1 P323L       323 jan_jun_2021              99.2    <chr [4]>    
2 D614G       614 jan_jun_2021              99.9    <chr [4]>    
3 D279N       279 jan_jun_2021               6.30   <chr [55]>   
4 A598S       598 jan_jun_2021               0.0538 <chr [3,157]>
5 P681H       681 jan_jun_2021               9.92   <chr [34]>   
6 G204R       204 jan_jun_2021              13.3    <chr [21]>

Is there any way to get my desired output for every row? My dataframe has 3890 rows and I want to do this for all of them.
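
To make the goal concrete, here is a minimal sketch of the kind of transformation I am after, run on a toy tibble rather than my real data. The name `demo` and its two rows are made up for illustration (copied from the rows shown above), and I'm assuming tibble, stringr and purrr are loaded alongside dplyr. The idea is to split each string, keep only the unique pieces, and paste them back together, but I'm not sure this is the right approach for the full dataframe:

```r
library(dplyr)
library(tibble)
library(stringr)
library(purrr)

# Toy stand-in for data7 (hypothetical; only two of the columns are reproduced)
demo <- tibble(
  mutations = c("P323L", "D614G"),
  newCol    = c("P323L, D614G, P323L, D614G",
                "P323L, D614G, P323L, D614G")
)

result <- demo %>%
  mutate(
    # str_split() returns a list with one character vector per row;
    # map_chr() turns each vector back into a single string after
    # dropping repeated entries with unique()
    newCol = map_chr(
      str_split(newCol, ", "),
      ~ str_c(unique(.x), collapse = ", ")
    )
  )

result$newCol
```

On the toy data, both rows of newCol should come back as "P323L, D614G", and the column stays a plain character column rather than a list.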

I am new to Stack Overflow, so please forgive any mistakes I have made while posing the question. Thanks in advance :).
