I have a dataframe in R with two columns:

  sampleID        annotation
    A1            orange; apple
    A2            apple; apple
    A3            apple; orange; orange; grapes; apple
    A4            grapes; orange

I would like to split the annotation column on the ";" delimiter, keep only the unique values within each row, and get the following output:

  sampleID        annotation
    A1            orange; apple
    A2            apple
    A3            apple; orange; grapes
    A4            grapes; orange
Wiktor Stribiżew
biobudhan
  • Possible duplicate of https://stackoverflow.com/questions/75494268/is-there-a-way-to-to-eliminate-duplicate-strings-inside-a-column-value-please/75494284#75494284 – akrun Feb 20 '23 at 17:27

1 Answer

For each element in data$annotation, split the element, take the unique values, and paste them back into a single string (the paste step is optional; skip it if you want a vector in each element).

base R:

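# vapply() returns a character vector (lapply() would return a list); \(x) needs R >= 4.1
data$annotation <- vapply(data$annotation, \(x) paste(unique(strsplit(x, "; ")[[1]]), collapse = "; "), character(1), USE.NAMES = FALSE)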
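
To see the split / unique / paste steps end to end, here is a minimal sketch on the data from the question. It assumes the data frame is called `data`, that the column is a character vector, and that entries are always separated by exactly "; ". The helper name `dedupe_annotation` is made up for illustration and is not part of the original answer.

```r
# Rebuild the example data from the question (the name `data` is an assumption).
data <- data.frame(
  sampleID = c("A1", "A2", "A3", "A4"),
  annotation = c(
    "orange; apple",
    "apple; apple",
    "apple; orange; orange; grapes; apple",
    "grapes; orange"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one string on "; ", keep the first occurrence
# of each value, and join the survivors back into a single string.
dedupe_annotation <- function(x) {
  parts <- strsplit(x, "; ", fixed = TRUE)[[1]]
  paste(unique(parts), collapse = "; ")
}

# vapply() guarantees one string per row, so the column stays a character vector.
data$annotation <- vapply(data$annotation, dedupe_annotation, character(1), USE.NAMES = FALSE)

print(data)
```

Because unique() keeps values in order of first appearance, row A3 becomes "apple; orange; grapes", matching the requested output. If the spacing around ";" varies in real data, splitting on a regular expression such as ";\\s*" (without `fixed = TRUE`) may be a safer choice.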

tidyverse:

library(purrr)
library(dplyr)
library(stringr)
data %>% 
  # map_chr() keeps annotation a character column (map() would create a list column)
  mutate(annotation = map_chr(annotation, ~ str_flatten(str_unique(str_split_1(.x, "; ")), "; ")))
Maël