
I have a data frame that looks like this:

1 TSS1500
2 Body;TSS1500
3 Body;Body;Body

I want to remove duplicate entries within each row so that it looks like this:

1 TSS1500
2 Body;TSS1500
3 Body

Thanks a lot for the help.

MrFlick
Goku
  • Including a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in your question will increase your chances of getting an answer. – Samuel Nov 22 '17 at 21:26
  • Reproducible example would help, however, look into this one and remove the NAs: https://stackoverflow.com/questions/42142260/assign-nas-to-duplicates-in-each-row-after-first-occurence – yrx1702 Nov 22 '17 at 21:27
  • x<-structure(list(V1 = c("TSS1500", "Body;TSS1500", "Body;Body;Body" ), New = c("TSS1500", "Body;TSS1500", "Body")), .Names = c("V1", "New"), row.names = c(NA, -3L), class = "data.frame"); x[["New"]]<-sapply(lapply(strsplit(x$V1,split=";"), function(x) unique(x)), paste,collapse=";") – JeanVuda Nov 22 '17 at 21:40
  • Thanks a lot I was able to separate using df = cSplit(data, "col1", sep = ";"). And then df = t(apply(df, 1, FUN = function(x) replace(x, duplicated(x), NA))) – Goku Nov 22 '17 at 21:42

1 Answer


It looks like you have a data frame with only two columns. The first column is an ID from 1 to 3, while the second column contains strings. Each string contains ";" to separate words and you want to remove duplicated words.

I believe the cSplit you proposed in your comment, which is probably from the splitstackshape package, is a good approach. After you split the words by ";", you will be able to use the solution from Mr. Zen's post.

Here I provide another approach using dplyr and tidyr. The idea is to use separate_rows to "expand" each record into one row per word by splitting on ";". After that, group_by and distinct remove the duplicates within each ID. The final step, summarise with paste0 and collapse = ";", converts the data frame back to the original format.

### Create example data frame
dt <- read.table(text = "1 TSS1500
2 'Body;TSS1500'
                 3 'Body;Body;Body'",
                 header = FALSE, stringsAsFactors = FALSE)
dt

#   V1             V2
# 1  1        TSS1500
# 2  2   Body;TSS1500
# 3  3 Body;Body;Body

# Load packages
library(dplyr)
library(tidyr)

dt2 <- dt %>%
  # Expand each record by splitting on ";"
  separate_rows(V2, sep = ";") %>%
  # Grouping based on V1
  group_by(V1) %>%
  # Remove duplicated rows
  distinct(V2) %>%
  # Combine rows from the same group together
  summarise(V2 = paste0(V2, collapse = ";"))
dt2
# # A tibble: 3 x 2
#      V1           V2
#   <int>        <chr>
# 1     1      TSS1500
# 2     2 Body;TSS1500
# 3     3         Body  
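
For comparison, the same split, de-duplicate, and re-join idea can be sketched in base R without extra packages, along the lines of the strsplit/unique/paste one-liner in the comments. This is only a sketch: the helper name dedupe_tokens is hypothetical, and the two-column layout (V1, V2) is the same assumption made above about the asker's data.

```r
# Recreate the example data (assumed layout: ID in V1, ";"-separated words in V2)
dt <- data.frame(
  V1 = 1:3,
  V2 = c("TSS1500", "Body;TSS1500", "Body;Body;Body"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each string on `sep`, keep the first occurrence
# of every word, and paste the remaining words back together
dedupe_tokens <- function(x, sep = ";") {
  parts <- strsplit(x, split = sep, fixed = TRUE)
  vapply(parts, function(p) paste(unique(p), collapse = sep), character(1))
}

dt$V2 <- dedupe_tokens(dt$V2)
dt
#   V1           V2
# 1  1      TSS1500
# 2  2 Body;TSS1500
# 3  3         Body
```

Because unique keeps the first occurrence of each word, the original word order within a row is preserved.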

Next time you ask a question, please consider using dput or another way to share a minimal, reproducible example. This makes it easier for others to assist you. From the example you posted, it is not entirely clear whether it is a data frame or how many columns it has. Recreating your dataset also takes extra time, and there is no guarantee that the recreated dataset is the same as yours.
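
As a rough illustration of the dput suggestion, the sketch below shows how a small data frame can be turned into R code that others can paste and run. The data frame here is the recreated example from this answer, not the asker's actual data, and the name rebuilt is only for illustration.

```r
# Example data frame (a recreation, assumed to match the question's layout)
dt <- data.frame(
  V1 = 1:3,
  V2 = c("TSS1500", "Body;TSS1500", "Body;Body;Body"),
  stringsAsFactors = FALSE
)

# dput() prints an R expression that rebuilds the object;
# paste that printed text into the question
dput(dt)

# Round trip: evaluating the printed text recreates the data frame
rebuilt <- eval(parse(text = capture.output(dput(dt))))
all.equal(dt, rebuilt)
```

For a large dataset, something like dput(head(dt, 10)) keeps the posted example small.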

www