I have a data frame that looks like this
1 TSS1500
2 Body;TSS1500
3 Body;Body;Body
I want to remove duplicate entries from every row so that it looks like this
1 TSS1500
2 Body;TSS1500
3 Body
Thanks a lot for the help.
It looks like you have a data frame with only two columns. The first column is an ID from 1 to 3, while the second column contains strings. Each string contains ";" to separate words and you want to remove duplicated words.
I believe the cSplit you proposed in your comment, which is probably from the splitstackshape package, is a good approach. After you split the words by ";", you will be able to use the solution from Mr. Zen's post.
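The split-then-deduplicate idea can also be sketched in base R, without any packages. This is not cSplit and not the solution from the other post, only a minimal stand-in that assumes the column names V1 and V2 and a hypothetical helper called dedup:

```r
# Example data recreated from the question (column names V1/V2 are assumed)
dt <- data.frame(V1 = 1:3,
                 V2 = c("TSS1500", "Body;TSS1500", "Body;Body;Body"),
                 stringsAsFactors = FALSE)

# dedup() is a hypothetical helper: split each string on the separator,
# keep each word once (in order of first appearance), and paste it back
dedup <- function(x, sep = ";") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts, function(p) paste(unique(p), collapse = sep), character(1))
}

dt$V2 <- dedup(dt$V2)
dt
```

Because unique keeps the first occurrence of each word, the order of the remaining words in each row is preserved.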
Here I provide another approach using dplyr and tidyr. The idea is to use separate_rows to "expand" each record by splitting on the ;. After that, group_by and distinct can remove the duplicates within each ID. The final step, summarise with paste0 and collapse = ";", converts the data frame back to the original format.
### Create example data frame
dt <- read.table(text = "1 TSS1500
2 'Body;TSS1500'
3 'Body;Body;Body'",
header = FALSE, stringsAsFactors = FALSE)
dt
# V1 V2
# 1 1 TSS1500
# 2 2 Body;TSS1500
# 3 3 Body;Body;Body
# Load packages
library(dplyr)
library(tidyr)
dt2 <- dt %>%
  # Expand each record by splitting on ";"
  separate_rows(V2, sep = ";") %>%
  # Group by V1
  group_by(V1) %>%
  # Remove duplicated rows within each group
  distinct(V2) %>%
  # Combine rows from the same group back together
  summarise(V2 = paste0(V2, collapse = ";"))
dt2
# # A tibble: 3 x 2
# V1 V2
# <int> <chr>
# 1 1 TSS1500
# 2 2 Body;TSS1500
# 3 3 Body
Next time, if you want to ask a new question, please consider using dput or another way to share a minimal and reproducible example. This allows others to assist you better. From the example you posted alone, it is not entirely clear whether it is a data frame or how many columns it has. It also takes extra time to recreate your dataset, and there is no guarantee that the recreated dataset is the same as yours.
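As a rough illustration of the dput suggestion, assuming the data frame is called dt and has columns V1 and V2 (both names are guesses based on the example above):

```r
# Assumed reconstruction of the data from the question
dt <- data.frame(V1 = 1:3,
                 V2 = c("TSS1500", "Body;TSS1500", "Body;Body;Body"),
                 stringsAsFactors = FALSE)

# dput() prints a structure(...) call that rebuilds the object exactly;
# this printed text is what you would paste into the question
dput(dt)

# Anyone can turn that text back into the same object
txt <- capture.output(dput(dt))
dt_copy <- eval(parse(text = txt))
identical(dt, dt_copy)
```

For a large dataset, dput(head(dt)) keeps the pasted text short while still showing the column types.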