Removing duplicate words in a string in R

Question

This is posted to help someone who voluntarily removed their own question after being asked, among other comments, to show the code they had tried. Let's assume they tried something like this:

str <- "How do I best try and try and try and find a way to to improve this code?"
d <- unlist(strsplit(str, split=" "))
paste(d[-which(duplicated(d))], collapse = ' ')

and wanted to learn a better way. So what is the best way to remove a duplicate word from the string?

That seems like a good solution, although you may want to `gsub` out the punctuation, otherwise e.g. "code?" in the example sentence would not be marked a duplicate of an earlier standalone "code". — Thomas, Apr 23 '14 at 14:38

score 12 · Answer 1 · answered Oct 15 '14 at 15:35

If you are still interested in alternative solutions, you can use unique, which slightly simplifies your code.

paste(unique(d), collapse = ' ')
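
One practical reason to prefer unique (or d[!duplicated(d)]) over the negative which() index in the question: when a string contains no repeated words, which() returns integer(0), and indexing with -integer(0) selects nothing at all. A minimal sketch, using a made-up input called no_dups that is not part of the original question:

```r
# Hypothetical input with no repeated words (an assumption for illustration).
no_dups <- "every word here is different"
d2 <- unlist(strsplit(no_dups, split = " "))

# which() returns integer(0) here, and d2[-integer(0)] selects zero elements,
# so the whole sentence is lost.
broken <- paste(d2[-which(duplicated(d2))], collapse = " ")

# Logical negation and unique() both keep every word.
by_logical <- paste(d2[!duplicated(d2)], collapse = " ")
by_unique  <- paste(unique(d2), collapse = " ")

print(broken)     # ""
print(by_unique)  # "every word here is different"
```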

As per the comment by Thomas, you probably do want to remove punctuation. R's regular expressions support POSIX character classes such as [[:punct:]], which you can pass to gsub instead of listing every punctuation character yourself. Of course, you can always write a more specific pattern if you need finer control.

d <- gsub("[[:punct:]]", "", d)
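
Note that the punctuation has to be stripped before unique is called, otherwise "code?" and "code" still count as different words. A minimal sketch putting both steps together on the question's example string (the nzchar filter is an added assumption, to drop tokens that consisted only of punctuation):

```r
str <- "How do I best try and try and try and find a way to to improve this code?"

# Split on spaces, then strip punctuation from every token.
d <- unlist(strsplit(str, split = " "))
d <- gsub("[[:punct:]]", "", d)

# Drop tokens that became empty (e.g. a standalone "-"), then deduplicate.
d <- d[nzchar(d)]
result <- paste(unique(d), collapse = " ")

print(result)  # "How do I best try and find a way to improve this code"
```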

score 8 · Answer 2 · answered Dec 22 '16 at 09:44

There is no need for an additional package.

str <- c("How do I best try and try and try and find a way to to improve this code?",
         "And and here's a second one one and not a third One.")

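Function for a single string:

rem_dup.one <- function(x){
  # split on spaces and punctuation, but keep apostrophes (e.g. "here's")
  words <- unlist(strsplit(x, split = "(?!')[ [:punct:]]", perl = TRUE))
  paste(unique(tolower(trimws(words))), collapse = " ")
}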
rem_dup.one("And and here's a second one one and not a third One.")

Vectorized version:

rem_dup.vector <- Vectorize(rem_dup.one, USE.NAMES = FALSE)
rem_dup.vector(str)

Result:

"how do i best try and find a way to improve this code" "and here's a second one not third"

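Since strsplit already accepts a character vector and returns one element per input string, the same idea can also be written without Vectorize. This is only a sketch of an alternative, not the answer's own code; the name rem_dup_vapply and the nzchar filter are assumptions added for illustration:

```r
str <- c("How do I best try and try and try and find a way to to improve this code?",
         "And and here's a second one one and not a third One.")

rem_dup_vapply <- function(x) {
  # strsplit is vectorized: it returns a list with one word vector per string.
  words <- strsplit(tolower(x), split = "(?!')[ [:punct:]]", perl = TRUE)
  # Deduplicate each word vector, skipping empty tokens left by adjacent delimiters.
  vapply(words,
         function(w) paste(unique(w[nzchar(w)]), collapse = " "),
         character(1))
}

res <- rem_dup_vapply(str)
print(res)
```

vapply with character(1) guarantees one string per input element, which makes the return type predictable even for a zero-length input.
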
score 5 · Answer 3 · edited Sep 15 '21 at 16:18

To remove duplicate words while leaving special characters in place, use this function:

rem_dup_word <- function(x){
  x <- tolower(x)
  paste(unique(trimws(unlist(strsplit(x, split = " ", fixed = FALSE, perl = TRUE)))), collapse = " ")
}

Input data:

duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)"

rem_dup_word(duptest)

output: samsung wa80e5lec top loading with diamond drum, 6 kg (silver)

Because the input is lowercased first, it will treat "Samsung" and "SAMSUNG" as duplicates. Punctuation is left attached to the words.
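
If the original capitalization should survive, one option is to compare the lowercased words but keep the first occurrence as written. A minimal sketch under that assumption; rem_dup_keep_case is a hypothetical name, not part of the answer above:

```r
rem_dup_keep_case <- function(x) {
  words <- strsplit(x, split = " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]  # ignore empty tokens from repeated spaces
  # duplicated() is evaluated on the lowercased copy, but the subset is taken
  # from the original words, so the first spelling of each word is preserved.
  paste(words[!duplicated(tolower(words))], collapse = " ")
}

duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)"
result <- rem_dup_keep_case(duptest)
print(result)  # "Samsung WA80E5LEC Top Loading with Diamond Drum, 6 kg (Silver)"
```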

score 2 · Answer 4 · edited Apr 09 '16 at 20:49

I'm not sure whether string case is a concern. This solution uses qdap with the add-on qdapRegex package so that punctuation and sentence-initial capitalization don't interfere with the removal but are still maintained:

str <- c("How do I best try and try and try and find a way to to improve this code?",
    "And and here's a second one one and not a third One.")

library(qdap)
library(dplyr) # provides the pipe operator (%>%)

str %>% 
    tolower() %>%
    word_split() %>% 
    sapply(., function(x) unbag(unique(x))) %>% 
    rm_white_endmark() %>%  
    rm_default(pattern="(^[a-z]{1})", replacement = "\\U\\1") %>%
    unname()

## [1] "How do i best try and find a way to improve this code?"
## [2] "And here's a second one not third."
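
For readers who cannot install qdap, the two example outputs above can be approximated in base R. This is only a rough sketch and does not reproduce qdap's actual behavior in general: dedup_keep_endmark is a hypothetical helper, it only handles a single trailing ".", "?" or "!", and it leaves any internal punctuation attached to the words.

```r
str <- c("How do I best try and try and try and find a way to to improve this code?",
         "And and here's a second one one and not a third One.")

dedup_keep_endmark <- function(x) {
  vapply(x, function(s) {
    # Set aside a trailing end mark (if any) so it can be re-attached later.
    endmark <- regmatches(s, regexpr("[.?!]$", s))
    body <- sub("[.?!]$", "", tolower(s))
    words <- unique(strsplit(body, " +")[[1]])
    out <- paste0(paste(words, collapse = " "), endmark)
    # With perl = TRUE, \\U upper-cases the captured first letter.
    sub("^([a-z])", "\\U\\1", out, perl = TRUE)
  }, character(1), USE.NAMES = FALSE)
}

res <- dedup_keep_endmark(str)
print(res)
```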

answered Oct 18 '14 at 01:32 by Tyler Rinker · edited Apr 09 '16 at 20:49 by Aadhya Manu Anand

What is the word_split() function? When I tried to run this code, it threw an error on word_split(). – Aadhya Manu Anand Apr 09 '16 at 20:40
Did you install the **qdap** package? – Tyler Rinker Apr 09 '16 at 20:48
Ahh, I got it. I didn't notice that I have a problem with the rJava package, which is a dependency of qdap. Thanks @Rinker. – Aadhya Manu Anand Apr 09 '16 at 22:40
