This is to help someone who voluntarily removed their question after being asked for the code they had tried, among other comments. Let's assume they tried something like this:

str <- "How do I best try and try and try and find a way to to improve this code?"
d <- unlist(strsplit(str, split=" "))
paste(d[-which(duplicated(d))], collapse = ' ')

and wanted to learn a better way. So what is the best way to remove duplicate words from a string?

Greg
andrekos
  • That seems like a good solution, although you may want to `gsub` out the punctuation, otherwise e.g. "code?" in the example sentence would not be marked a duplicate of an earlier standalone "code". – Thomas Apr 23 '14 at 14:38

4 Answers

If you are still interested in alternative solutions, you can use `unique`, which slightly simplifies your code.

paste(unique(d), collapse = ' ')
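One more reason to prefer `unique` (or `!duplicated`) over negative indexing with `which`: the `-which(...)` form silently returns an empty vector when the input has no duplicates at all. A minimal sketch with made-up inputs (`with_dups` and `no_dups` are names chosen only for this illustration):

```r
# Hypothetical inputs for illustration
with_dups <- c("to", "to", "improve")
no_dups   <- c("improve", "this", "code")

# Negative indexing via which() works when duplicates exist ...
with_dups[-which(duplicated(with_dups))]   # "to" "improve"

# ... but which() returns integer(0) when there are none,
# and x[-integer(0)] selects nothing at all
no_dups[-which(duplicated(no_dups))]       # character(0)

# unique() and logical indexing handle both cases
unique(no_dups)                            # "improve" "this" "code"
no_dups[!duplicated(no_dups)]              # "improve" "this" "code"
```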

As per the comment by Thomas, you probably do want to remove punctuation. R's regular expressions support POSIX character classes such as `[[:punct:]]`, so you do not have to list every punctuation character yourself. Of course, you can always spell out specific characters if you need a more refined pattern. Strip the punctuation before calling `unique`:

d <- gsub("[[:punct:]]", "", d)
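Putting the two steps together might look like the sketch below. It assumes a single input string, and `remove_dup_words` is a hypothetical name used only here. Punctuation is stripped before `unique` is applied, so that "code?" and "code" would compare equal.

```r
# remove_dup_words is a hypothetical helper name for this sketch.
# It assumes `s` is a single string (a character vector of length 1).
remove_dup_words <- function(s) {
  words <- unlist(strsplit(s, split = " "))
  words <- gsub("[[:punct:]]", "", words)  # "code?" becomes "code"
  words <- words[nzchar(words)]            # drop tokens that were only punctuation
  paste(unique(words), collapse = " ")
}

remove_dup_words("How do I best try and try and try and find a way to to improve this code?")
# "How do I best try and find a way to improve this code"
```

Note that `[[:punct:]]` also removes apostrophes, so "here's" would become "heres" with this pattern.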
cdeterman

No additional packages are needed.

str <- c("How do I best try and try and try and find a way to to improve this code?",
         "And and here's a second one one and not a third One.")

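Function for a single string:

rem_dup.one <- function(x) {
  # Split on spaces and punctuation, except apostrophes (so "here's" stays whole)
  words <- unlist(strsplit(x, split = "(?!')[ [:punct:]]", perl = TRUE))
  # Lowercase so that "One" and "one" count as the same word
  paste(unique(tolower(trimws(words))), collapse = " ")
}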
rem_dup.one("And and here's a second one one and not a third One.")

Vectorized version:

rem_dup.vector <- Vectorize(rem_dup.one, USE.NAMES = FALSE)
rem_dup.vector(str)

Result:

"how do i best try and find a way to improve this code" "and here's a second one not third" 
Edvardoss

To remove duplicate words while leaving special characters untouched, use this function:

rem_dup_word <- function(x) {
  x <- tolower(x)
  paste(unique(trimws(unlist(strsplit(x, split = " ", fixed = FALSE, perl = TRUE)))), collapse = " ")
}

Input data:

duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)"

rem_dup_word(duptest)

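Output: samsung wa80e5lec top loading with diamond drum, 6 kg (silver)

Because the input is lowercased first, "Samsung" and "samsung" are treated as duplicates. Punctuation such as the comma and the parentheses is left in place.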
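If the original capitalization should survive, one option is to compare words case-insensitively but keep the first spelling that was seen. This is only a sketch under that assumption; `rem_dup_keep_case` is a hypothetical name, not part of the answer above.

```r
# rem_dup_keep_case is a hypothetical helper name for this sketch.
# Duplicates are detected on the lowercased words, but the first
# occurrence is returned with its original case.
rem_dup_keep_case <- function(x) {
  words <- unlist(strsplit(x, split = " ", fixed = TRUE))
  words <- words[nzchar(words)]  # drop empty tokens from repeated spaces
  paste(words[!duplicated(tolower(words))], collapse = " ")
}

rem_dup_keep_case("Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)")
# "Samsung WA80E5LEC Top Loading with Diamond Drum, 6 kg (Silver)"
```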

camille

I'm not sure if string case is a concern. This solution uses qdap with the add-on qdapRegex package to make sure that punctuation and the capitalization at the start of the string don't interfere with the removal but are still preserved in the result:

str <- c("How do I best try and try and try and find a way to to improve this code?",
    "And and here's a second one one and not a third One.")

library(qdap)
library(dplyr) # provides the pipe operator (%>%)

str %>% 
    tolower() %>%
    word_split() %>% 
    sapply(., function(x) unbag(unique(x))) %>% 
    rm_white_endmark() %>%  
    rm_default(pattern="(^[a-z]{1})", replacement = "\\U\\1") %>%
    unname()

## [1] "How do i best try and find a way to improve this code?"
## [2] "And here's a second one not third."
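For readers without qdap, the final re-capitalization step can be approximated in base R. The sketch below assumes the strings have already been lowercased and de-duplicated (the inputs are copied from the output above in lowercase), and it relies on `sub` with `perl = TRUE`, which supports `\\U` in the replacement.

```r
# Inputs assumed to be already lowercased and de-duplicated
x <- c("how do i best try and find a way to improve this code?",
       "and here's a second one not third.")

# Uppercase the first letter of each string; \\U requires perl = TRUE
sub("^([a-z])", "\\U\\1", x, perl = TRUE)
# "How do i best try and find a way to improve this code?"
# "And here's a second one not third."
```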
Tyler Rinker