8

I have data that looks like this:

vector = c("hello I like to code hello","Coding is fun", "fun fun fun")

I want to remove duplicate words (space-delimited), i.e. the output should look like:

vector_cleaned

[1] "hello I like to code"
[2] "Coding is fun"
[3] "fun"
lmo
shecode

3 Answers

16

Split it up (strsplit on spaces), use unique (in lapply), and paste it back together:

vapply(lapply(strsplit(vector, " "), unique), paste, character(1L), collapse = " ")
# [1] "hello I like to code" "Coding is fun"        "fun"

## OR
vapply(strsplit(vector, " "), function(x) paste(unique(x), collapse = " "), character(1L))
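
To make the one-liner easier to follow, here is a minimal sketch that runs the same three steps one at a time. The intermediate names (words, deduped) are only for illustration and are not part of the original answer.

```r
vector <- c("hello I like to code hello", "Coding is fun", "fun fun fun")

# Step 1: split each string on single spaces; the result is a list of character vectors
words <- strsplit(vector, " ")

# Step 2: within each element, keep only the first occurrence of every word
deduped <- lapply(words, unique)

# Step 3: paste each element back into a single string
vector_cleaned <- vapply(deduped, paste, character(1L), collapse = " ")
vector_cleaned
# returns c("hello I like to code", "Coding is fun", "fun")
```

Note that unique() compares strings exactly, so "Coding" and "coding" would be treated as different words.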

Update based on comments

You can always write a custom function to use with vapply. For instance, here's a function that takes a split string, keeps only the words that are longer than minLen characters (so the default of 3 drops words of three characters or fewer), and makes the "unique" step a user choice.

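myFun <- function(x, minLen = 3, onlyUnique = TRUE) {
  # x is the character vector of words from one split string
  a <- if (isTRUE(onlyUnique)) unique(x) else x
  # keep only words with more than minLen characters, then rejoin them
  paste(a[nchar(a) > minLen], collapse = " ")
}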

Compare the output of the following to see how it would work.

vapply(strsplit(vector, " "), myFun, character(1L))
vapply(strsplit(vector, " "), myFun, character(1L), onlyUnique = FALSE)
vapply(strsplit(vector, " "), myFun, character(1L), minLen = 0)
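
For reference, here is a sketch of what those three calls return on the example data from the question. The result names (res_default and so on) are made up for this illustration.

```r
vector <- c("hello I like to code hello", "Coding is fun", "fun fun fun")

myFun <- function(x, minLen = 3, onlyUnique = TRUE) {
  a <- if (isTRUE(onlyUnique)) unique(x) else x
  paste(a[nchar(a) > minLen], collapse = " ")
}

words <- strsplit(vector, " ")

# Defaults: duplicates removed, words of 3 characters or fewer dropped
res_default <- vapply(words, myFun, character(1L))
# returns c("hello like code", "Coding", "")

# Duplicates kept, short words still dropped
res_dupes <- vapply(words, myFun, character(1L), onlyUnique = FALSE)
# returns c("hello like code hello", "Coding", "")

# No length filter (every non-empty word has more than 0 characters)
res_nofilter <- vapply(words, myFun, character(1L), minLen = 0)
# returns c("hello I like to code", "Coding is fun", "fun")
```

The third string comes back empty with the default settings because "fun" has exactly three characters, and the filter keeps only words strictly longer than minLen.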
A5C1D2H2I1M1N2O1R2T1
  • Can I apply this same technique to remove any words in the split string that have fewer than 3 characters? – shecode Jan 19 '15 at 22:47
  • @shecode, the approach would be similar, but you would have to add one more requirement based on the result of `nchar` (which would count the number of characters in the string). On my phone right now, so I can't show the code, but I'll try to update later. Ideally, if I do so, the question should also be updated. – A5C1D2H2I1M1N2O1R2T1 Jan 20 '15 at 02:57
  • Thank you. I figured out how to do it based on the structure of your answer. Very useful. – shecode Jan 20 '15 at 21:15
2

I spent a while looking for a data frame, tidyverse-friendly version of this, so figured I'd paste my verbose solution:

library(tidyverse)

df <- data.frame(vector = c("hello I like to code hello",
                            "Coding is fun", 
                            "fun fun fun"))

df %>% 
  mutate(split = str_split(vector, " ")) %>% # split
  mutate(split = map(split, unique)) %>% # drop duplicates
  mutate(split = map_chr(split, ~ paste(.x, collapse = " "))) # recombine

Result:

#>                       vector                split
#> 1 hello I like to code hello hello I like to code
#> 2              Coding is fun        Coding is fun
#> 3                fun fun fun                  fun

Created on 2021-01-03 by the reprex package (v0.3.0)
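
The three mutate steps can also be folded into one call. This is only a sketch of the same split, unique, paste idea; the column name cleaned is an assumption chosen for the example, and it loads dplyr, stringr and purrr individually rather than the whole tidyverse.

```r
library(dplyr)
library(stringr)
library(purrr)

df <- data.frame(vector = c("hello I like to code hello",
                            "Coding is fun",
                            "fun fun fun"))

# str_split() returns one character vector per row;
# map_chr() dedupes each one and pastes it back into a single string
out <- df %>%
  mutate(cleaned = map_chr(str_split(vector, " "),
                           ~ paste(unique(.x), collapse = " ")))
out$cleaned
# returns c("hello I like to code", "Coding is fun", "fun")
```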

Kene David Nwosu
1

Using tidyverse

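library(dplyr)
library(stringr)
library(tidyr)
# needs dplyr >= 1.1.0 (reframe, .by) and tidyr >= 1.3.0 (separate_longer_delim)

# same example data as in the question
df <- data.frame(vector = c("hello I like to code hello",
                            "Coding is fun",
                            "fun fun fun"))

df %>%
  mutate(rn = row_number()) %>% # remember which row each word came from
  separate_longer_delim(vector, delim = regex("\\s+")) %>% # one word per row
  distinct() %>% # drop repeated words within a row
  reframe(vector = str_c(vector, collapse = " "), .by = c("rn")) %>% # recombine
  select(-rn)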

-output

                vector
1 hello I like to code
2        Coding is fun
3                  fun
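
Splitting on the regex "\\s+" rather than a single space matters when the input contains runs of whitespace. Below is a small base R sketch of the difference; the input messy and the helper dedupe_ws are hypothetical names invented for this example.

```r
# 'messy' is a made-up input with a double space between the first two words
messy <- "fun  code fun"

# Splitting on a single space leaves an empty string where the spaces doubled up
split_space <- strsplit(messy, " ")[[1]]
# returns c("fun", "", "code", "fun")

# Splitting on runs of whitespace avoids the empty element
split_ws <- strsplit(messy, "\\s+")[[1]]
# returns c("fun", "code", "fun")

# Hypothetical helper: dedupe words after splitting on runs of whitespace
dedupe_ws <- function(x) {
  vapply(strsplit(x, "\\s+"),
         function(w) paste(unique(w), collapse = " "),
         character(1L))
}
dedupe_ws(messy)
# returns "fun code"
```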
akrun