1

Let's say my data looks like this:

vector = c("Happiness with KK Happiness without KK", "I love some coding I love major coding", "fun 2 fun 3")

I want to remove ALL duplicate words, including the first instance of each duplicate word. So, my output would look like this:

[1] "with without"
[2] "some major"
[3] "2 3"

Basically, it's similar to this problem: How do keep only unique words within each string in a vector. However, I don't want to keep even the first instance of a duplicated word.

I tried to use strsplit() along " " and duplicated() to split each string into its various words and then detect duplicates.

The issue with duplicated() is that its logical vector flags only the second and later instances of a duplicated word, never the first. Furthermore, strsplit() returns a list, which really complicates things, for example when I want to subset the duplicated words (usually something like df[duplicated(df)], which doesn't work on lists).
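To illustrate the behaviour described above, here is a minimal sketch using only the first string (the variable name `words` is just for illustration):

```r
# Split one string into words; strsplit() returns a list, so take element 1
words <- strsplit("Happiness with KK Happiness without KK", " ")[[1]]

# duplicated() marks only the repeats, not the first occurrence
dup <- duplicated(words)
dup
# FALSE FALSE FALSE TRUE FALSE TRUE

# Subsetting therefore drops the repeats but keeps the first copies
words[!dup]
# "Happiness" "with" "KK" "without"
```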

rivda
  • related: http://stackoverflow.com/questions/7854433/finding-all-duplicate-rows-including-elements-with-smaller-subscripts – dww Feb 15 '17 at 00:44

3 Answers

5

Use duplicated to check forwards and back taking advantage of fromLast=TRUE:

lapply(strsplit(vector, "\\s+"), function(x) 
  x[!(duplicated(x) | duplicated(x,fromLast=TRUE))]  
)
#[[1]]
#[1] "with"    "without"
#
#[[2]]
#[1] "some"  "major"
#
#[[3]]
#[1] "2" "3"
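If you want one string per element, as in the desired output of the question, the list can be collapsed back with paste(). This is a small sketch on top of the call above; the names `kept` and `result` are only for illustration:

```r
vector <- c("Happiness with KK Happiness without KK",
            "I love some coding I love major coding",
            "fun 2 fun 3")

# Keep only the words that are not duplicated in either direction
kept <- lapply(strsplit(vector, "\\s+"), function(x)
  x[!(duplicated(x) | duplicated(x, fromLast = TRUE))])

# Collapse each set of remaining words into a single string
result <- vapply(kept, paste, character(1), collapse = " ")
result
# a character vector: "with without", "some major", "2 3"
```

vapply() is used instead of sapply() so the result is always a character vector, even when a string has no unique words left (that element becomes "").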
thelatemail
  • And, to add (maybe can add as an edit?) wrap call to `lapply` with `data.frame()` to return a data frame – HFBrowning Feb 14 '17 at 22:55
  • @HFBrowning - not really, no. If the length of each list item is different then it can't just be wrapped in `data.frame` – thelatemail Feb 14 '17 at 22:55
  • Alternately `lapply(strsplit(vector, "\\s+"), function(x) x[ave(x, x, FUN = length)==1L])` – Frank Feb 14 '17 at 22:57
3

A text-mining approach with tidytext:

library(dplyr)
library(tidytext)

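tibble(vector = c("Happiness with KK Happiness without KK","I love some coding I love major coding", "fun 2 fun 3"),
       id = seq_along(vector)) %>% 
    unnest_tokens(word, vector) %>%   # one (lowercased) word per row
    count(id, word) %>%               # occurrences of each word within each string
    filter(n == 1) %>%                # keep words that occur exactly once
    group_by(id) %>%                  # group per string so summarise() returns one row each
    summarise(vector = paste(word, collapse = ' '))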

#> # A tibble: 3 × 2
#>      id       vector
#>   <int>        <chr>
#> 1     1 with without
#> 2     2   major some
#> 3     3          2 3

Probably overkill, honestly, but it depends on your larger context.

alistaire
1

You can also use the table function to get each word's frequency and keep only the words whose frequency is 1.

sapply(strsplit(vector, " "), function(x) { tab <- table(x); names(tab)[tab == 1] })
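Two things to keep in mind with this approach: table() sorts its names, so the surviving words may not come back in their original order, and sapply() simplifies to a matrix when every string leaves the same number of words (as happens with this example). A possible variation that keeps the original word order and always returns a list is sketched below; `keep_singletons` is a made-up helper name, not an existing function:

```r
vector <- c("Happiness with KK Happiness without KK",
            "I love some coding I love major coding",
            "fun 2 fun 3")

# Hypothetical helper: keep words whose frequency is exactly 1,
# in the order they appear in the input
keep_singletons <- function(x) {
  counts <- table(x)                 # frequency of each distinct word
  x[as.vector(counts[x]) == 1]       # look up each word's count by name
}

res <- lapply(strsplit(vector, " "), keep_singletons)
res[[2]]
# "some" "major"
```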
d.b