How do I keep only unique words within each string in a vector? I.e. remove ALL duplicates

Question

Let's say my data looks like this:

vector = c("Happiness with KK Happiness without KK", "I love some coding I love major coding", "fun 2 fun 3")

I want to remove ALL duplicate words, including the first instance of each duplicate word. So, my output would look like this:

[1] "with without"
[2] "some major"
[3] "2 3"

Basically, it's similar to this problem: How do I keep only unique words within each string in a vector. However, I don't want to keep even the first instance of a duplicated word.

I tried using strsplit() on " " to split each string into its words, and then duplicated() to detect the duplicates.

The issue with duplicated() is that it only flags the second and later occurrences of a word, not the first one. Furthermore, strsplit() returns a list, which complicates things, for example when I want to subset the duplicated words (usually something like df[duplicated(df)], which doesn't work on lists).
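A small illustration of the behavior described above, using only the first example string. The variable name `words` is introduced here purely for illustration.

```r
# Split the first example string into words
# (strsplit() returns a list, so take its first element)
words <- strsplit("Happiness with KK Happiness without KK", " ")[[1]]

# duplicated() flags only the second and later occurrences of each word
duplicated(words)
#> [1] FALSE FALSE FALSE  TRUE FALSE  TRUE

# so dropping the flagged elements still keeps the first copy of each duplicate
words[!duplicated(words)]
#> [1] "Happiness" "with"      "KK"        "without"
```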

related: http://stackoverflow.com/questions/7854433/finding-all-duplicate-rows-including-elements-with-smaller-subscripts — dww, Feb 15 '17 at 00:44

score 5 · Answer 1 · answered Feb 14 '17 at 22:50

Use duplicated() twice, scanning forwards and then backwards with fromLast=TRUE, so that every copy of a repeated word is flagged:

lapply(strsplit(vector, "\\s+"), function(x) 
  x[!(duplicated(x) | duplicated(x,fromLast=TRUE))]  
)
#[[1]]
#[1] "with"    "without"
#
#[[2]]
#[1] "some"  "major"
#
#[[3]]
#[1] "2" "3"
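
The result above is a list of character vectors, while the question asks for one string per element. A minimal sketch of one way to get there, assuming a single space is an acceptable separator. The helper name `keep_singletons` is hypothetical and not part of the original answer.

```r
vector <- c("Happiness with KK Happiness without KK",
            "I love some coding I love major coding",
            "fun 2 fun 3")

# Hypothetical helper: keep the words that occur exactly once in a character vector
keep_singletons <- function(x) {
  x[!(duplicated(x) | duplicated(x, fromLast = TRUE))]
}

# vapply() returns exactly one string per input element,
# giving "" when no word survives
vapply(strsplit(vector, "\\s+"),
       function(x) paste(keep_singletons(x), collapse = " "),
       character(1))
#> [1] "with without" "some major"   "2 3"
```

Using vapply() with character(1) rather than sapply() is a design choice here: it fails loudly if an element ever produces something other than a single string.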

thelatemail

And, to add (maybe can add as an edit?) wrap call to `lapply` with `data.frame()` to return a data frame – HFBrowning Feb 14 '17 at 22:55
@HFBrowning - not really, no. If the length of each list item is different then it can't just be wrapped in `data.frame` – thelatemail Feb 14 '17 at 22:55
Alternately `lapply(strsplit(vector, "\\s+"), function(x) x[ave(x, x, FUN = length)==1L])` – Frank Feb 14 '17 at 22:57

alistaire · Answer 2 · 2017-02-14T22:59:39.677

A text-mining approach with tidytext:

library(dplyr)
library(tidytext)

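tibble(vector = c("Happiness with KK Happiness without KK", "I love some coding I love major coding", "fun 2 fun 3"),
       id = seq_along(vector)) %>% 
    unnest_tokens(word, vector) %>%    # one lowercased word per row
    count(id, word) %>% 
    filter(n == 1) %>%                 # keep words that occur once within an id
    group_by(id) %>%                   # regroup: current dplyr's count() returns ungrouped output
    summarise(vector = paste(word, collapse = ' '))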

#> # A tibble: 3 × 2
#>      id       vector
#>   <int>        <chr>
#> 1     1 with without
#> 2     2   major some
#> 3     3          2 3

Probably overkill, honestly, but it depends on your larger context.

d.b · Answer 3 · 2017-02-14T23:32:04.123

1

You can also use the table() function to get the frequency of each word and keep only the words whose frequency is 1.

sapply(strsplit(vector, " "), function(x) names(table(x))[table(x) == 1])
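
Two things to be aware of with the table() approach: table() returns its names in sorted order, so the words can come back in a different order than they appear in the string, and sapply() may simplify the result to a matrix when every element yields the same number of words. A sketch of a variant that keeps the original word order and returns one string per element. The names `singletons_in_order` and `counts` are hypothetical.

```r
vector <- c("Happiness with KK Happiness without KK",
            "I love some coding I love major coding",
            "fun 2 fun 3")

# Hypothetical helper: count words with table(), then filter x itself
# so the surviving words stay in their original order
singletons_in_order <- function(x) {
  counts <- table(x)
  x[x %in% names(counts)[counts == 1]]
}

vapply(strsplit(vector, " "),
       function(x) paste(singletons_in_order(x), collapse = " "),
       character(1))
#> [1] "with without" "some major"   "2 3"
```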

edited Feb 14 '17 at 23:32

answered Feb 14 '17 at 23:11

d.b
