Removing Duplicate Words Among Multiple Strings Across a Single Vector in R

Question

I have a vector made up of multiple strings (in R):

vec <- c("the cat the cat ran up the tree tree", "the dog ran up the up the tree", 
         "the squirrel squirrel ran up the tree")

I need to remove the duplicate words from each separate string.

Desired output:

"the cat ran up the tree"
"the dog ran up the tree"
"the squirrel ran up the tree"

I've tried the solution under "Removing duplicate words in a string in R". However, this only merges my multiple strings into a single combined string.

I would like to upvote this (it is a helpful answer that adds to the knowledge base), but I am unable to. – Englishman Bob Feb 25 '21 at 18:45

score 1 · Accepted Answer · answered Feb 25 '21 at 17:55

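We can use gsub with a backreference: the pattern captures either a two-word phrase or a single word, then matches one or more immediate repeats of that capture (\\1+), and the replacement keeps a single copy.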

gsub("((\\w+\\s+\\w+\\s?)|(\\w+\\s+))\\1+", "\\1", vec)
#[1] "the cat ran up the tree"
#[2] "the dog ran up the tree"
#[3] "the squirrel ran up the tree"
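
The pattern above has no word-boundary anchors, so in principle it can begin or end a match in the middle of a word. As a hedged variant (not part of the original answer), the sketch below assumes the PCRE engine via perl = TRUE and adds \\b anchors. The helper name dedupe_adjacent is hypothetical, and like the original it only collapses immediately repeated one- or two-word phrases.

```r
vec <- c("the cat the cat ran up the tree tree",
         "the dog ran up the up the tree",
         "the squirrel squirrel ran up the tree")

# Hypothetical helper: collapse immediately repeated one- or two-word phrases.
dedupe_adjacent <- function(x) {
  # \\b(\\w+(?:\\s+\\w+)?)  capture one word, optionally followed by a second word
  # (?:\\s+\\1\\b)+         one or more immediate repeats of that capture,
  #                         each ending on a word boundary
  gsub("\\b(\\w+(?:\\s+\\w+)?)(?:\\s+\\1\\b)+", "\\1", x, perl = TRUE)
}

cleaned <- dedupe_adjacent(vec)
print(cleaned)
```

On the example vector this should return the three desired strings. Because gsub is vectorised over its input, each element is processed separately and the strings are never merged.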

akrun

Can you recommend a regex tutorial that covers these rules? – Englishman Bob Feb 25 '21 at 18:02

@EnglishmanBob Perhaps [this](https://datascience.stackexchange.com/questions/34039/regex-to-remove-repeating-words-in-a-sentence) would give some similar syntax explanation – akrun Feb 25 '21 at 18:16
