I have a vector made up of multiple strings (in R):

vec <- c("the cat the cat ran up the tree tree", "the dog ran up the up the tree", 
         "the squirrel squirrel ran up the tree")

I need to remove the repeated words from each string separately.

Desired output:

"the cat ran up the tree"
"the dog ran up the tree"
"the squirrel ran up the tree"

I've tried the solution from "Removing duplicate words in a string in R". However, it only combines my separate strings into a single long string.

Englishman Bob

1 Answer

We can use gsub with a backreference: capture either a two-word phrase or a single word, match its immediate repeats, and replace the whole run with one copy.

# Group 1 is a two-word phrase or a single word; \\1+ matches its immediate repeats
gsub("((\\w+\\s+\\w+\\s?)|(\\w+\\s+))\\1+", "\\1", vec)
#[1] "the cat ran up the tree"
#[2] "the dog ran up the tree"
#[3] "the squirrel ran up the tree"
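
A stricter variant, offered only as a sketch: the pattern above has no word boundaries, so as far as I can tell it can also match inside words (for example, the trailing "is " of "this " followed by "is "). Assuming a repeated phrase is at most two words long, anchoring the match with `\\b` and switching to `perl = TRUE` avoids that. The names `pattern` and `result` are only for illustration.

```r
vec <- c("the cat the cat ran up the tree tree",
         "the dog ran up the up the tree",
         "the squirrel squirrel ran up the tree")

# \\b                   start at a word boundary
# (\\w+(?:\\s+\\w+)?)   group 1: one word, optionally followed by a second word
# (?:\\s+\\1\\b)+       one or more immediate repeats of group 1,
#                       each ending on a word boundary
pattern <- "\\b(\\w+(?:\\s+\\w+)?)(?:\\s+\\1\\b)+"

# Keep a single copy of each repeated word or two-word phrase.
result <- gsub(pattern, "\\1", vec, perl = TRUE)
print(result)

# A sentence with no repeated whole word should come back unchanged.
print(gsub(pattern, "\\1", "this is good", perl = TRUE))
```

On the three strings from the question this should give the desired output. Repeated phrases of three or more words are not handled, because group 1 only allows one or two words.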
akrun
  • Can you recommend a regex tutorial that covers these rules? – Englishman Bob Feb 25 '21 at 18:02
  • @EnglishmanBob Perhaps [this](https://datascience.stackexchange.com/questions/34039/regex-to-remove-repeating-words-in-a-sentence) would give an explanation of similar syntax – akrun Feb 25 '21 at 18:16