
I have comma-separated strings in which the same word may occur several times:

df <- data.frame(
  words = c("if,go,if,to,go,and,if,go,don't,is,give,to,my,go",
            NA,
            "like,like,so,many,times,like,so,one,no,no,no,bathroom"))

I would like to reduce each string in `words` so that only the unique words remain. I've tried this regex, but the result it produces is far from what I need:

library(stringr)
str_extract_all(df$words, "(?<=\\s|^)(\\w+)(?=,|$)(?!\\1+)")
[[1]]
[1] "if"

[[2]]
[1] NA

[[3]]
[1] "like"

The result I need to get (preferably with a regex answer) is this:

[[1]]
[1] "if,go,to,and,don't,is,give,my"

[[2]]
[1] NA

[[3]]
[1] "like,so,many,times,one,no,bathroom"
Chris Ruehlemann

2 Answers


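Here is a base R solution using gsub. Note that it keeps the last occurrence of each word rather than the first, so the word order differs from the output requested in the question: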

df$words <- gsub("(?<![^,])(.*?),(?=.*\\1)", "", df$words, perl=TRUE)
df

                               words
1      and,if,don't,is,give,to,my,go
2                               <NA>
3 many,times,like,so,one,no,bathroom

Data:

df <- data.frame(words = c("if,go,if,to,go,and,if,go,don't,is,give,to,my,go",
                           NA,
                           "like,like,so,many,times,like,so,one,no,no,no,bathroom"))

Here is an explanation of the regex pattern:

(?<![^,])  assert that what precedes is either a comma or the start of the string
(.*?)      match AND capture a word, up until reaching
,          the nearest following comma
(?=.*\\1)  then assert that we can still find this same word later on
           in the string, indicating that what we just matched is a duplicate

Then we replace each such duplicate word, together with its trailing comma, with the empty string, which removes it from the input.
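
One caveat with the pattern as explained above: `(?=.*\\1)` only checks that the captured text occurs somewhere later in the string, so a word can be dropped merely because it is a substring of a later word. The following is a minimal sketch of a stricter variant, not the answer author's pattern; the names `loose_pattern`, `strict_pattern`, and `toy` are made up for illustration. Like the original, it keeps the last occurrence of each word.

```r
# Pattern from the answer above.
loose_pattern  <- "(?<![^,])(.*?),(?=.*\\1)"

# Stricter variant (an assumption for illustration, not the author's pattern):
#   [^,]+            keeps the capture inside a single word
#   (?:.*,)?\\1      the repeat must start right after a comma
#                    (or directly at the next word)
#   (?![^,])         and must end at a comma or the end of the string
strict_pattern <- "(?<![^,])([^,]+),(?=(?:.*,)?\\1(?![^,]))"

# Hypothetical toy input: "no" is a substring of "nothing".
toy <- c("no,nothing", "no,nothing,no", NA)

loose  <- gsub(loose_pattern,  "", toy, perl = TRUE)
strict <- gsub(strict_pattern, "", toy, perl = TRUE)

print(loose)   # "nothing"    "nothing,no" NA
print(strict)  # "no,nothing" "nothing,no" NA
```

In the first toy string the original pattern removes `no` even though it is not duplicated, while the stricter variant leaves the string unchanged. On the data from the question both patterns should give the same result, because none of those words is contained in another.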

Tim Biegeleisen
  • Order of words changed? – zx8754 Jun 16 '21 at 10:25
  • @zx8754 I guess the OP wants to _retain_ the earliest occurrence and remove subsequent duplicates. My regex solution can't do that, because it removes the earlier occurrences and keeps the last one. Hopefully this can still help the OP. – Tim Biegeleisen Jun 16 '21 at 10:27
  • 100% agreed, just wanted to highlight the output is different. – zx8754 Jun 16 '21 at 10:28
  • Brilliant. I'm not sure though I understand this: "`(?<![^,])` assert that what precedes is either a comma or the start of the string" - aren't we asserting that what precedes is **not (`!`) not (`[^,]`)** a comma? – Chris Ruehlemann Jun 16 '21 at 10:35
  • @ChrisRuehlemann Correct, and not-not-a-comma means a comma. But that negative lookbehind _also_ succeeds when nothing precedes, i.e. at the start of the string. This is an important edge case, because we might want to remove the very first word, should it have a duplicate further along in the input. – Tim Biegeleisen Jun 16 '21 at 10:42
  • So this `(?<=,|^)` would work too? – Chris Ruehlemann Jun 16 '21 at 10:45
  • Yes it would, but it's more verbose (and probably less performant) than `(?<![^,])`. – Tim Biegeleisen Jun 16 '21 at 10:46
lapply(strsplit(df$words, ",") , function(x) paste(unique(x), collapse = ","))

# [[1]]
# [1] "if,go,to,and,don't,is,give,my"
# 
# [[2]]
# [1] "NA"
# 
# [[3]]
# [1] "like,so,many,times,one,no,bathroom"
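
As the output above shows, the missing value comes back as the string "NA", and the result is a list rather than a column. If a character vector with a real `NA` is wanted, as in the question's expected result, a small variation might look like the sketch below. The helper name `dedupe_words` is hypothetical, and `as.character()` is added only as a guard in case `words` is a factor (the default in R versions before 4.0).

```r
df <- data.frame(
  words = c("if,go,if,to,go,and,if,go,don't,is,give,to,my,go",
            NA,
            "like,like,so,many,times,like,so,one,no,no,no,bathroom"))

# Hypothetical helper: split on commas, keep the first occurrence of each
# word, and glue the words back together; NA input stays NA.
dedupe_words <- function(x) {
  parts <- strsplit(as.character(x), ",", fixed = TRUE)
  vapply(parts, function(w) {
    # strsplit() returns a single NA for a missing input string
    if (length(w) == 1L && is.na(w)) return(NA_character_)
    paste(unique(w), collapse = ",")
  }, character(1))
}

df$words <- dedupe_words(df$words)
print(df$words)
# [1] "if,go,to,and,don't,is,give,my"
# [2] NA
# [3] "like,so,many,times,one,no,bathroom"
```

`unique()` keeps the first occurrence of each element, which is why this version preserves the word order asked for in the question.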
s_baldur
  • Please do not rush with answers. If this is the answer, then the question is a duplicate. I think the OP wants a regex solution. – zx8754 Jun 16 '21 at 10:23
  • Oh, if I misunderstood then I can delete the post. If it's a duplicate I guess there is no harm in answering even though it will be marked as duplicate? – s_baldur Jun 16 '21 at 10:28
  • Answering duplicates is discouraged. https://meta.stackexchange.com/q/10841/228487 – zx8754 Jun 16 '21 at 10:33
  • Agreed; at the time I just forgot to check whether a duplicate existed. – s_baldur Jun 16 '21 at 10:35