How to extract unique words/remove duplicated words from strings

Question

I have strings with multiple potential duplicated words:

df <- data.frame(
  words = c("if,go,if,to,go,and,if,go,don't,is,give,to,my,go",
            NA,
            "like,like,so,many,times,like,so,one,no,no,no,bathroom"))

I would like to reduce each string in words so that only its unique words remain. I've tried this regex, but the result is far from what I need:

library(stringr)
str_extract_all(df$words, "(?<=\\s|^)(\\w+)(?=,|$)(?!\\1+)")
[[1]]
[1] "if"

[[2]]
[1] NA

[[3]]
[1] "like"

The result I need to get (preferably with a regex answer) is this:

[[1]]
[1] "if,go,to,and,don't,is,give,my"

[[2]]
[1] NA

[[3]]
[1] "like,so,many,times,one,no,bathroom"

Possible duplicate https://stackoverflow.com/a/37260468/680068 — zx8754, Jun 16 '21 at 10:21
Regex is not a universal solution for every coding problem. Use it only when necessary. — Wiktor Stribiżew, Jun 16 '21 at 10:23
If regex is a requirement, I guess it is not a duplicate question. — s_baldur, Jun 16 '21 at 10:36
Out of curiosity & to better understand the problem, why does it need to be done with regex? Is there a performance reason or some other context why that's preferable over a split / unique / paste method? — camille, Jun 16 '21 at 16:35
The reason is that I'm keenly interested in regex, and wanted to see how it works there, that's all. — Chris Ruehlemann, Jun 16 '21 at 17:22

Tim Biegeleisen · Accepted Answer · 2021-06-16T10:26:18.570


Here is a base R solution using gsub:

df$words <- gsub("(?<![^,])(.*?),(?=.*\\1)", "", df$words, perl=TRUE)
df

                               words
1      and,if,don't,is,give,to,my,go
2                               <NA>
3 many,times,like,so,one,no,bathroom

Data:

df <- data.frame(words = c("if,go,if,to,go,and,if,go,don't,is,give,to,my,go",
                           NA,
                           "like,like,so,many,times,like,so,one,no,no,no,bathroom"))

Here is an explanation of the regex pattern:

(?<![^,])  assert that what precedes is either a comma or the start of the string
(.*?)      match AND capture a word, up until reaching
,          the nearest following comma
(?=.*\\1)  then assert that we can still find this same word later on
           in the string, indicating that what we just matched is a duplicate

Then, we replace such duplicate words with empty string, to effectively remove them from the input.
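One caveat worth illustrating: the lookahead `(?=.*\\1)` accepts the captured text anywhere later in the string, including inside a longer word. The sketch below shows this on an illustrative input that is not from the original post, and compares it with a stricter variant. The names `loose_pattern` and `strict_pattern` and the stricter pattern itself are assumptions added for illustration, not part of the answer.

```r
# Illustrative input (not from the original post): "no" is also a prefix
# of "not" and "nothing".
x <- "no,not,nothing"

# Pattern from the answer: the lookahead also succeeds when the captured
# text only occurs inside a longer word later in the string.
loose_pattern <- "(?<![^,])(.*?),(?=.*\\1)"
loose <- gsub(loose_pattern, "", x, perl = TRUE)
loose
# [1] "nothing"

# Stricter variant (an assumption, not the answer's pattern):
#   ([^,]+)      capture one whole comma-free word
#   (?:.*,)?     optionally skip ahead to just after some later comma
#   \\1(?![^,])  require the same word, followed by a comma or end of string
strict_pattern <- "(?<![^,])([^,]+),(?=(?:.*,)?\\1(?![^,]))"
strict <- gsub(strict_pattern, "", x, perl = TRUE)
strict
# [1] "no,not,nothing"
```

On the data from the question, both patterns should give the same result, because every word removed there has a genuine whole-word duplicate later in the string.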

edited Jun 16 '21 at 10:26

answered Jun 16 '21 at 10:24

Tim Biegeleisen


Order of words changed? – zx8754 Jun 16 '21 at 10:25
@zx8754 I guess the OP wants to _retain_ the earliest occurrence and remove subsequent duplicates. You can't use my regex solution like that, because it removes earlier occurrences. Hopefully this can still help the OP. – Tim Biegeleisen Jun 16 '21 at 10:27
100% agreed, just wanted to highlight the output is different. – zx8754 Jun 16 '21 at 10:28
Brilliant. I'm not sure though I understand this: "`(?<![^,])` assert that what precedes is either a comma or the start of the string" - aren't we asserting that what precedes is **not (`!`) not (`[^,]`)** a comma? – Chris Ruehlemann Jun 16 '21 at 10:35
@ChrisRuehlemann Correct, and not-not-a comma means a comma. But that negative lookbehind _also_ matches nothing, i.e. the start of the string. This is an important edge case, because we might want to remove the very first word, should it have some duplicate downstream in the input. – Tim Biegeleisen Jun 16 '21 at 10:42
So this `(?<=,|^)` would work too? – Chris Ruehlemann Jun 16 '21 at 10:45
Yes it would, but it's more verbose (and probably less performant) than `(?<![^,])`. – Tim Biegeleisen Jun 16 '21 at 10:46
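As the comments above note, the accepted pattern keeps the last occurrence of each word rather than the first. One possible way to keep the earliest occurrence while still using a regex is to remove one later duplicate per pass and repeat until nothing changes. This is only a sketch: the helper name `keep_first` and its pattern are assumptions, not something proposed in the thread, and the loop may be slow on long strings.

```r
words <- c("if,go,if,to,go,and,if,go,don't,is,give,to,my,go",
           NA,
           "like,like,so,many,times,like,so,one,no,no,no,bathroom")

# Hypothetical helper: keeps the first occurrence of each word.
keep_first <- function(x) {
  # ([^,]+)          a whole word at the start or right after a comma
  # ((?:,[^,]+)*?)   as few intervening words as possible
  # ,\\1(?![^,])     a later whole-word repeat of the captured word
  pat <- "(?<![^,])([^,]+)((?:,[^,]+)*?),\\1(?![^,])"
  repeat {
    # Drop a single later duplicate per string and per pass.
    y <- sub(pat, "\\1\\2", x, perl = TRUE)
    if (identical(y, x)) return(x)  # stop once no string changes
    x <- y
  }
}

keep_first(words)
```

With the data from the question, this should return the strings the OP asked for, with the missing value left as `NA`.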

score 2 · Answer 2 · answered Jun 16 '21 at 10:20

lapply(strsplit(df$words, ","), function(x) paste(unique(x), collapse = ","))

# [[1]]
# [1] "if,go,to,and,don't,is,give,my"
# 
# [[2]]
# [1] "NA"
# 
# [[3]]
# [1] "like,so,many,times,one,no,bathroom"
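Note that the output above contains the string "NA" for the missing element, because `paste()` converts `NA` to text. A minimal sketch that keeps it as a true missing value and returns a character vector instead of a list; the helper name `unique_words` is an assumption, not part of the original answer:

```r
words <- c("if,go,if,to,go,and,if,go,don't,is,give,to,my,go",
           NA,
           "like,like,so,many,times,like,so,one,no,no,no,bathroom")

# Hypothetical helper: split on commas, drop repeats, and re-join,
# passing missing values through unchanged.
unique_words <- function(x) {
  vapply(strsplit(x, ",", fixed = TRUE), function(w) {
    # strsplit() returns a single NA for a missing input string
    if (length(w) == 1L && is.na(w)) return(NA_character_)
    paste(unique(w), collapse = ",")
  }, character(1))
}

res <- unique_words(words)
is.na(res[2])  # TRUE: the second element is a real NA, not the string "NA"
```

Because the result is a plain character vector, it can be assigned straight back to the column, for example `df$words <- unique_words(df$words)`.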

s_baldur


Please do not rush with answers, if this is the answer, then it is a duplicate. I think OP wants regex solution. – zx8754 Jun 16 '21 at 10:23
Oh, if I misunderstood then I can delete the post. If it's a duplicate I guess there is no harm in answering even though it will be marked as duplicate? – s_baldur Jun 16 '21 at 10:28
Answering duplicates is discouraged. https://meta.stackexchange.com/q/10841/228487 – zx8754 Jun 16 '21 at 10:33
Agree, just at the time I forgot to consider if a duplicate exists. – s_baldur Jun 16 '21 at 10:35
