names <- c('Laars Anderson', 'Peter Grabowski')
text <- c('Laars Anderson needs to bla bla bla, reply from Peter Grabowski')
output <- c('needs to bla bla bla, reply from')

I'm using regex to clean up my text for text-mining purposes. The text is mostly email conversations with lots of words that are irrelevant to the final analysis, such as names, email addresses, etc.

I have a list of employee names and want to use it to remove those names from the email text.

Thanks!

Afiq Johari

4 Answers


You can also use `stri_replace_all_fixed` from the "stringi" package:

library(stringi)
stri_replace_all_fixed(text, names, "", vectorize_all=FALSE)
## [1] " needs to bla bla bla, reply from "

Get rid of the leading and trailing whitespace with trimws.
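Combining the replacement with `trimws`, using the question's sample data:

```r
library(stringi)

names <- c("Laars Anderson", "Peter Grabowski")
text  <- "Laars Anderson needs to bla bla bla, reply from Peter Grabowski"

# Remove every name as a fixed (non-regex) pattern, then trim the leftover whitespace
cleaned <- trimws(stri_replace_all_fixed(text, names, "", vectorize_all = FALSE))
cleaned
# [1] "needs to bla bla bla, reply from"
```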

A5C1D2H2I1M1N2O1R2T1
  • This old question - https://stackoverflow.com/a/26171700/496803 - can be adapted for this sort of thing in a vectorised fashion, including a pretty interesting `Reduce` option from Flodel - `Reduce(function(str, args) gsub(args, "", str), names, init = text)` – thelatemail Dec 17 '20 at 03:39
  • I was initially going to post a `for` loop since it also seemed that the option for `fixed = TRUE` would be helpful. (+1 to you over there). That `Reduce` approach is great too! – A5C1D2H2I1M1N2O1R2T1 Dec 17 '20 at 03:41
  • The `Reduce` adaption thoroughly confused me the first time I saw it. And the second and third times too. – thelatemail Dec 17 '20 at 03:42
  • `Warning messages: 1: In stri_replace_all_fixed(text, names, "", vectorize_all = FALSE) : empty search patterns are not supported` some edge cases here, the actual names vector is around 10,000, And the text may contain Danish characters. Just in case, I've did some prior cleaning on the text to remove special characters. Thanks, really appreciate this – Afiq Johari Dec 17 '20 at 04:14
  • @AfiqJohari, I think that Tim's `for` loop is going to be one of the better approaches, and it shouldn't be a problem in terms of performance. If you can isolate a vector that generates the warning you've shared here, that would be helpful with further troubleshooting. – A5C1D2H2I1M1N2O1R2T1 Dec 17 '20 at 04:16
  • @A5C1D2H2I1M1N2O1R2T1 thanks for that, I still find the `for` loop pretty slow, but it works for the time being. I'll create another question once I get a better understanding of the cause. – Afiq Johari Dec 17 '20 at 04:25
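The "empty search patterns" warning mentioned in the comments typically means the names vector contains `NA`s or empty strings; a minimal sketch of pre-filtering them out first (the names vector below is hypothetical, chosen to reproduce the issue):

```r
library(stringi)

# Hypothetical names vector containing the kinds of entries that trigger the warning
names <- c("Laars Anderson", "", NA, "Peter Grabowski")
text  <- "Laars Anderson needs to bla bla bla, reply from Peter Grabowski"

# Drop NA entries first, then zero-length strings, before passing patterns to stringi
names <- names[!is.na(names)]
names <- names[nzchar(names)]

cleaned <- trimws(stri_replace_all_fixed(text, names, "", vectorize_all = FALSE))
cleaned
# [1] "needs to bla bla bla, reply from"
```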

You can use:

names <- c('Laars Anderson', 'Peter Grabowski')
text <- c('Laars Anderson needs to bla bla bla, reply from Peter Grabowski')

gsub(paste0(names, collapse = ' | '), '', text)
#[1] "needs to bla bla bla, reply from"
Ronak Shah
  • Thanks @ronak, the solutions work for the reproducible example above, but I guess there're some limitations on the actual length of the vector name to be used. `Error in gsub(paste0(names, collapse = " | "), "", text) : assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634`, I tried the mgsub alternative function from qdap, but the same issue – Afiq Johari Dec 17 '20 at 04:16
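One workaround for the `tre-compile.c` assertion failure mentioned in the comment is to split the names into smaller chunks, so that no single alternation exceeds what the default TRE engine can compile; a rough sketch (the chunk size here is arbitrary and would be tuned for a real ~10,000-name vector):

```r
names <- c("Laars Anderson", "Peter Grabowski")
text  <- "Laars Anderson needs to bla bla bla, reply from Peter Grabowski"

chunk_size <- 1  # illustrative; something like 500 per chunk may suit a large vector
chunks <- split(names, ceiling(seq_along(names) / chunk_size))

# One gsub call per chunk keeps each alternation small
for (chunk in chunks) {
  text <- gsub(paste0("\\b(", paste0(chunk, collapse = "|"), ")\\b"), "", text)
}

# Collapse doubled spaces left behind by the removals, then trim
cleaned <- trimws(gsub("\\s+", " ", text))
cleaned
# [1] "needs to bla bla bla, reply from"
```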

This is similar to @Ronak's answer, but uses proper word boundaries and whitespace patterns for a cleaner approach:

names <- c('Laars Anderson', 'Peter Grabowski')
text <- 'Laars Anderson needs to bla bla bla Peter Grabowski, reply from Peter Grabowski'
regex <- paste0("\\b\\s*(?:", paste0(names, collapse="|"), ")\\b\\s*")
output <- trimws(gsub(regex, " ", text))
output

[1] "needs to bla bla bla , reply from"

If your names vector is really large, to the point where the regex engine can't handle the size of the alternation, then you can always just iterate and make the replacements one name at a time:

names <- c('Laars Anderson', 'Peter Grabowski')
text <- 'Laars Anderson needs to bla bla bla Peter Grabowski, reply from Peter Grabowski'
for (name in names) {
    text <- gsub(paste0("\\b\\s*", name, "\\s*\\b"), "", text)
}
text <- trimws(text)
text

[1] "needs to bla bla bla, reply from"
Tim Biegeleisen
  • Thanks, are you familiar with this error though? `Error in gsub(regex, " ", text) : assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634` My names vector is pretty long, more than 10,000 user names. Alternatives I found that some ask to use mgsub from qdap library or split this user name vector into smaller chunk – Afiq Johari Dec 17 '20 at 04:04
  • @AfiqJohari I'm actually not familiar with that error, but I know the cause. It's that the alternation is too big. Let me edit my question with a workaround. – Tim Biegeleisen Dec 17 '20 at 04:06
  • @AfiqJohari, if the `for` loop seems slow, I wonder whether there's a justification to switch to a simpler `gsub` with `fixed = TRUE` rather than handling word boundaries and whitespace. Any thoughts, Tim? – A5C1D2H2I1M1N2O1R2T1 Dec 17 '20 at 04:46
  • @A5C1D2H2I1M1N2O1R2T1 if `fixed=TRUE`, it will fail on zero-length pattern `Error in gsub(pattern = name, replacement = "", comment, fixed = TRUE) : zero-length pattern` – Afiq Johari Dec 17 '20 at 05:00
  • @A5C1D2H2I1M1N2O1R2T1 I retract my earlier comment. Setting fixed to TRUE means we can't use a regex alternation, so sadly that's not an option. However, we _could_ try removing word boundaries and whitespace. That might speed things up a bit. – Tim Biegeleisen Dec 17 '20 at 05:05

In case `names` is too long for the regex engine, you can use `Reduce` to loop over each name:

trimws(Reduce(function(x,y) gsub(y, "", x), paste0("\\b", names, "\\b"), text))
#[1] "needs to bla bla bla, reply from"

or using `perl = TRUE`, whose PCRE engine handles larger alternations than the default TRE engine:

trimws(gsub(paste0("\\b", names, "\\b", collapse="|"), "", text, perl=TRUE))
GKi