Remove a list of whole words that may contain special chars from a character vector without matching parts of words

Question

I have a list of words in R as shown below:

 myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")

I want to remove the words found in the above list from the text below:

 myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."

After removing the unwanted myList words, myText should look like:

  This is Sample Text, which is better and cleaned, where is not equal to. This is messy text.

I was using:

  stringr::str_replace_all(myText,"[^a-zA-Z\\s]", " ")

But this does not help. What should I do?

You may use `gsub(paste0("\\s*(?<!\\w)(?:", paste(myList, collapse = "|"), ")(?!\\w)"), "", myText, perl=TRUE)` — Wiktor Stribiżew, Jul 04 '18 at 12:54
Thanks Wiktor... this is good one.. I hope it doesn't remove `ct` from `Product` if `ct` is in myList ?? — LeMarque, Jul 04 '18 at 13:15
Right, it won't. Still, you would need to escape special chars in the `myList` list (if any). — Wiktor Stribiżew, Jul 04 '18 at 13:27
See https://ideone.com/4zraK1 - this is the final solution. The solutions in the [suggested close reason](https://stackoverflow.com/questions/34872957/remove-strings-found-in-vector-1-from-vector-2) will not work in some edge cases as it does not escape special chars. Avinash's solution only handles dots and does not handle start of word positions and trailing word boundary in case inputs end with non-word chars. — Wiktor Stribiżew, Jul 04 '18 at 13:33
@Sotos Do you think I can post an answer [in your thread](https://stackoverflow.com/questions/34872957/remove-strings-found-in-vector-1-from-vector-2), or shall we reopen this one? Do you think I should post this answer at all? — Wiktor Stribiżew, Jul 05 '18 at 08:43
@WiktorStribiżew IMHO It is better to add it in my thread (i.e. the dupe target). At least that is what I do when I want to add a new answer to a duped question. As for whether or not you should add it, I can't really tell as my regex is not to a level where I can judge If your current solution has value. But, by all means, If you want to reopen this one, we can do this too. — Sotos, Jul 05 '18 at 08:48
@Sotos I agree the questions are asked the same way, so it only makes sense to reopen this one if the word boundary check is explicitly required in the question title. If not, I agree it should stay closed, and I will add an answer in your thread. — Wiktor Stribiżew, Jul 05 '18 at 08:52
@WiktorStribiżew Wiktor, I checked, my question is different than it was marked duplicate of other. I don't know why this is still marked as duplicate. — LeMarque, Jul 05 '18 at 12:22
@WiktorStribiżew can you please use your code to answer this thread, I checked it works exactly the same ways I wanted it to work. — LeMarque, Jul 05 '18 at 12:23
Ok, now, after you edited the question, it is no longer a dupe. I amended the wording though. — Wiktor Stribiżew, Jul 05 '18 at 13:01

score 1 · Answer 1 · answered Jul 04 '18 at 12:51

gsub(paste0(myList, collapse = "|"), "", myText)

gives:

[1] "This is  Sample  Text, which  is  better and cleaned , where  is not equal to . This is messy text ."

Lennyy


This also removes `ct` from `Product`, returning `Produ`, so the above code does not work properly. – LeMarque Jul 04 '18 at 13:10
Apologies, did not realize that, use Wiktor's code instead. :) – Lennyy Jul 04 '18 at 13:26
Yes thanks. I think Wiktor's code works well in this case. – LeMarque Jul 05 '18 at 12:22

score 1 · Accepted Answer · answered Jul 05 '18 at 12:58

You may use a PCRE regex with the base R gsub function (the same pattern also works with the ICU regex engine used by stringr::str_replace_all):

\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)

See the regex demo.

Details

\s* - 0 or more whitespaces
(?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current location
(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items inside the character vector with the words you need to remove
(?!\w) - a negative lookahead that ensures there is no word char immediately after the current location.
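
To illustrate the concern raised in the comments (whether `ct` would be stripped from `Product`), here is a minimal sketch. The word list and sample text are made up for the illustration and are not taken from the question:

```r
# Hypothetical inputs, chosen only to show whole-word vs. partial matching
words <- c("ct", "at")
txt <- "Product ct data at Category"

# Plain alternation: also matches inside longer words
naive <- gsub(paste(words, collapse = "|"), "", txt)
naive
## => [1] "Produ  da  Cegory"

# Lookaround-guarded alternation: only standalone items are removed
pat <- paste0("\\s*(?<!\\w)(?:", paste(words, collapse = "|"), ")(?!\\w)")
guarded <- gsub(pat, "", txt, perl = TRUE)
guarded
## => [1] "Product data Category"
```

Neither item here contains regex metacharacters, so no escaping is needed in this sketch.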

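NOTE: We cannot use the \b word boundary here because items in the myList character vector may start or end with non-word characters, and the meaning of \b is context-dependent: placed before a non-word char such as -, \b only matches when a word char comes right before it.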
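
A small sketch of that difference, using a made-up sample string (not from the question) in which `-1.00` appears once as a standalone item and once glued to a letter:

```r
# Hypothetical text: first "-1.00" stands alone, second follows a word char
txt <- "equal to -1.00. but x-1.00 here"

# \b before "-" requires a word char on its left, so only "x-1.00" matches
with_b <- gsub("\\b-1\\.00\\b", "", txt, perl = TRUE)
with_b
## => [1] "equal to -1.00. but x here"

# Lookarounds only require that NO word char is adjacent on either side
with_look <- gsub("(?<!\\w)-1\\.00(?!\\w)", "", txt, perl = TRUE)
with_look
## => [1] "equal to . but x-1.00 here"
```

With \b the result is the opposite of what is wanted: the standalone item survives and the embedded one is removed.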

See an R demo online:

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n")
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."

Details

escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special chars that need escaping in a PCRE pattern
paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated alternation list from the search term vector.
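
If this is needed in more than one place, the steps could be wrapped in a small helper. The following is only a sketch: the names `escape_regex` and `remove_whole_words` are hypothetical, and the escaping differs slightly from `escape_for_pcre` above (it runs with perl = TRUE and also escapes `]` and `}`). Since gsub is vectorized, the escaping is applied to the whole vector without sapply.

```r
# Hypothetical helper: backslash-escape regex metacharacters in each item
escape_regex <- function(s) {
  gsub("([.|()\\^{}+$*?\\[\\]\\\\])", "\\\\\\1", s, perl = TRUE)
}

# Hypothetical helper: remove standalone occurrences of `words` from `text`,
# together with any whitespace directly in front of them
remove_whole_words <- function(text, words) {
  alternation <- paste(escape_regex(words), collapse = "|")
  pat <- paste0("\\s*(?<!\\w)(?:", alternation, ")(?!\\w)")
  gsub(pat, "", text, perl = TRUE)
}

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
cleaned <- remove_whole_words(myText, myList)
cleaned
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."
```

Because the dot is escaped, an item such as `a.b` only removes the literal text `a.b` and leaves `axb` alone.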
