
I have a character vector of words in R, as shown below:

 myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")

I want to remove every word found in that vector from the text below:

 myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."

After removing the unwanted myList words, myText should look like this:

  This is Sample Text, which is better and cleaned, where is not equal to. This is messy text.

I was using:

  stringr::str_replace_all(myText,"[^a-zA-Z\\s]", " ")

But this does not help. What should I do?

Wiktor Stribiżew
LeMarque
  • You may use `gsub(paste0("\\s*(?<!\\w)(?:", paste(myList, collapse = "|"), ")(?!\\w)"), "", myText, perl=TRUE)` – Wiktor Stribiżew Jul 04 '18 at 12:54
  • Thanks Wiktor... this is good one.. I hope it doesn't remove `ct` from `Product` if `ct` is in myList ?? – LeMarque Jul 04 '18 at 13:15
  • Right, it won't. Still, you would need to escape special chars in the `myList` list (if any). – Wiktor Stribiżew Jul 04 '18 at 13:27
  • See https://ideone.com/4zraK1 - this is the final solution. The solutions in the [suggested close reason](https://stackoverflow.com/questions/34872957/remove-strings-found-in-vector-1-from-vector-2) will not work in some edge cases as it does not escape special chars. Avinash's solution only handles dots and does not handle start of word positions and trailing word boundary in case inputs end with non-word chars. – Wiktor Stribiżew Jul 04 '18 at 13:33
  • @Sotos Do you think I can post an answer [in your thread](https://stackoverflow.com/questions/34872957/remove-strings-found-in-vector-1-from-vector-2), or shall we reopen this one? Do you think I should post this answer at all? – Wiktor Stribiżew Jul 05 '18 at 08:43
  • @WiktorStribiżew IMHO It is better to add it in my thread (i.e. the dupe target). At least that is what I do when I want to add a new answer to a duped question. As for whether or not you should add it, I can't really tell as my regex is not to a level where I can judge If your current solution has value. But, by all means, If you want to reopen this one, we can do this too. – Sotos Jul 05 '18 at 08:48
  • @Sotos I agree the questions are asked the same way, so it only makes sense to reopen this one if the word boundary check is explicitly required in the question title. If not, I agree it should stay closed, and I will add an answer in your thread. – Wiktor Stribiżew Jul 05 '18 at 08:52
  • @WiktorStribiżew Wiktor, I checked, my question is different than it was marked duplicate of other. I don't know why this is still marked as duplicate. – LeMarque Jul 05 '18 at 12:22
  • @WiktorStribiżew can you please use your code to answer this thread, I checked it works exactly the same ways I wanted it to work. – LeMarque Jul 05 '18 at 12:23
  • Ok, now, after you edited the question, it is no longer a dupe. I amended the wording though. – Wiktor Stribiżew Jul 05 '18 at 13:01

2 Answers

gsub(paste0(myList, collapse = "|"), "", myText)

gives:

[1] "This is  Sample  Text, which  is  better and cleaned , where  is not equal to . This is messy text ."
Lennyy

You may use a PCRE regex with the base R gsub function (pass perl=TRUE); the same pattern also works with the ICU regex engine in stringr::str_replace_all:

\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)

See the regex demo.

Details

  • \s* - 0 or more whitespaces
  • (?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current location
  • (?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items inside the character vector with the words you need to remove
  • (?!\w) - a negative lookahead that ensures there is no word char immediately after the current location.
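To illustrate what the two lookarounds buy you, here is a minimal sketch using the `ct` / `Product` case raised in the comments. The term list (`ct`, `at`) and the sample string are made up for illustration and are not part of the question's data.

```r
# Hypothetical terms: "ct" and "at" occur both as standalone words
# and inside longer words ("Product", "data").
x <- "Product ct at data"

# With lookarounds: only standalone occurrences (plus leading spaces) are removed.
guarded <- gsub("\\s*(?<!\\w)(?:ct|at)(?!\\w)", "", x, perl = TRUE)
print(guarded)
## => [1] "Product data"

# Without lookarounds: every substring match is removed, damaging other words.
unguarded <- gsub("\\s*(?:ct|at)", "", x, perl = TRUE)
print(unguarded)
## => [1] "Produ da"
```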

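NOTE: We cannot use the \b word boundary here because the items in the myList character vector may start or end with non-word characters, and the meaning of \b depends on context: it only matches between a word char and a non-word char, so it fails before the - in -1.00 when a space precedes it.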
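A small sketch of that difference, using a shortened sample string (an assumption, not the question's full text) and only the `-1.00` item:

```r
x <- "equal to -1.00. Done"

# \b needs a word char on one side and a non-word char on the other.
# Between " " and "-" both sides are non-word chars, so \b fails and nothing is removed.
with_b <- gsub("\\s*\\b-1\\.00\\b", "", x, perl = TRUE)
print(with_b)
## => [1] "equal to -1.00. Done"

# The lookarounds only require that no word char is adjacent, so the item is removed.
with_look <- gsub("\\s*(?<!\\w)-1\\.00(?!\\w)", "", x, perl = TRUE)
print(with_look)
## => [1] "equal to. Done"
```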

See an R demo online:

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n")  # print the generated pattern (cat has no collapse argument)
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."

Details

  • escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special chars that need escaping in a PCRE pattern
  • paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated list of alternatives from the search term vector.
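
As mentioned at the top of this answer, the same pattern should also work with stringr. The sketch below hard-codes the already escaped pattern and compares both engines; it assumes the stringr package may or may not be installed, so the stringr call is guarded.

```r
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."

# The pattern shown at the top of the answer, written as an R string literal.
pat <- "\\s*(?<!\\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\\.00)(?!\\w)"

# Base R with the PCRE engine.
base_result <- gsub(pat, "", myText, perl = TRUE)
print(base_result)

# stringr (ICU engine), only if the package is available.
if (requireNamespace("stringr", quietly = TRUE)) {
  icu_result <- stringr::str_replace_all(myText, pat, "")
  print(identical(base_result, icu_result))
}
```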
Wiktor Stribiżew