I am trying to remove certain whole words from the string elements of a vector by replacing them with empty strings. These are the vectors in question:

test <- c("PALMA DE MALLORCA", "THE RICH AND THE POOR", "A CAMEL IN THE DESERT", "SANTANDER SL", "LA")

lista <- c("EL", "LA", "ES", "DE", "Y", "DEL", "LOS", "S.L.", "S.A.", "S.C.", "LAS",
       "DEL", "THE", "OF", "AND", "BY", "S", "L", "A", "C", "SA", "SC", "SL")

Then if we apply the mgsub function as it is, we get the following output:

library(qdap)
mgsub(lista, "", test)
# [1] "PM MOR"   "RIH POOR" "M IN ERT" "NTER"     ""  

So I wrapped each pattern in word boundaries and reran the call:

lista <- paste("\\b", lista, "\\b", sep = "")
mgsub(lista, "", test)
# [1] "PALMA DE MALLORCA"     "THE RICH AND THE POOR" "A CAMEL IN THE DESERT"
# [4] "SANTANDER SL"          "LA"   

I cannot get the word boundary regex to work for this function.

Tyler Rinker
MN Beitelmal
  • First, try `lista <- paste("(?<!\\w)", lista, "(?!\\w)", sep = "")` and then `mgsub(lista, "", test, perl=TRUE)`. Word boundaries won't work for the items in `lista` that end with `.`. – Wiktor Stribiżew Oct 29 '15 at 11:42
  • @stribizhev tried it but still doesn't extract the elements in test from the pattern lista – MN Beitelmal Oct 29 '15 at 11:45
  • @stribizhev I removed all punctuation, but I still can't get it to function. Any ideas? – MN Beitelmal Oct 29 '15 at 11:52
  • The default `fixed = TRUE` is likely what is causing the issue. Use `fixed = FALSE`, as in: `mgsub(lista, "", test, fixed=FALSE); ##[1] "PALMA MALLORCA" "RICH POOR" "CAMEL IN DESERT" "SANTANDER" "" ` – Tyler Rinker Oct 29 '15 at 12:02
  • @TylerRinker that was exactly the issue, I didn't understand that argument fully before, and I set it as TRUE. Thanks – MN Beitelmal Oct 29 '15 at 12:07
  • @MNBeitelmal: According to the [documentation](http://www.inside-r.org/packages/cran/qdap/docs/multigsub), `fixed=TRUE` is indeed the culprit. However, you still need to handle list items like `S.A.`, right? Did you test them with my approach, or does `\\b` still work best for you? – Wiktor Stribiżew Oct 29 '15 at 12:53
  • @stribizhev "\\b" worked for my purposes, but I was interested to try your approach so I did. Is "(?!\\w)" some kind of regex? – MN Beitelmal Oct 30 '15 at 10:34
  • `(?!\w)` is a negative look-ahead that will fail a match (=no match will be found) if a word character appears after the current position in string. – Wiktor Stribiżew Oct 30 '15 at 10:36

1 Answer

According to multigsub {qdap} documentation:

mgsub(pattern, replacement = NULL, text.var, leadspace = FALSE, trailspace = FALSE, fixed = TRUE, trim = TRUE, ...)
...
fixed
logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments.

To make sure your vector of search terms is parsed as regular expressions, you need to "manually" set the fixed parameter to FALSE.
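
The effect of the fixed argument can be illustrated with base R's gsub, which is used here only as a stand-in for mgsub so that the snippet runs without qdap installed. This is a minimal sketch; the two input strings are taken from the question, and the variable names are made up for illustration.

```r
# Base R sketch: gsub() stands in for qdap::mgsub() (an assumption made so
# that the example has no package dependencies).
x <- c("PALMA DE MALLORCA", "A CAMEL IN THE DESERT")

# fixed = TRUE: the pattern is searched for as the literal characters \bDE\b.
# That character sequence never occurs in x, so nothing is replaced.
literal <- gsub("\\bDE\\b", "", x, fixed = TRUE)

# fixed = FALSE (gsub's default): \b is interpreted as a word boundary,
# so only the standalone word DE is removed and DESERT is left alone.
regex <- gsub("\\bDE\\b", "", x)

print(literal)
print(regex)
```

With fixed = TRUE both strings come back unchanged, which matches the behaviour reported in the question.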

Another important note: a word boundary placed after . matches only if a word character follows it, so \b fails when an item such as S.A. is followed by a space or by the end of the string. It is safer to use the (?!\w) subpattern in this case. Look-arounds in R require Perl-compatible regex (perl = TRUE). Thus, I suggest using this (if a non-word character can appear only at the end of the pattern):

lista <- paste("\\b", lista, "(?!\\w)", sep = "")

or (if there can be a non-word character at the beginning, too):

lista <- paste("(?<!\\w)", lista, "(?!\\w)", sep = "")

and then

mgsub(lista, "", test, fixed=FALSE, perl=TRUE)
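
The difference between a trailing \b and a trailing (?!\w) can be checked with base R's gsub and perl = TRUE. This is a sketch under assumptions: the three input strings are hypothetical, gsub again stands in for mgsub, and the dots are escaped as \\. (the patterns above leave them unescaped, where . matches any character).

```r
# Hypothetical inputs: S.A. at the end, before a space, and glued to a letter.
x <- c("ACME S.A.", "ACME S.A. HOLDING", "S.A.X")

# Trailing \b: a boundary after the final dot exists only when a word
# character follows, so this matches just the glued case "S.A.X".
with_boundary <- gsub("\\bS\\.A\\.\\b", "", x, perl = TRUE)

# Look-arounds: require that no word character precedes S or follows the
# final dot, so the standalone S.A. is removed and "S.A.X" is left alone.
with_lookaround <- gsub("(?<!\\w)S\\.A\\.(?!\\w)", "", x, perl = TRUE)

print(with_boundary)
print(with_lookaround)
```

The leftover spaces in the look-around result are expected with plain gsub; collapsing them would be a separate clean-up step.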
Wiktor Stribiżew