Building on top of two questions I previously asked:

R: How to prevent memory overflow when using mgsub in vector mode?

gsub speed vs pattern length

I like @Tyler's suggestion to use fixed=TRUE, since it speeds up the calculations significantly. However, it's not always applicable. I need to substitute, say, caps as a stand-alone word, with or without surrounding punctuation. It's not known a priori what can precede or follow the word, but it must be one of the regular punctuation marks (, . ! - + etc.); it cannot be a digit or a letter. In the example below, capsule must stay as is.

i = "Here is the capsule, caps key, and two caps, or two caps. or even three caps-"          

orig = "caps"
change = "cap"

gsub_FixedTrue <- function(i) {
  i = paste0(" ", i, " ")
  orig = paste0(" ", orig, " ")
  change = paste0(" ", change, " ")

  i = gsub(orig,change,i,fixed=TRUE)
  i = gsub("^\\s|\\s$", "", i, perl=TRUE)

  return(i)
}

#Second fastest, doesn't clog memory
gsub_FixedFalse <- function(i) {

  i = gsub(paste0("\\b",orig,"\\b"),change,i)

  return(i)
}

print(gsub_FixedTrue(i)) #wrong
print(gsub_FixedFalse(i)) #correct

Results (the second output is the desired one):

[1] "Here is the capsule, cap key, and two caps, or two caps. or even three caps-"
[1] "Here is the capsule, cap key, and two cap, or two cap. or even three cap-"
Alexey Ferapontov
  • So what precisely is the question? You want to be able to use `fixed=TRUE` with a pattern that's not fixed? It doesn't work like that. Or are you trying to create a fixed expression equivalent to a non-fixed expression? The whole point of regular expressions is that matching like that is tedious with fixed strings. – MrFlick Feb 05 '15 at 21:48
  • Well, I was hoping that there might be a way to quickly strip the comma or full stop, do `gsub` with `fixed=TRUE` and stitch comma back to changed word. I have a solution for all other cases – Alexey Ferapontov Feb 05 '15 at 21:50

1 Answer

Using parts of your previous question as test data, I think we can wrap each punctuation mark in a placeholder, as follows, without slowing things down too much:

line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key",
    "Here is the capsule, caps key, and two caps, or two caps. or even three caps-" )
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "cap")


line <- rep(line, 1700000/length(line))

line <- gsub("([[:punct:]])", " <DEL>\\1<DEL> ", line, perl=TRUE)

## Start    
line2 <- paste0(" ", line, " ")
e2 <-  paste0(" ", e, " ")
r2 <- paste0(" ", r, " ")


for (i in seq_along(e2)) {
    line2 <- gsub(e2[i], r2[i], line2, fixed=TRUE)
}

gsub("^\\s|\\s$| <DEL>|<DEL> ", "", line2, perl=TRUE)
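As a quick sanity check, the steps above can be wrapped in a small function and compared against the `\\b` regex from the question. This is only a sketch: the name `replace_words_fixed` and its arguments are hypothetical and not part of the code above, and it assumes the text never contains the literal token `<DEL>`.

```r
# Hypothetical helper (name and signature are illustrative assumptions).
# Replaces whole words with fixed = TRUE by temporarily isolating punctuation.
replace_words_fixed <- function(x, from, to) {
  # Surround each punctuation character with " <DEL>" and "<DEL> " so that
  # a word touching punctuation becomes delimited by spaces.
  x <- gsub("([[:punct:]])", " <DEL>\\1<DEL> ", x, perl = TRUE)
  # Pad both ends so words at the start or end are space-delimited too.
  x <- paste0(" ", x, " ")
  for (k in seq_along(from)) {
    x <- gsub(paste0(" ", from[k], " "), paste0(" ", to[k], " "), x, fixed = TRUE)
  }
  # Drop the padding and the markers together with the spaces they added.
  gsub("^\\s|\\s$| <DEL>|<DEL> ", "", x, perl = TRUE)
}

txt <- "Here is the capsule, caps key, and two caps, or two caps. or even three caps-"
fixed_result <- replace_words_fixed(txt, "caps", "cap")
regex_result <- gsub("\\bcaps\\b", "cap", txt, perl = TRUE)
print(fixed_result)
print(identical(fixed_result, regex_result))
```

One limitation worth keeping in mind: because each match consumes the space on both sides, two identical target words separated by a single space (for example `"caps caps"`) only get the first one replaced in a single pass, which differs from the `\\b` version.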
Tyler Rinker
  • Thank you! I'll try it out on real data that I am working with and have known results with 'fixed = FALSE'. – Alexey Ferapontov Feb 06 '15 at 00:55
  • Tyler, you are a genius! This works! Thank you very much. May I just suggest using ` i <- gsub("([[:punct:]])", " <DEL>\\1<DEL> ", i, perl=TRUE)` and ` i = gsub("^\\s|\\s$| <DEL>|<DEL> ", "", i, perl=TRUE) ` to handle cases like `-caps,`? – Alexey Ferapontov Feb 06 '15 at 02:50
  • Makes sense, you can add as an edit (if you have the points to do so; if not I'll add it). Glad it was helpful. – Tyler Rinker Feb 06 '15 at 03:02
  • I probably don't have enough points - tried but didn't work. Thank you again! That trick was great! – Alexey Ferapontov Feb 06 '15 at 03:07
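The `-caps,` case raised in the comments can be illustrated with a short sketch. The function `swap_word` and its `both_sides` flag are hypothetical, introduced only to contrast a marker placed just before each punctuation character with one placed on both sides; neither is claimed to be the exact code from any version of the answer.

```r
# Hypothetical illustration: marker before punctuation only vs. on both sides.
swap_word <- function(x, from, to, both_sides = TRUE) {
  marker <- if (both_sides) " <DEL>\\1<DEL> " else " <DEL>\\1"
  x <- gsub("([[:punct:]])", marker, x, perl = TRUE)
  x <- paste0(" ", x, " ")
  x <- gsub(paste0(" ", from, " "), paste0(" ", to, " "), x, fixed = TRUE)
  # Remove the padding and any markers, whichever variant was used.
  gsub("^\\s|\\s$| <DEL>|<DEL> ", "", x, perl = TRUE)
}

one_sided <- swap_word("-caps,", "caps", "cap", both_sides = FALSE)
two_sided <- swap_word("-caps,", "caps", "cap", both_sides = TRUE)
print(one_sided)  # "-caps," : the leading "-" still touches the word, so no match
print(two_sided)  # "-cap,"  : the word is space-delimited on both sides
```

With the marker on one side only, punctuation that precedes a word stays glued to it, so the space-padded fixed pattern cannot match; padding both sides avoids that.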