2

Given a string, I need to make many substitutions for different patterns:

library(data.table)

subst <- fread(c("
                 regex         ; replacement
                 abc\\w*\\b    ; alphabet
                 red           ; color
                 \\d+          ; number
                 ")
             , sep = ";"
             )

> subst
        regex replacement
1: abc\\w*\\b    alphabet
2:        red       color
3:       \\d+      number

So, for string text <- c( "abc 24 red bcd"), the expected output would be:

alphabet number color bcd

I tried the following code:

mapply(function(x, y) gsub(x, y, text, perl = TRUE)
       , subst$regex
       , subst$replacement
       )

The output I got:

"alphabet 24 red bcd"    "abc 24 color bcd"  "abc number red bcd" 

This code applies each substitution to the original string separately, so I get one result per pattern instead of a single string with all substitutions applied. What should I do to get the expected result?

Fabio Correa
  • 4
    A boring old `for` loop can work nicely in this circumstance - https://stackoverflow.com/questions/26171318/regex-for-preserving-case-pattern-capitalization/26171700 for instance. Plus various other options at that answer. – thelatemail Apr 03 '23 at 02:37
  • 1
    @thelatemail ... that's a great dupe, though I wish the title were a bit clearer: reading the title alone, it's no doubt why it wouldn't be a good candidate on skimming search results. Since you were involved on it, perhaps you can suggest an edit to the question title to clarify the underlying need? That way, it'll percolate to the top better with more relevance. – r2evans Apr 03 '23 at 02:45
  • @r2evans - i'll give it some thought - the question is quite specific but the answers more general. – thelatemail Apr 03 '23 at 02:50
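
The `for` loop suggested in the comments above can be sketched roughly as follows. This is a minimal sketch, not code from the linked answer: it assumes `subst` can be stood in for by a plain base-R data.frame with the same two columns, and `out` is an illustrative name. Each pass feeds the result of the previous substitution into the next one.

```r
# Stand-in for the question's `subst` table (assumption: a base data.frame
# with the same columns behaves like the fread() result for this purpose)
subst <- data.frame(
  regex       = c("abc\\w*\\b", "red", "\\d+"),
  replacement = c("alphabet", "color", "number"),
  stringsAsFactors = FALSE
)
text <- "abc 24 red bcd"

# Apply the substitutions in table order, carrying the result forward
out <- text
for (i in seq_len(nrow(subst))) {
  out <- gsub(subst$regex[i], subst$replacement[i], out, perl = TRUE)
}
out
# [1] "alphabet number color bcd"
```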

3 Answers

4

You can perform multiple substitutions by passing a named character vector to stringr::str_replace_all().

library(stringr)

str_replace_all(text, setNames(subst$replacement, subst$regex))
# "alphabet number color bcd"

As an alternative to setNames(), you could convert your table to a named vector using tibble::deframe().

library(stringr)
library(tibble)

str_replace_all(text, deframe(subst))
# "alphabet number color bcd"
zephryl
  • 3
    Nice and clean answer (+1), but it is worth noting there is no such thing as multiple substitutions at once when you're making potentially conflicting changes. E.g. there will be order effects: `x <- "aa ab cc"; str_replace_all(x, c("a."="11", "aa"="22")); str_replace_all(x, c("aa"="22", "a."="11"))` – thelatemail Apr 03 '23 at 02:46
  • 1
    Good point, thanks; I’ve removed the phrase “at once.” – zephryl Apr 03 '23 at 02:55
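
The order effect described in the comment above can be reproduced with chained base-R `gsub()` calls. This is only a sketch: `apply_in_order` is a hypothetical helper, and it uses `gsub()` as a stand-in rather than `str_replace_all()` itself. The same patterns give different results depending on which one runs first.

```r
# Hypothetical helper: apply each named pattern in turn, in vector order
apply_in_order <- function(x, patterns) {
  for (p in names(patterns)) {
    x <- gsub(p, patterns[[p]], x, perl = TRUE)
  }
  x
}

x <- "aa ab cc"

# "a." runs first and consumes "aa", so the "aa" pattern never matches
first <- apply_in_order(x, c("a." = "11", "aa" = "22"))
first
# [1] "11 11 cc"

# "aa" runs first, then "a." only has "ab" left to match
second <- apply_in_order(x, c("aa" = "22", "a." = "11"))
second
# [1] "22 11 cc"
```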
3

I think zephryl's answer is a great one-step solution.

The reason your mapply solution doesn't work is that each iteration operates on the original value of text, not on the result of the previous replacement. You therefore get one output per pattern, each with only that pattern replaced.

For that, we can use Reduce:

Reduce(function(txt, i) gsub(subst$regex[i], subst$replacement[i], txt, perl = TRUE),
       seq_len(nrow(subst)), init = text)
# [1] "alphabet number color bcd"

We can see what's happening step-by-step by adding accumulate=TRUE:

Reduce(function(txt, i) gsub(subst$regex[i], subst$replacement[i], txt, perl = TRUE),
       seq_len(nrow(subst)), init = text, accumulate = TRUE)
# [1] "abc 24 red bcd"            "alphabet 24 red bcd"      
# [3] "alphabet 24 color bcd"     "alphabet number color bcd"

In fact, as @thelatemail's recent comment and link show, they posted a nearly identical answer in 2014. The only difference is how it handles a reduction over two vectors (the two columns of subst). Both methods work equally well; use whichever reads more easily to you:

Reduce(function(txt, ptn) gsub(ptn[1], ptn[2], txt, perl = TRUE),
       Map(c, subst$regex, subst$replacement), init = text)
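
As a quick check that the two forms agree, here is a sketch that runs both on a small vector of strings. The plain vectors `regex` and `replacement` stand in for the columns of subst, and the second test string is made up for illustration. Because `gsub()` is vectorized over its input, `init` can hold several strings at once.

```r
# Stand-ins for subst$regex and subst$replacement (assumption)
regex       <- c("abc\\w*\\b", "red", "\\d+")
replacement <- c("alphabet", "color", "number")
texts       <- c("abc 24 red bcd", "red 7")  # second string is illustrative

# Variant 1: reduce over row indices
by_index <- Reduce(
  function(txt, i) gsub(regex[i], replacement[i], txt, perl = TRUE),
  seq_along(regex), init = texts
)

# Variant 2: reduce over (pattern, replacement) pairs
by_pair <- Reduce(
  function(txt, ptn) gsub(ptn[1], ptn[2], txt, perl = TRUE),
  Map(c, regex, replacement), init = texts
)

identical(by_index, by_pair)
# [1] TRUE
by_index
# [1] "alphabet number color bcd" "color number"
```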
r2evans
1

Following @r2evans's and @zephryl's answers, I tested both for speed:

library(microbenchmark)
library(stringr)

text <- c("abc 24 red bcd")

microbenchmark(
stringr = str_replace_all(text, setNames(subst$replacement, subst$regex)),
reduce  = Reduce(function(txt, ptn) gsub(ptn[1], ptn[2], txt, perl = TRUE),
       Map(c, subst$regex, subst$replacement), init = text),
times = 1000L)
Unit: microseconds
    expr   min     lq     mean median     uq    max neval
 stringr 435.2 459.60 539.6708  474.4 508.10 5210.8  1000
  reduce 405.2 416.75 462.6666  431.0 443.05 2580.7  1000

In this benchmark (one short string, three patterns), Reduce is about 10% faster by median time, but str_replace_all is more legible and more straightforward to write. Thank you.

Fabio Correa