
My code has a for loop that takes ages to run. I was wondering how I can speed it up by using one of the apply family functions available in R.

The for loop that I want to change would look like this:

for (i in 1:200000) {
    a[i] = gsub(pattern[i], new_pattern[i], a[i])
}

Here pattern and new_pattern are both lists. What I want to achieve is to replace a character pattern in each line with a new one. I have tried the following:

sapply(c(1:200000),function(x) gsub(pattern[x],new_pattern[x], a[x]))

But it takes very long too. Any suggestions on how I can make my code faster?

Colonel Beauvel
dag90
  • Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. – David Arenburg Nov 02 '15 at 12:24
  • The looping is completely unnecessary; gsub, like almost all functions in the base package, is already vectorized. Just do gsub(pattern, new_pattern, a) instead. – kliron Nov 02 '15 at 12:24
  • As @kliron said, many functions in R are vectorized. For-loops have their uses, but they're rarely needed. – Heroka Nov 02 '15 at 12:26
  • @kliron not quite, `gsub` can't accept a vector of patterns and replacements. It is only vectorized over `x`. This is why OP needs to provide a reproducible example. – David Arenburg Nov 02 '15 at 12:29
  • If all you care about is speeding up a loop, try going parallel with 'foreach' – Carl Nov 02 '15 at 14:34
  • @Carl feel free to add your solution. An improvement can be made with the stringr package. – Colonel Beauvel Nov 02 '15 at 15:35
  • @Carl or using the builtin parallel package. Do mind that you have to take care how you do the parallelization, or else it might end up taking even longer... – Paul Hiemstra Nov 02 '15 at 15:41
  • @PaulHiemstra good point – Carl Nov 02 '15 at 16:15
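
To make the point raised in the comments concrete, here is a minimal sketch in base R. The vectors `a`, `pattern` and `new_pattern` are hypothetical toy data standing in for the asker's, which were never posted. It shows that `gsub` recycles only over `x`, and that `mapply` pairs the i-th pattern with the i-th replacement and string. It illustrates the semantics only; `mapply` is still an element-by-element loop and is not expected to be much faster than `sapply`.

```r
# Hypothetical toy data (the original question gives no reproducible example)
a           <- c("red apple", "green pear", "blue plum")
pattern     <- c("red", "green", "blue")
new_pattern <- c("R", "G", "B")

# gsub() is vectorized over x only: given a vector of patterns it uses the
# first element and emits a warning, so only "red" is replaced here.
first_only <- suppressWarnings(gsub(pattern, new_pattern, a))

# mapply() passes the i-th pattern, replacement and string to gsub() together.
pairwise <- mapply(gsub, pattern, new_pattern, a, USE.NAMES = FALSE)

print(first_only)
print(pairwise)
```

With this toy data, `first_only` leaves the second and third strings unchanged, while `pairwise` replaces one pattern per string.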

1 Answer


You can use str_replace_all from the stringr package with a named vector, where the names are the patterns and the values are the replacements:

library(stringr)

x = 'dog likes cat very much'
str_replace_all(x, setNames(c('babyboy','babygirl'), c('dog','cat')))

#[1] "babyboy likes babygirl very much"
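
The named-vector form above applies every pattern to the same string. For the pairwise case described in the question (one pattern and one replacement per element), a possible sketch relies on `str_replace_all` being vectorised over string, pattern and replacement. The vectors below are hypothetical toy data, and the patterns are interpreted as regular expressions.

```r
library(stringr)

# Hypothetical toy data: one pattern and one replacement per element of a
a           <- c("dog likes cat", "cat likes fish", "fish likes water")
pattern     <- c("dog", "cat", "fish")
new_pattern <- c("puppy", "kitten", "tuna")

# Element-wise: the i-th pattern is replaced by the i-th replacement
# in the i-th string only.
pairwise <- str_replace_all(a, pattern, new_pattern)

print(pairwise)
```

Note that "cat" in the first string and "fish" in the second are left alone, because each pattern is applied only to the string at the same position.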

Performance: roughly 8 times faster than looping over gsub in the following benchmark:

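set.seed(1)
x = paste0(sample(c(letters,' '), 100000, replace=TRUE, prob=c(rep(1/39, 26), 1/3)), collapse='')

# df is not defined in the original post; assumed here to hold all 26^4 = 456976 four-letter combinations
df = expand.grid(letters, letters, letters, letters)
patt = apply(df, 1, paste0, collapse='')
repl = as.character(1:456976)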

system.time({
    for (i in 1:456976){
        x = gsub(patt[i],repl[i], x)
    }
})
#   user  system elapsed 
#1574.41    2.41 1582.71

system.time(str_replace_all(x, setNames(repl, patt)))
#   user  system elapsed 
# 194.04    0.14  194.36
Colonel Beauvel
  • Do you have any idea how the performance of `str_replace_all` compares to looping over `gsub`? I think `str_replace_all` should be faster... – Paul Hiemstra Nov 02 '15 at 13:12