I just did some benchmarking while trying to optimise some code and observed that strsplit with perl=TRUE is faster than running strsplit with perl=FALSE. For example,

set.seed(1)
ff <- function() paste(sample(10), collapse= " ")
xx <- replicate(1e5, ff())

system.time(t1 <- strsplit(xx, "[ ]"))
#  user  system elapsed 
# 1.246   0.002   1.268 

system.time(t2 <- strsplit(xx, "[ ]", perl=TRUE))
#  user  system elapsed 
# 0.389   0.001   0.392 

identical(t1, t2) 
# [1] TRUE

So my question (or rather a variation of the question in the title) is: under what circumstances would we absolutely need perl=FALSE (leaving aside the fixed and useBytes parameters)? In other words, what can be done with perl=FALSE that cannot be done with perl=TRUE?

Ricardo Saporta
Arun
  • I think `perl=FALSE` is just a design choice, since PCRE is not yet implemented in some regex functions (like `regexec`). Maybe `perl=TRUE` will be the default value in future R versions. – agstudy Jul 20 '13 at 01:40
  • @agstudy The benefits would have to be _extremely_ compelling to R Core because changing a default like that would break a _ton_ of existing code. – joran Jul 20 '13 at 01:45
  • @joran Good point. But if the benefit is proven in all cases, why not add some sort of "good warning" or "global setting" to encourage the use of PCRE... – agstudy Jul 20 '13 at 01:59
  • or even just a `options(perl.default = TRUE)` – Ricardo Saporta Jul 20 '13 at 02:01
  • @agstudy I have no strong feelings either way. I'm just saying, based on what I have observed on the r-devel mailing list, R Core is quite conservative when it comes to changes that break existing code. – joran Jul 20 '13 at 02:03
  • Possible duplicate of [regular expressions in base R: 'perl=TRUE' vs. the default (PCRE vs. TRE)](https://stackoverflow.com/questions/47240375/regular-expressions-in-base-r-perl-true-vs-the-default-pcre-vs-tre) – moodymudskipper Jun 15 '18 at 16:53

1 Answer

From the documentation ;)

Performance considerations

If you are doing a lot of regular expression matching, including on very long strings, you will want to consider the options used. Generally PCRE will be faster than the default regular expression engine, and fixed = TRUE faster still (especially when each pattern is matched only a few times).
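A minimal sketch of how one might check this claim locally, reusing the question's test data at a smaller size. The names `make_line`, `res_default`, `res_perl`, `res_fixed` and `elapsed` are illustrative only, and the timings will vary by machine and R version, so none are shown here.

```r
# Build test data similar to the question's (smaller, so it runs quickly).
set.seed(1)
make_line <- function() paste(sample(10), collapse = " ")
xx <- replicate(1e4, make_line())

# Time the same split with the default engine (TRE), PCRE, and fixed matching.
t_default <- system.time(res_default <- strsplit(xx, " "))
t_perl    <- system.time(res_perl    <- strsplit(xx, " ", perl = TRUE))
t_fixed   <- system.time(res_fixed   <- strsplit(xx, " ", fixed = TRUE))

# Collect elapsed times (seconds) side by side.
elapsed <- c(default = t_default[["elapsed"]],
             perl    = t_perl[["elapsed"]],
             fixed   = t_fixed[["elapsed"]])
print(elapsed)

# For a literal single-space separator, all three variants should agree.
print(identical(res_default, res_perl))
print(identical(res_default, res_fixed))
```

For a literal separator like this one, `fixed = TRUE` avoids the regex engine entirely, which is why the documentation singles it out.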

Of course, this does not answer the question, "Are there any dangers to always using perl=TRUE?"
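As one illustration of where the two engines can diverge (a sketch, not a complete answer): the default engine treats `\<` and `\>` as word-boundary anchors, whereas PCRE, as far as I know, treats a backslash before a non-alphanumeric character as a literal escape. The same pattern can therefore match different things depending on `perl`. The vector `x` and the result names below are made up for the example.

```r
# Hypothetical inputs: a plain word, the word wrapped in literal angle
# brackets, and a word that merely contains "the".
x <- c("the cat", "<the> cat", "other")

# Default engine (TRE): \< and \> anchor at the start and end of a word.
tre_hits <- grepl("\\<the\\>", x)

# PCRE: \< and \> are read as the literal characters "<" and ">".
pcre_hits <- grepl("\\<the\\>", x, perl = TRUE)

# The usual PCRE spelling of a word boundary is \b.
pcre_equiv <- grepl("\\bthe\\b", x, perl = TRUE)

print(rbind(tre_hits, pcre_hits, pcre_equiv))
```

So code that relies on `\<` and `\>` would need its patterns rewritten with `\b` before switching to `perl=TRUE`.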

Ricardo Saporta
  • This helps. But I'd like to know specifically, if there are cases where `perl=TRUE` breaks. – Arun Jul 20 '13 at 01:49