When does setting 'perl=TRUE' in 'strsplit' not work (as intended or at all)?

Question

I just did some benchmarking while trying to optimise some code and observed that strsplit with perl=TRUE is faster than running strsplit with perl=FALSE. For example,

set.seed(1)
ff <- function() paste(sample(10), collapse= " ")
xx <- replicate(1e5, ff())

system.time(t1 <- strsplit(xx, "[ ]"))
#  user  system elapsed 
# 1.246   0.002   1.268 

system.time(t2 <- strsplit(xx, "[ ]", perl=TRUE))
#  user  system elapsed 
# 0.389   0.001   0.392 

identical(t1, t2) 
# [1] TRUE

So my question (or rather a variation of the question in the title) is: under what circumstances would we absolutely need perl=FALSE (leaving aside the fixed and useBytes parameters)? In other words, what can be done with perl=FALSE that cannot be done with perl=TRUE?

I think `perl=FALSE` as the default is just a design choice, since `perl=TRUE` is not yet implemented in some regex functions (like `regexec`). Maybe `perl=TRUE` will be the default in future R versions. — agstudy, Jul 20 '13 at 01:40
@agstudy The benefits would have to be _extremely_ compelling to R Core because changing a default like that would break a _ton_ of existing code. — joran, Jul 20 '13 at 01:45
@joran Good point. But if the benefit is proven in all cases, why not add some sort of "good warning" or "global setting" to encourage the use of PCRE... — agstudy, Jul 20 '13 at 01:59
@agstudy I have no strong feelings either way. I'm just saying, based on what I have observed on the r-devel mailing list, R Core is quite conservative when it comes to changes that break existing code. — joran, Jul 20 '13 at 02:03
Possible duplicate of [regular expressions in base R: 'perl=TRUE' vs. the default (PCRE vs. TRE)](https://stackoverflow.com/questions/47240375/regular-expressions-in-base-r-perl-true-vs-the-default-pcre-vs-tre) — moodymudskipper, Jun 15 '18 at 16:53

score 2 · Answer 1 · answered Jul 20 '13 at 01:48

From the documentation ;)

Performance considerations

If you are doing a lot of regular expression matching, including on very long strings, you will want to consider the options used. Generally PCRE will be faster than the default regular expression engine, and fixed = TRUE faster still (especially when each pattern is matched only a few times).
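To see what the quoted passage means for the data in the question, here is a minimal sketch that runs the same split with all three options. It assumes the separator is a literal single space, which is the only case where `fixed = TRUE` is a drop-in replacement. The sample size is reduced to 1e4 to keep it quick, and timings will vary by machine and R build, so none are shown.

```r
# Same data generator as in the question, with a smaller sample.
set.seed(1)
ff <- function() paste(sample(10), collapse = " ")
xx <- replicate(1e4, ff())

# Split on a literal space with each option.
s_tre   <- strsplit(xx, " ")                # default regex engine
s_pcre  <- strsplit(xx, " ", perl = TRUE)   # PCRE
s_fixed <- strsplit(xx, " ", fixed = TRUE)  # no regex at all

# For a literal separator, all three must agree.
stopifnot(identical(s_tre, s_pcre), identical(s_tre, s_fixed))

# Compare timings on your own machine.
system.time(strsplit(xx, " "))
system.time(strsplit(xx, " ", perl = TRUE))
system.time(strsplit(xx, " ", fixed = TRUE))
```

If the separator is a plain string rather than a pattern, `fixed = TRUE` sidesteps the question of which regex engine to use.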

Of course, this does not answer the question of "are there any dangers to always using perl=TRUE"
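One way to probe that question for your own patterns is to run both engines side by side and compare. The sketch below is only an illustration: `compare_engines` is a hypothetical helper, not part of base R, and the patterns are examples I picked. It captures errors instead of raising them, because a pattern written in one engine's syntax may be rejected by the other.

```r
# Hypothetical helper (not part of base R): split `x` with both engines
# and report whether the results agree. Errors are captured as messages.
compare_engines <- function(x, pattern) {
  run <- function(perl) {
    tryCatch(strsplit(x, pattern, perl = perl),
             error = function(e) conditionMessage(e))
  }
  tre  <- run(FALSE)
  pcre <- run(TRUE)
  list(pattern = pattern, same = identical(tre, pcre), tre = tre, pcre = pcre)
}

x <- c("1 2 3", "foo bar", "a  b", "a b 1 c")

# Example patterns: the last one uses Perl-only lookahead syntax, which
# the default engine is expected to reject.
patterns <- c("[ ]", "[[:space:]]+", " (?=[0-9])")
agree <- vapply(patterns, function(p) compare_engines(x, p)$same, logical(1))
print(agree)
```

Any pattern reported as `FALSE` is one where switching the `perl` argument changes the result (or produces an error in one engine), so it deserves a closer look before changing existing code.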

answered Jul 20 '13 at 01:48

Ricardo Saporta

This helps. But I'd like to know specifically, if there are cases where `perl=TRUE` breaks. – Arun Jul 20 '13 at 01:49
