Jeffrey Friedl lists 3 main types of regex engines in his book "Mastering Regular Expressions":
- Traditional NFA
- POSIX NFA
- DFA (POSIX or not)
Which of these does R use as a standard?
Jeffrey Friedl lists 3 main types of regex engines in his book "Mastering Regular Expressions":
Which of these does R use as a standard?
The ?regex
page cites the TRE documentation. Near the top of the grep.c
source we see:
/* As from TRE 0.8.0, tre.h replaces regex.h */
#include <tre/tre.h>
And copying my earlier comment: http://swtch.com/~rsc/regexp says TRE uses NFA. Then PCRE is used for perl=TRUE
.
My understanding (But I have not found this in official documents) is that the R regex functions by default use the tcl regex library which is a hybrid of DFA and NFA.
The engine will first scan the regexp for any non-DFA compatible pieces and extract parts that are DFA (so strips out back references and other things that are only available in NFA). It then tries to find a match to this (possibly) simplified pattern using a DFA engine. If it cannot find a match then the full regex will not match and it returns with a failure. If it finds a match then it goes back and matches the full regex using an NFA engine (I think traditional/non-posix), but starting at the location where the simplified match occurred. This is much faster (for both non-matches and matches) than a straight NFA engine, but still lets you use all the things in an NFA that a DFA does not support.
If you specify perl=TRUE
in any function then it switches to the pcre library which is most like a traditional NFA (though I understand that it is not F, A, or traditional).
Take at look at ?regex
for all the gory details (HTML version). There are several options.
To quote one portion:
This section covers the regular expressions allowed in the default mode of grep, regexpr, gregexpr, sub, gsub and strsplit. They use an implementation of the POSIX 1003.2 standard: that allows some scope for interpretation and the interpretations here are those currently used by R. The implementation supports some extensions to the standard.
Also check further down in the help page about "Perl-like expressions", which use the PCRE engine.