6

Jeffrey Friedl lists 3 main types of regex engines in his book "Mastering Regular Expressions":

  • Traditional NFA
  • POSIX NFA
  • DFA (POSIX or not)

Which of these does R use as a standard?

histelheim
  • 4,938
  • 6
  • 33
  • 63

3 Answers3

10

The ?regex page cites the TRE documentation. Near the top of the grep.c source we see:

/* As from TRE 0.8.0, tre.h replaces regex.h */
#include <tre/tre.h>

And copying my earlier comment: http://swtch.com/~rsc/regexp says TRE uses NFA. Then PCRE is used for perl=TRUE.

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • And `stringr` uses [ICU regex flavor](https://unicode-org.github.io/icu/userguide/strings/regexp.html). – Wiktor Stribiżew Sep 07 '22 at 16:07
  • @WiktorStribiżew (My hazy memories, so corrections solicited.) In the beginning `stringr` was written in base R as a student project under hadley's tutelage, but its performance suffered in comparison with `stringi` so the authors remade `stringr` as a wrapper for `stringi`. – IRTFM Sep 07 '22 at 20:09
  • I'm not sure where that memory of `stringr` being a student project came from, but reviewing the DESCRIPTION files of the package's Archive, it seems that the switch to `stringi` code was around 2015 with the release of ver 1.0.0 – IRTFM Sep 07 '22 at 20:21
2

My understanding (But I have not found this in official documents) is that the R regex functions by default use the tcl regex library which is a hybrid of DFA and NFA.

The engine will first scan the regexp for any non-DFA compatible pieces and extract parts that are DFA (so strips out back references and other things that are only available in NFA). It then tries to find a match to this (possibly) simplified pattern using a DFA engine. If it cannot find a match then the full regex will not match and it returns with a failure. If it finds a match then it goes back and matches the full regex using an NFA engine (I think traditional/non-posix), but starting at the location where the simplified match occurred. This is much faster (for both non-matches and matches) than a straight NFA engine, but still lets you use all the things in an NFA that a DFA does not support.

If you specify perl=TRUE in any function then it switches to the pcre library which is most like a traditional NFA (though I understand that it is not F, A, or traditional).

Greg Snow
  • 48,497
  • 6
  • 83
  • 110
1

Take at look at ?regex for all the gory details (HTML version). There are several options.

To quote one portion:

This section covers the regular expressions allowed in the default mode of grep, regexpr, gregexpr, sub, gsub and strsplit. They use an implementation of the POSIX 1003.2 standard: that allows some scope for interpretation and the interpretations here are those currently used by R. The implementation supports some extensions to the standard.

Also check further down in the help page about "Perl-like expressions", which use the PCRE engine.

Lucas Trzesniewski
  • 50,214
  • 11
  • 107
  • 158
Bryan Hanson
  • 6,055
  • 4
  • 41
  • 78
  • This still does not answer the question. I understand it's POSIX, but is it NFA or DFA? – histelheim Jun 24 '15 at 19:46
  • 3
    @histelheim as backreferences are supported in POSIX mode, that'll be a NFA engine. – Lucas Trzesniewski Jun 24 '15 at 19:49
  • I don't know, and those terms do not appear in the R documentation or packages. `library("sos"); findFn("DFA"); findFn("NFA")` return only trivial, unrelated answers. Maybe someone else will know. – Bryan Hanson Jun 24 '15 at 19:50
  • @Bryan that's regex terminology that has nothing to do with R. NFA = *Nondeterministic Finite Automaton*, DFA = *Deterministic Finite Automaton*. Much too broad to explain it here. – Lucas Trzesniewski Jun 24 '15 at 19:53
  • 1
    @joran yes, I've explained it myself [here](http://stackoverflow.com/a/30416338/3764814), but there's way more to that to cover it all in depth. – Lucas Trzesniewski Jun 24 '15 at 19:57
  • 2
    Well, I'm sorry to learn that there is yet another layer of complexity to the whole regex game... – Bryan Hanson Jun 24 '15 at 20:06