15

This Github repository added std::regex to the list of regular expression engines and got decimated by the others.

Why is that std::regex - as implemented in libstdc++ - so much slower than others? Is that because of the C++ standard requirements or it is just that that particular implementation is not very well optimized?

Also in the shootout std::regex was unable to compile several regular expressions that all the others accepted, even after adding the flag std::regex::extended. They were (?i)Twain, \b\w+nn\b, (?i)Tom|Sawyer|Huckleberry|Finn, \s[a-zA-Z]{0,12}ing\s, ([A-Za-z]awyer|[A-Za-z]inn)\s and \p{Sm}.

UPDATE: Added comparison with boost::regex.

UPDATE2: added ctre

enter image description here

  • 1
    Most likely due to generality and support for each target architecture by the standard implementation and/or carelessness. Try to upgrade your compiler or toolchain to maybe see improvements. – BullyWiiPlaza Jan 04 '22 at 18:31
  • @BullyWiiPlaza I ran that on Ubuntu 20.04 g++ 9.3.0 which is about 1 year old. I don't see any major changes in the changelogs since than relative to std::regex. https://gcc.gnu.org/gcc-11/changes.html and https://gcc.gnu.org/gcc-10/changes.html –  Jan 04 '22 at 18:38
  • 3
    Let's just say that there are no performance requirements for the `std::regex`. The `std::regex` requirements are more along the lines of correct operation than speed. – Thomas Matthews Jan 04 '22 at 18:39
  • In the shootout several expressions threw exceptions (not able to compile) as `(?i)Twain`, `\b\w+nn\b`, `(?i)Tom|Sawyer|Huckleberry|Finn`, `\s[a-zA-Z]{0,12}ing\s`, `([A-Za-z]awyer|[A-Za-z]inn)\s` and `\p{Sm}`. Should they? –  Jan 04 '22 at 18:42
  • 2
    Anything with (?i) is going to fail, because it's not part of the extended posix syntax. I'm not sure about the {Sm}. You should look at the exception being thrown during the construction of the regex object, which couiild give you more info. – Dave S Jan 04 '22 at 19:09
  • 1
    Another possibility is that it is optimised for other use cases. You can parse regex with DFAs or NFAs, the construction of a DFA takes more time, but there are regular expressions which can be paraes significantly faster with a DFA. You should try such a special case as an additional test case. – gerum Jan 04 '22 at 21:41
  • 1
    Is this benchmark testing against a lot or a little text? i.e. how much matching is regex construction amortized over? And I assume you enabled full optimization, like `g++ -O3`. – Peter Cordes Jan 04 '22 at 22:47
  • 1
    @PeterCordes The test file is 16MiB with 300k lines and 2.7M words. This fits entirely on the machine's L3 cache. Yes `-O3` in place. The construction of the regex is not included in the benchmark, only the matching part. If you look at the tests, it's a microbenchmark. The regex object is constructed outside the loop. –  Jan 04 '22 at 23:14
  • @gerum Do you have any suggestion in mind? –  Jan 04 '22 at 23:16
  • @ThomasMatthews I'm surprised that anything in the STL can be built without performance in mind. C++ is supposed to be a performance-centered language. –  Jan 04 '22 at 23:17
  • @PeterCordes The test file is also checked in with the code: https://github.com/HFTrader/regex-performance/blob/master/3200.txt –  Jan 04 '22 at 23:32
  • 1
    @jellyboy I looked it up, it was the unix regular expression "(.*?,){13}z" which a full match of strings of the type "1,2,3,4,5,6,7,8,9,10,11,12(,1)^i" which numbers of repetitions for the trailing ones. But according to my document C++ belong to the slow implementations, not the fast ones(like egrep). – gerum Jan 05 '22 at 07:46
  • @gerum Boost threw an exception, std::regex took 16 SECONDS, re2 and rust about 20 milliseconds. –  Jan 05 '22 at 08:49

1 Answers1

17

Is that because of the C++ standard requirements or it is just that that particular implementation is not very well optimized?

The answer is yes. Kinda.

There is no question that libstdc++'s implementation of <regex> is not well optimized. But there is more to it than that. It's not that the standard requirements inhibit optimizations so much as the standard requirements inhibit changes.

The regex library is defined through a bunch of templates. This allows people to choose between char and wchar_t, which is in theory good. But there's a catch.

Template libraries are used by copy-and-pasting the code directly into the code compiling against those libraries. Because of how templates get included, even types that nobody outside of the template library knows about are effectively part of the library's ABI. If you change them, two libraries compiled against different versions of the standard library cannot work with each other. And because the template parameter for regex is its character type, those implementation details touch basically everything about the implementation.

The minute libstdc++ (and other standard library implementations) started shipping an implementation of C++ regular expressions, they bound themselves to a specific implementation that could not be changed in a way that impacted the ABI of the library. And while they could cause another ABI break to fix it, standard library implementers don't like breaking ABI because people don't upgrade to standard libraries that break their code.

When C++11 forbade basic_string copy-on-write implementations, libstdc++ had an ABI problem. Their COW string was widely used, and changing it would make code that compiled against the new one break when used with code compiled against the old one. It took years before libstdc++ bit the bullet and actually implemented C++11 strings.

If Regex had been defined without templates, implementations could use traditional mechanisms to hide implementation details. The ABI for the interface to external code could be fixed and unchanging, with only the implementation of the functions behind that ABI changing from version to version.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Nicol Bolas
  • 449,505
  • 63
  • 781
  • 982