
Hi, I would like to understand why the following code, which splits a string using a regex:

#include <regex>
#include <vector>
#include <string>

std::vector<std::string> split(const std::string &s){
    static const std::regex rsplit(" +");
    auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1);
    auto rend = std::sregex_token_iterator();
    auto res = std::vector<std::string>(rit, rend);
    return res;
}

int main(){
    for(auto i = 0; i < 10000; ++i)
       split("a b c");
    return 0;
}

is slower than the following Python code:

import re
for i in range(10000):
    re.split(' +', 'a b c')

Here are the timings:

> python test.py  0.05s user 0.01s system 94% cpu 0.070 total
./test  0.26s user 0.00s system 99% cpu 0.296 total

I'm using clang++ on OS X.

Compiling with -O3 brings it down to 0.09s user 0.00s system 99% cpu 0.109 total.

oblitum
jassinm
    Are you running a debug build? When using templates, make sure you have opts on and debug off; there are a lot of safety checks that end up in your code otherwise. – ssube Jan 07 '13 at 22:24
  • no just clang++ -o test.o -c -std=c++11 -stdlib=libc++ -Wall -O3 test.cpp – jassinm Jan 07 '13 at 22:26
  • Well, Python is awesome so this is to be expected. – arshajii Jan 07 '13 at 22:27
    They don't do the same thing. For example, the C++ code does string concatenation and the Python doesn't. – interjay Jan 07 '13 at 22:28
  • I think you spend a lot of time creating temporary strings in C++. And possibly `tosplit + '+'` could be faster than `tosplit + "+"`. – Bo Persson Jan 07 '13 at 22:29
  • @interjay good point but this did not make a difference. I updated my question – jassinm Jan 07 '13 at 22:30
    The regex in the case of Python may be compiled/optimized just once. The C++ regex library will build and optimize the regex again and again. Just for the record, try to define the `rsplit` regex as a static constant. In the case of Python, the re library can work with the compiler maintaining a list of optimized regexes. – Diego Sevilla Jan 07 '13 at 22:31
    You're potentially making a huge number of regular expressions in the C++ version. What happens if you move the declaration of `rsplit` outside of `split`? – ssube Jan 07 '13 at 22:32
  • making it static or moving it outside goes down to 0.09 using -O3 – jassinm Jan 07 '13 at 22:34
    Ok, so one optimization netted you about 20%. Now find the next one. :) Remember that Python (the interpreter) has already implemented most such optimizations. – DavidO Jan 07 '13 at 22:36
  • Also, increase the number of loop executions so that timing is more consistent. – Diego Sevilla Jan 07 '13 at 22:37
  • @DiegoSevilla increasing to 100000: python 0.31s vs C++11 -O4 0.9s – jassinm Jan 07 '13 at 22:41
    very related: http://stackoverflow.com/questions/9378500/why-is-splitting-a-string-slower-in-c-than-python – Nate Kohl Jan 07 '13 at 23:03
    This is why people use python for tasks like this: it relieves the programmer of the need to enter into these very technical analyses of what impacts performance. – Marcin Jan 07 '13 at 23:34
  • @NateKohl - Nice answer. Additionally, I'd guess that all of the memory management (switching between Iterator and vector). Try returning `rit` immediately, and see if you don't see a massive performance boost. – FrankieTheKneeMan Jan 07 '13 at 23:47
  • @FrankieTheKneeMan: I did and performance boost is massive, but python is dumping all into a list. – jassinm Jan 07 '13 at 23:53
    Of course it is, because that's the pythonical way to deal with that object. But C++ (or rather, whoever wrote this library) doesn't see the world in the same way as Python, and as such, only gives you the option to use an iterator. I bet you could write a fairly efficient implementation in C++ yourself, but you've specifically chosen an inefficient way to implement the solution. Efficiency is a game of inches, and while I'm sure python only iterates over the string once while matching - you're doing it at least twice, if not more, depending on how s.end() is implemented. – FrankieTheKneeMan Jan 08 '13 at 00:01
  • Which CRT/heap are you using? Python has an optimised small-block heap, so allocating and more importantly deallocating a string "a" does not hurt nearly so much as in a C++ library without such a thing. I suspect this is the source of many "why is faster than C++" problems. – Ben Jan 08 '13 at 00:16
  • @Ben: strings shorter than 16 characters or so are allocated on the stack in all STL implementations I've seen – Viktor Sehr Jan 08 '13 at 01:08
    I can approximately reproduce your results, and simply replacing libc++'s std::regex with boost::regex makes the C++ version beat python by about 10-15%. I don't think libc++'s implementation is particularly efficient yet. – Cubbi Jan 08 '13 at 03:45
  • @Cubbi: thanks, will have a look at the implementations – jassinm Jan 08 '13 at 04:21
    It occurs to me that having a `vector res` actually allocated in `main()` and saying `res = split("a b c")` might give the optimiser more to work with (edit: or anyway another pattern to match) -- as it is there's no place for the code to put the returned `res` so since the compiler plainly isn't just eliding the entire program as the no-op it is, there are limits to the analysis it's doing. Have you tried this with a recent gcc? – jthill Jan 08 '13 at 06:15
  • @ViktorSehr: You have not seen the libstdc++ implementation (prior to C++11) then; it used COW and systematically required memory allocation. – Matthieu M. Jan 09 '13 at 19:01
  • There is a proposal in C++1y for `std::basic_string_ref`, a string-like interface with no ownership of the underlying memory. I hope they will make it compatible with this code! – Matthieu M. Jan 09 '13 at 19:35

4 Answers


Notice

See also this answer: https://stackoverflow.com/a/21708215, which was the basis for the EDIT 2 at the bottom of this answer.


I've increased the loop count to 1,000,000 to get a better timing measure.

This is my Python timing:

real    0m2.038s
user    0m2.009s
sys     0m0.024s

Here's an equivalent of your code, just a bit prettier:

#include <regex>
#include <vector>
#include <string>

std::vector<std::string> split(const std::string &s, const std::regex &r)
{
    return {
        std::sregex_token_iterator(s.begin(), s.end(), r, -1),
        std::sregex_token_iterator()
    };
}

int main()
{
    const std::regex r(" +");
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r);
    return 0;
}

Timing:

real    0m5.786s
user    0m5.779s
sys     0m0.005s

This is an optimization to avoid repeated construction/allocation of the vector and its string objects:

#include <regex>
#include <vector>
#include <string>

void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
{
    auto rit = std::sregex_token_iterator(s.begin(), s.end(), r, -1);
    auto rend = std::sregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.push_back(*rit);
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<std::string> v;
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r, v);
    return 0;
}

Timing:

real    0m3.034s
user    0m3.029s
sys     0m0.004s

This is nearly a 100% performance improvement.

The vector is created before the loop and can grow its memory during the first iteration. Afterwards, clear() performs no memory deallocation: the vector keeps its storage and constructs strings in place.


Another performance increase would come from avoiding construction/destruction of std::string objects entirely, and hence the allocation/deallocation of their buffers.

This is a tentative step in that direction:

#include <regex>
#include <vector>
#include <string>
#include <cstring>

void split(const char *s, const std::regex &r, std::vector<std::string> &v)
{
    auto rit = std::cregex_token_iterator(s, s + std::strlen(s), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.push_back(*rit);
        ++rit;
    }
}

Timing:

real    0m2.509s
user    0m2.503s
sys     0m0.004s

An ultimate improvement would be to return a std::vector of const char *, where each char pointer would point to a substring inside the original s c string itself. The problem is that you can't do that, because each of them would not be null-terminated (for a solution, see the usage of C++1y string_ref in a later sample).


This last improvement could also be achieved with this:

#include <regex>
#include <vector>
#include <string>

void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
{
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.push_back(*rit);
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<std::string> v;
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r, v); // the constant string("a b c") should be optimized
                             // by the compiler. I got the same performance as
                             // if it was an object outside the loop
    return 0;
}

I've built the samples with clang 3.3 (from trunk) with -O3. Maybe other regex libraries are able to perform better, but in any case, allocations/deallocations are frequently a performance hit.


Boost.Regex

This is the boost::regex timing for the c string arguments sample:

real    0m1.284s
user    0m1.278s
sys     0m0.005s

The code is the same; the boost::regex and std::regex interfaces in this sample are identical, so only the namespace and include needed to change.

Best wishes for it to get better over time; C++ stdlib regex implementations are still in their infancy.

EDIT

For the sake of completeness, I've tried this (the above-mentioned "ultimate improvement" suggestion) and it didn't improve the performance of the equivalent std::vector<std::string> &v version at all:

#include <regex>
#include <vector>
#include <string>

template<typename Iterator> class intrusive_substring
{
private:
    Iterator begin_, end_;

public:
    intrusive_substring(Iterator begin, Iterator end) : begin_(begin), end_(end) {}

    Iterator begin() {return begin_;}
    Iterator end() {return end_;}
};

using intrusive_char_substring = intrusive_substring<const char *>;

void split(const std::string &s, const std::regex &r, std::vector<intrusive_char_substring> &v)
{
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear(); // This can potentially be optimized away by the compiler because
               // the intrusive_char_substring destructor does nothing, so
               // resetting the internal size is the only thing to be done.
               // Formerly allocated memory is maintained.
    while(rit != rend)
    {
        v.emplace_back(rit->first, rit->second);
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<intrusive_char_substring> v;
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r, v);

    return 0;
}

This has to do with the array_ref and string_ref proposal. Here's a sample code using it:

#include <regex>
#include <vector>
#include <string>
#include <string_ref>

void split(const std::string &s, const std::regex &r, std::vector<std::string_ref> &v)
{
    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
    auto rend = std::cregex_token_iterator();
    v.clear();
    while(rit != rend)
    {
        v.emplace_back(rit->first, rit->length());
        ++rit;
    }
}

int main()
{
    const std::regex r(" +");
    std::vector<std::string_ref> v;
    for(auto i=0; i < 1000000; ++i)
       split("a b c", r, v);

    return 0;
}

It would also be cheaper to return a vector of string_ref rather than string copies in the split-with-vector-return case.

EDIT 2

This new solution is able to produce output by return value. I have used Marshall Clow's string_view (string_ref got renamed) libc++ implementation, found at https://github.com/mclow/string_view.

#include <string>
#include <string_view>
#include <boost/regex.hpp>
#include <boost/range/iterator_range.hpp>
#include <boost/iterator/transform_iterator.hpp>

using namespace std;
using namespace std::experimental;
using namespace boost;

string_view stringfier(const cregex_token_iterator::value_type &match) {
    return {match.first, static_cast<size_t>(match.length())};
}

using string_view_iterator =
    transform_iterator<decltype(&stringfier), cregex_token_iterator>;

iterator_range<string_view_iterator> split(string_view s, const regex &r) {
    return {
        string_view_iterator(
            cregex_token_iterator(s.begin(), s.end(), r, -1),
            stringfier
        ),
        string_view_iterator()
    };
}

int main() {
    const regex r(" +");
    for (size_t i = 0; i < 1000000; ++i) {
        split("a b c", r);
    }
}

Timing:

real    0m0.385s
user    0m0.385s
sys     0m0.000s

Note how much faster this is compared to the previous results. Of course, it's not filling a vector inside the loop (nor, probably, matching anything in advance), but you get a range anyway, which you can iterate with a range-based for, or even use to fill a vector.

Since iterating over the iterator_range creates string_views over the original string (or a null-terminated string), this stays very lightweight, never generating unnecessary string allocations.

Just for comparison, using this split implementation but actually filling a vector, we could do this:

int main() {
    const regex r(" +");
    vector<string_view> v;
    v.reserve(10);
    for (size_t i = 0; i < 1000000; ++i) {
        copy(split("a b c", r), back_inserter(v));
        v.clear();
    }
}

This uses the Boost.Range copy algorithm to fill the vector in each iteration; the timing is:

real    0m1.002s
user    0m0.997s
sys     0m0.004s

As can be seen, there's not much difference compared with the optimized string_view output-parameter version.

Note also that there's a proposal for a std::split that would work like this.

oblitum
  • One more thing to try: `static const string s("a b c");` and `split(s,r,v)`. – jthill Jan 09 '13 at 08:33
  • @jthill I guess it would improve an std::string argument version, but static is not necessary; just being declared outside the loop would do, instead of the previous construction/destruction from a c string. – oblitum Jan 09 '13 at 08:39
  • Why did you convert the `std::vector` from the return result to a non-const ref parameter? Aren't RVO and move semantics supposed to deem this unnecessary? Did you measure the improvement? Please add this to the answer if you did measure. – ulidtko Jan 14 '13 at 18:19
  • @ulidtko With C++11, I think, `return std::move(some_vector)` should be used instead of expecting RVO. And future compilers should do RVO only when Right-Value-Reference is `return`-ed. – RnMss Jan 09 '14 at 13:02
    @RnMss there's no need for `return std::move(some_vector)` when `some_vector` is a xvalue. I suggest you to look for this keyword on SO. There's no relying on RVO/NRVO. – oblitum Jan 09 '14 at 13:13
  • @RnMss you are wrong. See slide 25 of STL's ["Don't help the compiler"](http://channel9.msdn.com/Events/GoingNative/2013/Don-t-Help-the-Compiler) (or 38:00 and onwards in the video). TL;DR: that breaks NRVO. – ulidtko Jan 09 '14 at 15:34
  • @ulidtko I was not trying to help the compiler, I was trying to not rely on helping the compiler. But it doesn't matter anymore for the "x-value" things. – RnMss Jan 30 '14 at 11:39
    You forgot to add `std::regex::optimize` to the regex ctor. That would make the regex use a deterministic FSA. – bit2shift Jun 15 '16 at 13:32
  • @bit2shift thanks. I had no idea this existed at the time, first time I'm learning of it. Not sure whether it was available at the time for libc++. – oblitum Jun 15 '16 at 16:07
  • It's available since C++11, you can [see it here](http://en.cppreference.com/w/cpp/regex/syntax_option_type). – bit2shift Jun 16 '16 at 08:10
  • @bit2shift I mean in the implementations. As you may know, for example, the entire regex header was missing from libstdc++ for a long time. – oblitum Jun 16 '16 at 14:17
  • As far as I know, it has been present in GCC since 5.2, maybe since 4.9. – bit2shift Jun 16 '16 at 15:12
    Please add a summary to the top of your answer, it's pretty hard now on TL;DR folks :) – Dima Tisnek Jul 26 '16 at 09:26

For optimizations, in general, you want to avoid two things:

  • burning away CPU cycles for unnecessary stuff
  • waiting idly for something to happen (memory read, disk read, network read, ...)

The two can be antithetical, as sometimes it ends up being faster to compute something than to cache all of it in memory, so it's a game of balance.

Let's analyze your code:

std::vector<std::string> split(const std::string &s){
    static const std::regex rsplit(" +"); // only computed once

    // search for first occurrence of rsplit
    auto rit = std::sregex_token_iterator(s.begin(), s.end(), rsplit, -1);

    auto rend = std::sregex_token_iterator();

    // simultaneously:
    // - parses "s" from the second occurrence to the end
    // - allocates one `std::string` for each match... at least! (there may be a copy)
    // - allocates space in the `std::vector`, possibly multiple times
    auto res = std::vector<std::string>(rit, rend);

    return res;
}

Can we do better ? Well, if we could reuse existing storage instead of keeping allocating and deallocating memory, we should see a significant improvement [1]:

// Overwrites 'result' with the matches, returns the number of matches
// (note: 'result' is never shrunk, but may be grown as necessary)
size_t split(std::string const& s, std::vector<std::string>& result){
    static const std::regex rsplit(" +"); // only computed once

    auto rit = std::cregex_token_iterator(s.data(), s.data() + s.size(), rsplit, -1);
    auto rend = std::cregex_token_iterator();

    size_t pos = 0;

    // As long as possible, reuse the existing strings (in place)
    for (size_t max = result.size();
         rit != rend && pos != max;
         ++rit, ++pos)
    {
        result[pos].assign(rit->first, rit->second);
    }

    // When more matches than existing strings, extend capacity
    for (; rit != rend; ++rit, ++pos) {
        result.emplace_back(rit->first, rit->second);
    }

    return pos;
} // split

In the test that you perform, where the number of submatches is constant across iterations, this version is unlikely to be beaten: it will only allocate memory on the first run (both for rsplit and result) and then keep reusing existing memory.

[1]: Disclaimer: I have only proved this code correct, I have not tested it (as Donald Knuth would say).

Matthieu M.
    I've done a near exact same implementation, but omitted it because it doesn't improve anything for this sample. Got same performance as the push_back version... – oblitum Jan 09 '13 at 19:39
  • Also, looking it over: don't forget to resize the vector for the case of fewer matches than the initial vector size... hmmm, well, with a size_t return, that's not needed, but I find it somewhat cumbersome to use... – oblitum Jan 09 '13 at 19:52
    @chico: I agree with the `resize` thing, however the problem of downsizing is that you cause deallocation of the tail `std::string` which will cause reallocation down the road. Of course, a `string_ref` alternative would not suffer such woes. – Matthieu M. Jan 10 '13 at 07:36

How about this version? It uses no regex, but it solves the split pretty fast...

#include <vector>
#include <string>
#include <algorithm>

size_t split2(const std::string& s, std::vector<std::string>& result)
{
    size_t count = 0;
    result.clear();
    std::string::const_iterator p1 = s.cbegin();
    std::string::const_iterator p2 = p1;
    bool run = true;
    do
    {
        p2 = std::find(p1, s.cend(), ' ');
        result.push_back(std::string(p1, p2));
        ++count;

        if (p2 != s.cend())
        {
            p1 = std::find_if(p2, s.cend(), [](char c) -> bool
            {
                return c != ' ';
            });
        }
        else run = false;
    } while (run);
    return count;
}

int main()
{
    std::vector<std::string> v;
    std::string s = "a b c";
    for (auto i = 0; i < 100000; ++i)
        split2(s, v); 
    return 0;
}

$ time splittest.exe

real    0m0.132s
user    0m0.000s
sys     0m0.109s

schorsch_76

I would say C++11 regex is much slower than perl's, and possibly slower than python's.

To measure performance properly, it is better to run the tests with a non-trivial expression; otherwise you are measuring everything but the regex itself.

For example, to compare C++11 and perl:

C++11 test code

  #include <iostream>
  #include <string>
  #include <regex>
  #include <chrono>

  int main ()
  {
     int veces = 10000;
     int count = 0;
     std::regex expres ("([^-]*)-([^-]*)-(\\d\\d\\d:\\d\\d)---(.*)");

     std::string text ("some-random-text-453:10--- etc etc blah");
     std::smatch macha;

     auto start = std::chrono::system_clock::now();

     for (int ii = 0;  ii < veces; ii ++)
        count += std::regex_search (text, macha, expres);

     auto milli = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start).count();

     std::cout << count << "/" << veces << " matches " << milli << " ms --> " << (milli / (float) veces) << " ms per regex_search" << std::endl;
     return 0;
  }

On my computer, compiling with gcc 4.9.3, I get the output:

 10000/10000 matches 1704 ms --> 0.1704 ms per regex_search

perl test code

  use Time::HiRes qw/ time sleep /;

  my $veces = 1000000;
  my $conta = 0;
  my $phrase = "some-random-text-453:10--- etc etc blah";

  my $start = time;

  for (my $ii = 0; $ii < $veces; $ii++)
  {   
     if ($phrase =~ m/([^-]*)-([^-]*)-(\d\d\d:\d\d)---(.*)/)
     {
        $conta = $conta + 1;
     }
  }
  my $elapsed = (time - $start) * 1000.0;
  print $conta . "/" . $veces . " matches " . $elapsed . " ms --> " . ($elapsed / $veces) . " ms per regex\n";

Using perl v5.8.8, again on my computer:

  1000000/1000000 matches 2929.55303192139 ms --> 0.00292955303192139 ms per regex

So in this test the ratio C++11 / perl is

   0.1704 / 0.002929 = 58.17 times slower than perl

In real scenarios I get ratios of about 100 to 200 times slower. So, for example, parsing a big file with a million lines takes about a second with perl, while it can take several minutes(!) with a C++11 program using regex.

elxala
    I tried the same today (2019) with gcc 8.2 and perl 5.16, and got **1.8 µs per `regex_search` with C++** and **1.5 µs per `regex` with perl**. My point is that the performance is very implementation-dependent and, seemingly, implementation of regexes in **libstdc++** has considerably improved. When I switched to **boost.regex**, I got **0.5 µs per `regex_search` with C++**. That's the power of C++ — you don't get performance automatically, but you can control it. – Daniel Langr Oct 25 '19 at 05:43