11

Could it be that python's C regex implementation is 6 times faster or am I missing something ?

Python version:

import re
r=re.compile(r'(HELLO).+?(\d+)', re.I)
s=r"prefixdfadfadf adf adf adf adf he asdf dHello Regex 123"

%timeit r.search(s)

1000000 loops, best of 3: 1.3 µs per loop (769,000 per sec)

C++11 version:

#include<regex>
int main(int argc, char * argv[])
{
    std::string s = "prefixdfadfadf adf adf adf adf he asdf dHello Regex 123";
    std::regex my(R"((HELLO).+?(\d+))", regex_constants::icase);

    bench_utils::run(std::chrono::seconds(10),
        [&]{
        std::smatch match;
        bool found = std::regex_search(s, match, my);
    });       
    return 0;
}

Results in about ~125,000 searches/second

Edit: Here is the code for bench_utils:

namespace bench_utils
{
    template<typename T>    
    inline std::string formatNum(const T& value)
    {
            static std::locale loc("");
            std::stringstream ss;
            ss.imbue(loc);
            ss << value;
            return ss.str();
        }

    inline void run(const std::chrono::milliseconds &duration,
        const std::function<void() >& fn)
    {
        using namespace std::chrono;
        typedef steady_clock the_clock;
        size_t counter = 0;
        seconds printInterval(1);
        auto startTime = the_clock::now();
        auto lastPrintTime = startTime;
        while (true)
        {
            fn();
            counter++;
            auto now = the_clock::now();
            if (now - startTime >= duration)
                break;
            auto p = now - lastPrintTime;
            if (now - lastPrintTime >= printInterval)
            {
                std::cout << formatNum<size_t>(counter) << " ops per second" << std::endl;
                counter = 0;
                lastPrintTime = the_clock::now();
            }
        }
    }

}
jfs
  • 399,953
  • 195
  • 994
  • 1,670
GabiMe
  • 18,105
  • 28
  • 76
  • 113
  • 4
    Related: http://stackoverflow.com/questions/14205096/c11-regex-slower-than-python – devnull Feb 26 '14 at 07:23
  • Please always remember to read the helpful descriptions that appear when selecting tags! – Charles Feb 26 '14 at 08:26
  • 1
    How are we supposed to know what `bench_utils` does? It's not standard C++ and I don't see the include file for that either. – Ali Feb 26 '14 at 12:44
  • @Ali I added the bench_utils code to the OP – GabiMe Feb 26 '14 at 12:53
  • 1
    Have you tried a different regex library for C++? – Niklas B. Feb 26 '14 at 13:02
  • 1
    @GabiMe OK, now that I see your code, here are two things I certainly would *not* do in that infinite loop: (1) pass in the function under study as a heavy weight `std::function` and (2) call `the_clock::now();` in each iteration. I don't know how much it matters but I probably wouldn't create new `std::smatch match;` in each iteration. In any case, I would try to work harder to make sure that the Python code and the C++ code do the same thing (as much as possible). – Ali Feb 26 '14 at 13:04
  • @Ali (1) the std::function is passed only once. (2)the_clock::now() is very fast,- millions of times per second (3) About std::smatch match, I tried to put it out of the loop but it didn't affect the results – GabiMe Feb 26 '14 at 13:45
  • @GabiMe The primary problem with the `std::function` is not that it is being passed but that *calling* the underlying function can be expensive. Whether that matters in your application I don't know. "`the_clock::now()` is very fast, millions of times per second." Well, that can already make up, say, 10% of the time measured. I wouldn't measure unrelated things together with the thing I am interested in. You could give boost regex a shot, that is probably more mature than the implementation in VS. Sorry for nitpicking... – Ali Feb 26 '14 at 14:04
  • 1
    Have you tried other regex features, or not capturing, assertions, etc.? – quantum Apr 24 '14 at 21:46
  • A key thing to remember is that Python is interpreted, whereas C/C++ are compiled. Interpreting a regular expression performs far faster than compiling it. `re.compile()` does not actually *compile* anything, it merely creates a re object in Python so that a regular expression can be referred to time and time again without re-writing it. Take a peek at the `re.py` source. – signus May 07 '14 at 23:02
  • @Signus This has nothing to do with Python being compiled or interpreted. FYI, the RE implementation is entirely in C and `re.compile` does compile the RE to a kind of state machine. – Fred Foo May 18 '14 at 21:33
  • @larsmans Well C/C++ recompiles the regular expression every single time, whereas Python compiles it once and reuses the Python object over and over again (where the interpretation matters). Note: All regular expressions have to be Finite State Machines to be compiled. – signus May 18 '14 at 23:23
  • @Signus Except that the REs in Python are not true REs, and the things they compile to are not really finite state machines. – Fred Foo May 19 '14 at 06:57
  • Are you compiling in release mode? – n. m. could be an AI Jul 13 '14 at 05:39

2 Answers2

8

The first thing to note is that in Python, regex (whether using the re, or regex module) occurs 'at the speed of c', that is the actual heavy lifting code is cold hard c, and thus at least for longer strings the performance is going to depend on the c regexp implementation.

Sometimes python is pretty clever, python has no trouble performing in the vicinity of tens of millions of operations per second and it can create millions of objects per second - this is a thousand times slower than c, but if we're talking something that takes microseconds to begin with, the python overhead may not really matter, it will only add 0.1 microseconds to each function call.

So in this case the relative slowness of Python doesn't matter. It's fast enough in absolute terms that what matters is how fast the regular expression functions do their thing.

I rewrote the c++ case to be not subject to any criticisms (I hope, feel free to point out any), in fact it doesn't even need to create a match object as search simply returns a bool (true/false):

#include <regex>
#include <iostream>

int main(int argc, char * argv[])
{
    std::string s = "prefixdfadfadf adf adf adf adf he asdf dHello Regex 123";
    std::regex my(R"((HELLO).+?(\d+))", std::regex_constants::icase);

    int matches = 0;
    for (int i = 0; i < 1000000; ++i)
        matches += std::regex_search(s, my);


    std::cout << matches  << std::endl;
    return 0;
}

I wrote a comparable python program (although python did create and return a match object) and my results were exactly the same as yours

c++   : 6.661s
python: 1.039s

I think the basic conclusion here is that Python's regex implementation simply thrashes the c++ standard library one.

It thrashes Go too

A while back just for fun I compared Python's regex performance with Go's regex performance. And python was at least twice as fast.

The conclusion is that python's regexp implementation is very good and you should certainly not look outside Python to get improved regexp performance. The work regular expression do is fundamentally time consuming enough that Python's overhead doesn't really matter in the slightest and Python's got a great implementation (and the new regex module is often even faster than re).

Blake Walsh
  • 1,501
  • 14
  • 10
1

Using timeit to do benchmarks is wrong since it gives you best of 3 and not a statistical difference test.

It's your code, not the language.

  1. Passing the function as a std::function will make the C++ code slower;
  2. Calling clock functions in every iterations;
  3. Creating new objects, such as the std::smatch match; in each iteration;
  4. The run function;
  5. Not precompiling the regex.

I also wonder what optimization you are running with.

The run() function is doing too much. Fix that. :)

Unihedron
  • 10,902
  • 13
  • 62
  • 72
Good Person
  • 1,437
  • 2
  • 23
  • 42
  • 4
    The 1. is an invalid argument. Python has much more overhead for every operation, so *if* it was a significant argument then it would a *negative* argument for C++. Even with that *tiny* overhead C++ **ought** to be faster since it avoids tons of other overhead that the python version has. Also 3. is invalid since python *is* creating new match objects for every call to `search`, and C++ ought to be faster in that too. The actual important points are 2 and 5, especially 5 I believe. – Bakuriu Oct 01 '14 at 08:54