10

I compared with Linux C regex library,

#include <iostream>
#include <chrono>
#include <regex.h>

int main()
{
    const int count = 100000;

    regex_t exp;
    int rv = regcomp(&exp, R"_(([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?)_", REG_EXTENDED);
    if (rv != 0) {
            std::cout << "regcomp failed with " << rv << std::endl;
    }

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < count; i++)
    {
            regmatch_t match;
            const char *sz = "http://www.abc.com";

            if (regexec(&exp, sz, 1, &match, 0) == 0) {
    //              std::cout << sz << " matches characters " << match.rm_so << " - " << match.rm_eo << std::endl;
            } else {
    //              std::cout << sz << " does not match" << std::endl;
            }
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    std::cout << elapsed.count() << std::endl;

    return 0;
}

The result is roughly 60-70 milliseconds on my testing machine.

Then I used libc++'s library,

#include <iostream>
#include <chrono>
#include <regex>


int main()
{
        const int count = 100000;

        std::regex rgx(R"_(([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?)_", std::regex_constants::extended);
        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < count; i++)
        {
                std::cmatch match;
                const char sz[] = "http://www.abc.com";

                if (regex_search(sz, match, rgx)) {
                } else {
                }
        }
        auto end = std::chrono::high_resolution_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

        std::cout << "regex_search: " << elapsed.count() << std::endl;


        start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < count; i++)
        {
                const char sz[] = "http://www.abc.com";

                if (regex_match(sz, rgx)) {
                } else {
                }
        }
        end = std::chrono::high_resolution_clock::now();
        elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

        std::cout << "regex_match: " << elapsed.count() << std::endl;

        return 0;
}

The result is roughly 2 seconds for both regex_search & regex_match. This is about 30 times slower than C's regex.h library.

Is there anything wrong with my comparison? Is C++'s regex library not suitable for high performance case?

I can understand it's slow because there's no optimization in c++'s regex library yet, but 30 times slower is just too much.

Thanks.


Hi all,

Thanks for answering.

Sorry for my mistake I was using [] for C too but later I changed, and forgot to change C++ code.

I made two changes,

  1. I moved const char sz[] out of the loop for both C & C++.
  2. I compiled it with -O2 (I wasn't using any optimization before), C library's implementation is still around 60 milliseconds, but libc++'s regex now gives a number says, 1 second for regex_search, and 150 milliseconds for regex_match.

This is still a bit slow, but not as much as the original comparison.

user534498
  • 3,926
  • 5
  • 27
  • 52
  • Well, in the second, you're copying the array of characters every time unless it's optimized out (not likely with no optimizations). In the first, you merely point to something. You also have to construct `std::string`s for the calls in the second. – chris Jan 06 '14 at 03:24
  • Compiler version? Exact output? – Lightness Races in Orbit Jan 06 '14 at 03:27
  • 2
    last time I tried, libc++ regex implementation was indeed slow, check this question for comparison with boost::regex, and python: http://stackoverflow.com/questions/14205096 – oblitum Jan 06 '14 at 03:30
  • 5
    Did you use O2 or similar? There's little point profiling C++ code without optimisation, as it's typically written with the conscious decision to use extra layers of abstraction and encapsulation whose cost is only eliminated by optimisation. – Tony Delroy Jan 06 '14 at 03:37
  • Thanks all. I've made new comparison based on the comments. – user534498 Jan 06 '14 at 03:59
  • 1
    As in year 2020, I've tried libc++ and libstdc++ for the above code (compiled with clang++), libc++ takes about 300ms on my machine (windows 10 ubuntu bash), and libstdc++ takes about 60ms. So in terms of regular expression performance, using libstdc++ might be better than libc++ – user534498 Oct 13 '20 at 05:36

3 Answers3

14

If you take a look at http://llvm.org/svn/llvm-project/libcxx/trunk/include/regex you'll see this implementation of regex_match is layered atop regex_search, and all overloads extract sub-expression match positions even if only into local temporaries that are thrown away. regex_search uses a vector of __state objects that have .resize() called on them so are presumably vectors too - all heap allocations and unnecessary when the subexpression matches aren't wanted, but would need to be tracked to support \1 etc in perl-style extensions to regular expressions: the old regcomp/regexec C functions didn't provide those extended features never have to do this extra work. Of course it would be nice if the clang implementation checked the regular expression's need for tracking matches during compilation and called leaner, faster functions to match when possible, but I guess they're just starting with support for the general case.

Tony Delroy
  • 102,968
  • 15
  • 177
  • 252
11

The following two lines do not do the same thing!

const char  sz1[] = "http://www.abc.com";
const char* sz2   = "http://www.abc.com";

That's already enough to make it an unfair test.

Lightness Races in Orbit
  • 378,754
  • 76
  • 643
  • 1,055
2

sz and match are loop invariant, you should move them to before (in both cases for sz).

In the second case sz is an initialised array instead of a pointer to a constant literal - that is an unfair and unnecessary difference. That said, if you move the declaration to before the loop as suggested, that should make little or no difference.

Although regex_search() is overloaded for const const char* that may internally cause construction of a std::string, to avoid that possibility you should test it with:

const std::string sz( "http://www.abc.com" ) ;

(again before the loop).

So test:

std::cmatch match;
const char* = "http://www.abc.com";
for (int i = 0; i < count; i++)
{
    if (regex_search(sz, match, rgx)) {
    } else {
    }
}

and

std::cmatch match;
const std::string sz( "http://www.abc.com" )
for (int i = 0; i < count; i++)
{
    if (regex_search(sz, match, rgx)) {
    } else {
    }
}
Clifford
  • 88,407
  • 13
  • 85
  • 165
  • Hi, thanks. The first one is corrected. The second one (change from const char* to std::string) doesn't seem to have any effects on the result (for both regex_search and regex_match). – user534498 Jan 06 '14 at 03:58