4

Prompted by a comment from Konrad Rudolph on a related question, I wrote the following program to benchmark regular expression performance in F#:

open System.Text.RegularExpressions
let str = System.IO.File.ReadAllText "C:\\Users\\Jon\\Documents\\pg10.txt"
let re = System.IO.File.ReadAllText "C:\\Users\\Jon\\Documents\\re.txt"
for _ in 1..3 do
  let timer = System.Diagnostics.Stopwatch.StartNew()
  let re = Regex(re, RegexOptions.Compiled)
  let res = Array.Parallel.init 4 (fun _ -> re.Split str |> Seq.sumBy (fun m -> m.Length))
  printfn "%A %fs" res timer.Elapsed.TotalSeconds

and the equivalent in C++:

#include "stdafx.h"

#include <windows.h>
#include <regex>
#include <vector>
#include <string>
#include <fstream>
#include <cstdio>
#include <codecvt>

using namespace std;

wstring load(wstring filename) {
    const locale empty_locale = locale::empty();
    typedef codecvt_utf8<wchar_t> converter_type;
    const converter_type* converter = new converter_type;
    const locale utf8_locale = locale(empty_locale, converter);
    wifstream in(filename);
    wstring contents;
    if (in)
    {
        in.seekg(0, ios::end);
        contents.resize(in.tellg());
        in.seekg(0, ios::beg);
        in.read(&contents[0], contents.size());
        in.close();
    }
    return(contents);
}

int count(const wstring &re, const wstring &s){
    static const wregex rsplit(re);
    auto rit = wsregex_token_iterator(s.begin(), s.end(), rsplit, -1);
    auto rend = wsregex_token_iterator();
    int count=0;
    for (auto it=rit; it!=rend; ++it)
        count += it->length();
    return count;
}

int _tmain(int argc, _TCHAR* argv[])
{
    wstring str = load(L"pg10.txt");
    wstring re = load(L"re.txt");

    __int64 freq, tStart, tStop;
    unsigned long TimeDiff;
    QueryPerformanceFrequency((LARGE_INTEGER *)&freq);
    QueryPerformanceCounter((LARGE_INTEGER *)&tStart);

    vector<int> res(4);

#pragma omp parallel num_threads(4)
    for(auto i=0; i<res.size(); ++i)
        res[i] = count(re, str);

    QueryPerformanceCounter((LARGE_INTEGER *)&tStop);
    TimeDiff = (unsigned long)(((tStop - tStart) * 1000000) / freq);
    printf("(%d, %d, %d, %d) %fs\n", res[0], res[1], res[2], res[3], TimeDiff/1e6);
    return 0;
}

Both programs load two file as unicode strings (I'm using a copy of the Bible), construct a non-trivial unicode regex \w?\w?\w?\w?\w?\w and split the string four times in parallel using the regex returning the sum of the lengths of the split strings (in order to avoid allocation).

Running both in Visual Studio (with MP and OpenMP enabled for the C++) in release build targeting 64-bit, the C++ takes 43.5s and the F# takes 3.28s (over 13x faster). This does not surprise me because I believe .NET JIT compiles the regex to native code whereas the C++ stdlib interprets it but I'd like some peer review.

Is there a perf bug in my C++ code or is this a consequence of compiled vs interpreted regular expressions?

EDIT: Billy ONeal has pointed out that .NET can have a different interpretation of \w so I have made it explicit in a new regex:

[0-9A-Za-z_]?[0-9A-Za-z_]?[0-9A-Za-z_]?[0-9A-Za-z_]?[0-9A-Za-z_]?[0-9A-Za-z_]

This actually makes the .NET code substantially faster (C++ is the same), reducing the time from 3.28s to 2.38s for F# (over 17x faster).

Community
  • 1
  • 1
J D
  • 48,105
  • 13
  • 171
  • 274
  • 2
    If you want great regex performance, use Perl ;) – paulsm4 Nov 05 '13 at 20:50
  • It should also be noted that your C++ example does not handle Unicode the same way the .NET example does. For instance, `\w` matches only `[0-9A-Za-z_]` for `std::regex` (because that's what ECMAScript says), while the .NET example consults locale information to determine what to match. – Billy ONeal Nov 05 '13 at 20:59
  • Which toolset and stdlib are you using? – Mgetz Nov 05 '13 at 21:03
  • @Mgetz: Vanilla VS2010 Ultimate in both cases. – J D Nov 05 '13 at 21:03
  • It's not fair to make C++ do all regex processing multiple times, but do it only once in the C#. You aren't even including the C# regex creation time in your `StopWatch`. – Ben Voigt Nov 05 '13 at 21:05
  • @BenVoigt: Absolutely, good catch. I've moved the regex construction into the loop in the F# to make them comparable. The results are unchanged. – J D Nov 05 '13 at 21:07
  • So Konrad pointed you to Boost::Xpressive as a better-performing regex engine, and then you didn't use it? – Ben Voigt Nov 05 '13 at 21:10
  • @Jon: If there was no difference, than the F# compiler almost certainly hoist the regex out of the inner loop. – Billy ONeal Nov 05 '13 at 22:56
  • 1
    FWIW, this doesn’t surprise me either. Besides concerns about compilation, the regex library in C++ is simply atrocious – we routinely rage about it in the C++ chat. That said, to make the code more comparable I would remove the parallelisation … not that I expect much change. – Konrad Rudolph Nov 05 '13 at 23:08
  • 1
    @Ben To be fair, (the relevant part of) Boost.Xpressive is restricted to compile-time created regex. That’s probably not what Jon wanted to compare. – Konrad Rudolph Nov 05 '13 at 23:10
  • What happens if you remove the parallelization from both versions of the code? You might be paying for initialization costs of the OpenMP runtime within your C++ version, thus skewing the results. For kicks, it'd also be interesting to measure the performance of the .NET regex *without* `RegexOptions.Compiled`. – Jack P. Nov 06 '13 at 00:18
  • 1
    Your questions make me wanna learn functional programming.. – Nils Nov 11 '13 at 09:51

1 Answers1

11

These benchmarks aren't really comparable -- C++ and .NET implement completely different regular expression languages (ECMAScript vs. Perl), and are powered by completely different regular expression engines. .NET (to my understanding) is benefiting from the GRETA project here, which produced an absolutely fantastic regular expression engine which has been tuned for years. The C++ std::regex in comparison is a recent addition (at least on MSVC++, which I'm assuming you're using given the nonstandard types __int64 and friends).

You can see how GRETA did vs. a more mature std::regex implementation, boost::regex, here (though that test was done on Visual Studio 2003).

You also should keep in mind that regex performance is highly dependent on your source string and on your regex. Some regex engines spend lots of time parsing the regex to go faster through more source text; a tradeoff that makes sense only if you are parsing lots of text. Some regex engines trade off scanning speed for being relatively expensive to make matches (so number of matches would have an effect). There are huge numbers of tradeoffs here; one pair of inputs really is going to cloud the story.

So to answer your question more explicitly: this kind of variation is normal across regex engines, be they compiled or interpreted. Looking at boost's tests above, often the difference between the fastest and slowest implementations were hundreds of times different -- 17x isn't all that strange depending on your use case.

Nic
  • 6,211
  • 10
  • 46
  • 69
Billy ONeal
  • 104,103
  • 58
  • 317
  • 552
  • How do ECMAScript and Perl regexs interpret this particular regex differently? – J D Nov 05 '13 at 20:59
  • @Jon: See my comment on your question -- long story short, `\w` matches different things. – Billy ONeal Nov 05 '13 at 21:00
  • @Jon: Also, just because they don't treat this regex differently doesn't mean that there aren't tradeoffs in their design to support different features. – Billy ONeal Nov 05 '13 at 21:05
  • Do you think design tradeoffs explain the 17x performance difference here or is it compilation vs interpretation? – J D Nov 05 '13 at 21:11
  • @Jon: I think just being different engines explain the 17x performance difference. Boost's numbers comparing 5 different interpreted regex engines (GRETA, `boost::regex`, `boost::regex` with locales disabled, POSIX, and PCRE) show hundreds of times speed differences in many cases. – Billy ONeal Nov 05 '13 at 21:14
  • @Jon: You're also comparing an old, mature, and well tuned regex engine with a brand new regex engine. It isn't surprising that the older, well tuned engine wins. – Billy ONeal Nov 05 '13 at 21:17
  • "show hundreds of times speed differences in many cases". That only seems to be the case for "POSIX". Comparing GRETA and Boost I cannot see a single test case where the performance difference was as big as the one I have observed here. – J D Nov 05 '13 at 21:38
  • "You're also comparing an old, mature, and well tuned regex engine with a brand new regex engine. It isn't surprising that the older, well tuned engine wins". Then you would expect `` in VS2013 to be substantially better because it is 3 years more mature? – J D Nov 05 '13 at 21:39
  • @Jon: I'm not sure what changes have happened to VC's `std::regex` -- but it wouldn't surprise me either way. The person who maintains the regex library also had to essentially rewrite the entire STL in this last release to support variadic templates, so I'm not sure if it has gotten much love. You might want to compare with `boost::regex` (though again I'm not sure what difference you'll get). – Billy ONeal Nov 05 '13 at 22:56