4

Update 2: Actually it's the regex(".{40000}");. That alone already takes that much time. Why?


regex_match("", regex(".{40000}")); takes almost 8 seconds on my PC. Why? Am I doing something wrong? I'm using gcc 4.9.3 from MinGW on Windows 10 on an i7-6700.

Here's a full test program:

#include <iostream>
#include <regex>
#include <ctime>
using namespace std;

int main() {
    clock_t t = clock();
    regex_match("", regex(".{40000}"));
    cout << double(clock() - t) / CLOCKS_PER_SEC << endl;
}

How I compile and run it:

C:\Users\ ... \coding>g++ -std=c++11 test.cpp

C:\Users\ ... \coding>a.exe
7.643

Update: Looks like the time is quadratic in the given number. Doubling it roughly quadruples the time:

  10000   0.520 seconds (factor 1.000)
  20000   1.922 seconds (factor 3.696)
  40000   7.810 seconds (factor 4.063)
  80000  31.457 seconds (factor 4.028)
 160000 128.904 seconds (factor 4.098)
 320000 536.358 seconds (factor 4.161)

The code:

#include <regex>
#include <ctime>
using namespace std;

int main() {
    double prev = 0;
    for (int i=10000; ; i*=2) {
        clock_t t0 = clock();
        regex_match("", regex(".{" + to_string(i) + "}"));
        double t = double(clock() - t0) / CLOCKS_PER_SEC;
        printf("%7d %7.3f seconds (factor %.3f)\n", i, t, prev ? t / prev : 1);
        prev = t;
    }
}

Still no idea why. It's a very simple regex and the empty string (though it's the same with short non-empty strings). It should fail instantly. Is the regex engine just weird and bad?

Stefan Pochmann
  • 27,593
  • 8
  • 44
  • 107
  • What are you trying to accomplish? – Tim Biegeleisen Dec 10 '16 at 11:46
  • @TimBiegeleisen Does that really matter? I'm trying to check words like "internationalization" against abbreviations like "i12iz4n" by translating the latter to regexes like "i.{12}iz.{4}n". – Stefan Pochmann Dec 10 '16 at 11:49
  • 1
    Perhaps you need to get a newer version of g++. I just ran your program with gcc 6.2.0 and it finished in 118ms. – Waxrat Dec 10 '16 at 11:54
  • Do not allow processing empty strings, check them before passing to the `regex_match` method. BTW, on GCC 5.1, [it takes much less time](http://ideone.com/MZT017). – Wiktor Stribiżew Dec 10 '16 at 11:54
  • @Waxrat Looks like I already have the newest version that MinGW offers. Also, 118ms is still a lot. I'd expect it to be done instantly, as it has pretty much nothing to do. – Stefan Pochmann Dec 10 '16 at 12:14
  • @WiktorStribiżew It's not just empty strings, I just made the example minimal. Also, your 22ms is still a lot for pretty much not having to do anything. – Stefan Pochmann Dec 10 '16 at 12:16
  • @WiktorStribiżew I now tried my updated tester there. Looks like it takes linear time and it it causes a "regex_error" for 160000: http://ideone.com/XqnCSQ – Stefan Pochmann Dec 10 '16 at 12:46
  • On my machine your code runs instantly as expected. `$ c++ --version Apple LLVM version 8.0.0 (clang-800.0.42.1) Target: x86_64-apple-darwin16.1.0` – Iulian Dogariu Dec 10 '16 at 13:11
  • The parsing & construction of the regex `regex(".{40000}")` takes most of the time (74ms), and `regex_match` takes 0.6ms. – Waxrat Dec 10 '16 at 13:48
  • Just tested the second code on my pc (i7-4790K). The factor is constantly about 2. Using g++ 6.2.1 or clang++ 3.9 doesn't make much difference. Without optimization the example with i=80000 runs in about 0.1sec, with optimization in about 0.02sec – msrd0 Dec 10 '16 at 14:07
  • Using g++ 4.8 actually makes a huge difference - the i=80000 example runs in 0.000sec. The first one that needs 0.001sec is i=640000. Can't explain why – msrd0 Dec 10 '16 at 14:10
  • And g++ 4.6 is even quicker and the first one that needs 0.001sec is i=1280000 - really strange – msrd0 Dec 10 '16 at 14:12
  • @msrd0 That difference between your 4.6 and your 4.8 could maybe be just random variation. Do you happen to have 4.9 as well? I guess maybe they extended the regex capabilities in 4.9 but did it in a naive way, and then until 6.2.1 they made it faster... – Stefan Pochmann Dec 10 '16 at 14:20
  • @StefanPochmann that are the default values for both 4.6 and 4.8 after a lot of runs but the difference isn't really big. i can install 4.9 and test it but it will take some time to compile – msrd0 Dec 10 '16 at 14:25
  • @msrd0 Actually apparently [until 4.9, regex isn't supported](http://stackoverflow.com/a/12665408/1672429) but the compiler doesn't say so. – Stefan Pochmann Dec 10 '16 at 14:26
  • @msrd0 Thanks, but don't bother installing 4.9 unless you really want to yourself :-) – Stefan Pochmann Dec 10 '16 at 14:28
  • @StefanPochmann with gcc 4.9.3 I get the same results as you posted so probably you should update your compiler – msrd0 Dec 10 '16 at 14:45

2 Answers2

2

Because it want be fast...

It very possible that transform this regex to another representation (state machine or something else) that is easier and faster to run. C# allow even generating runtime code that represents regex.

In your case you probably hit some bug in that transformation that have O(n^2) complexity.

Yankes
  • 1,958
  • 19
  • 20
1

Measuring the construction and matching separately:

clock_t t1 = clock();
regex r(".{40000}");
clock_t t2 = clock();
regex_match("", r);
clock_t t3 = clock();

cout << double(t2 - t1) / CLOCKS_PER_SEC << '\n'
     << double(t3 - t2) / CLOCKS_PER_SEC << endl;

I see:

0.077336
0.000613
Waxrat
  • 2,075
  • 15
  • 13