I am trying to parse a build log file to extract some information using regular expressions. I am using a regular expression such as ("( {9}time)(.+)(c1xx\\.dll+)(.+)s") to match a line like time(D:\Program Files\Microsoft Visual Studio 11.0\VC\bin\c1xx.dll)=0.047s

This takes about 120 s to complete on a file that has about 19,000 lines, some of which are pretty long. The basic problem is that when I cut the number of lines to about 9,000 by skipping the long ones, it did not improve anything; it actually made it worse. I do not understand this. If I remove the regular expressions altogether, just scanning the file takes about 6 s, which means the regular expressions are the main time-consuming part. So why does the time not go down at least somewhat when I remove half of the lines?

Also, can anyone tell me which kind of regular expression is faster, a more generic one or a more specific one? For example, I can also match the line time(D:\Program Files\Microsoft Visual Studio 11.0\VC\bin\c1xx.dll)=0.047s uniquely in the file using the regex ("(.+)(c1xx.dll)(.+)"), but that makes the whole thing run even slower, whereas something like ("( {9}time)(.+)(c1xx\\.dll+)(.+)") makes it run slightly faster.

I am using the C++11 regex library, mostly the regex_match function.

regex c1xx("( {9}time)(.+)(c1xx\\.dll+)(.+)s");
auto start = system_clock::now();
int linecount = 0;
while (getline(inFile, currentLine))
{
    if (regex_match(currentLine.c_str(), c1xx))  // use the regex declared above
    {
        linecount++;
        // Do something, just insert it into a vector
    }
}

auto end = system_clock::now();
auto elapsed = duration_cast<milliseconds>(end - start);
cout << "Time taken for parsing first log = " << elapsed.count() << " ms" << " lines = " << linecount << endl;

Output:

Time taken for parsing first log = 119416 ms lines = 19617

regex c1xx("( {9}time)(.+)(c1xx\\.dll+)(.+)s");
auto start = system_clock::now();
int linecount = 0;
while (getline(inFile, currentLine))
{
    if (currentLine.size() > 200)
    {
        continue;
    }

    if (regex_match(currentLine.c_str(), c1xx))  // use the regex declared above
    {
        linecount++;
        // Do something, just insert it into a vector
    }
}

auto end = system_clock::now();
auto elapsed = duration_cast<milliseconds>(end - start);
cout << "Time taken for parsing first log = " << elapsed.count() << " ms" << " lines = " << linecount << endl;

Output:

Time taken for parsing first log = 131613 ms lines = 9216

Why is it taking more time in the second case?

ildjarn
ocwirk
  • Code would be appreciated a lot more than an English description. – Martin York Jun 21 '12 at 23:59
  • On the eighth day, God said "How am I ever going to find anything in all this crap? Let there be grep." And there *was* grep. And there was much rejoicing. – Peter Wone Jun 22 '12 at 00:26
  • There, that is pretty much all the code. I don't know how to put the second part of the question in code; I am just asking which type of regex would be faster, a generic one or a more specific, tighter one? – ocwirk Jun 22 '12 at 00:33
  • You are measuring release builds, not debug builds, correct? – ildjarn Jun 22 '12 at 00:40
  • No, I just measured debug builds. Is that going to have some effect? – ocwirk Jun 22 '12 at 07:11
  • Measuring performance without enabling optimizations is **100% pointless**. See the comments on [this answer](http://stackoverflow.com/q/10887668/636019) for further explanation. – ildjarn Jun 22 '12 at 15:47

1 Answer

So why does the time not go down at least somewhat when I remove half of the lines?

Why is it taking more time in the second case?

It is conceivable that the regex library is somehow able to reject non-matching lines more efficiently than your size check does. It is also possible that the additional branch in your while loop is interfering with the CPU's branch prediction, so you are not getting optimal instruction pipelining and prefetching.

Also, can anyone tell me which kind of regular expression is faster, a more generic one or a more specific one?

If the expression ("(.+)(c1xx.dll)(.+)") works, I believe (".+c1xx\\.dll.+") would also work, and without capture groups the regex engine won't bother saving sub-match positions for you.
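As a rough sketch of that suggestion (the helper names matches_whole_line and contains_c1xx are made up for illustration, not taken from your code), the pattern without parentheses can be compiled once and reused for every line. The second helper is an assumption on my part: if all you need to know is whether c1xx.dll appears anywhere in the line, regex_search with a shorter pattern avoids the leading and trailing .+ altogether. Note that it is slightly looser, since it does not require any characters before or after the file name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: whole-line match with no capture groups, so the
// engine has no sub-match positions to record.
bool matches_whole_line(const std::string& line) {
    // Compiled once on first call and reused afterwards.
    static const std::regex pattern(".+c1xx\\.dll.+");
    return std::regex_match(line, pattern);
}

// Hypothetical alternative: regex_search looks for the pattern anywhere
// in the line, so no leading or trailing ".+" is needed.
bool contains_c1xx(const std::string& line) {
    static const std::regex pattern("c1xx\\.dll");
    return std::regex_search(line, pattern);
}
```

You would call either helper on currentLine inside your getline loop in place of the regex_match call. Whether it is measurably faster depends on the library implementation, so time it in a release build.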

jxh
  • The first part looks correct, but in the second part, aren't the two regular expressions you have written exactly the same? I was actually asking whether a regex like "ab.+c1xx\\.dll.+cd" would be faster, slower, or no different compared to ".+c1xx\\.dll.+", given that both uniquely identify my string. Thanks for the answer. – ocwirk Jun 22 '12 at 07:15
  • ".+c1xx\\.dll.+" is definitely faster – dilip kumbham Jun 22 '12 at 11:15