I have a log file of about 400k lines. I found that my C++ code is very slow compared to the equivalent Perl code, so I wrote a simple loop over the log file that applies the same regex in both languages. The Perl script runs very fast, while the C++ version takes much longer.

In C++ I am using the #include<regex> library, whereas in Perl regexes can be used directly. How can I make the C++ code as efficient as the Perl code, given that Perl itself is implemented in C?

regex log_line("(\\d{1,2}\\/[A-Za-z]{3}\\/\\d{1,4}):(\\d{1,2}:\\d{1,2}:\\d{1,2}).*?\".*?[\\s]+(.*?)[\\s\?].*?\"[\\s]+([\\d]{3})[\\s]+(\\d+)[\\s]+\"(.*?)\"[\\s]+\"(.*?)\"[\\s]+(\\d+)");
string line;
int count =0;
smatch match;
while(getline(logFileHandle, line){
    if(regex_search(line , match , log_line)==true){
    count++
}


        open(LOG_FILE,"<$log_file_location");
        my $count=0;
        while($thisLine = <LOG_FILE>){
            if((($datePart, $time, $requestUrl, $status, $bytesDelivered, $httpReferer, $httpUserAgent, $requestReceived) = $thisLine =~ /(\d{1,2}\/[A-Za-z]{3}\/\d{1,4}):(\d{1,2}:\d{1,2}:\d{1,2}).*?\".*?[\s]+(.*?)[\s\?].*?\"[\s]+([\d]{3})[\s]+(\d+)[\s]+\"(.*?)\"[\s]+\"(.*?)\"[\s]+(\d+)/o) == 8){
                $count++;
            }
        }

If my question is not in the right format or something is missing, let me know. Thanks.

EDIT 1: I used the chrono library in C++ to measure the time taken; the output is below. To keep things simple I took a sample of the log file. Just reading the sample and counting its lines takes 57 ms. With regex_search added, the same sample takes a whopping 2462 ms.

No of Lines27399
With regex + logfileRead
Time taken by function: 2462 milliseconds
No of Lines27399
With just simple logfileRead
Time taken by function: 57 milliseconds
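
A minimal sketch of how such a measurement could be set up, assuming the log is read line by line from a seekable stream. The names `count_lines`, `count_matches`, `time_ms`, `Timing` and `benchmark` are made up for illustration; the timing code actually used for the numbers above is not shown in the question.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <istream>
#include <regex>
#include <sstream>  // std::istringstream, for trying this on in-memory samples
#include <string>

// Baseline: only read the lines and count them ("just simple logfileRead").
std::size_t count_lines(std::istream& in) {
    std::size_t n = 0;
    std::string line;
    while (std::getline(in, line)) ++n;
    return n;
}

// Read the lines and count those that contain a match for `re`.
std::size_t count_matches(std::istream& in, const std::regex& re) {
    std::size_t n = 0;
    std::string line;
    std::smatch match;
    while (std::getline(in, line)) {
        if (std::regex_search(line, match, re)) ++n;
    }
    return n;
}

// Run `work` once and return the elapsed wall-clock time in milliseconds.
template <typename F>
long long time_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}

// Hypothetical result record: counts and timings for both passes.
struct Timing {
    std::size_t lines;
    std::size_t matches;
    long long read_ms;
    long long regex_ms;
};

// Time the read-only pass, rewind the stream, then time the regex pass.
Timing benchmark(std::istream& in, const std::regex& re) {
    Timing t{};
    t.read_ms = time_ms([&] { t.lines = count_lines(in); });
    in.clear();   // reset the EOF state left by the first pass
    in.seekg(0);  // go back to the beginning of the stream
    assert(in.good() && "stream must be seekable to be read twice");
    t.regex_ms = time_ms([&] { t.matches = count_matches(in, re); });
    return t;
}
```

Passing an opened `std::ifstream` and the `log_line` regex to `benchmark` would give both numbers from one run. Compiler optimizations (for example `g++ -O2`) can change such numbers noticeably, so it helps to state the build flags next to any timing.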
Shray
  • Most C++ standard library regex implementations are not known for their speed. Stick with Perl, or use a more optimized/better performing C++ regex library. – Shawn Oct 31 '18 at 18:59
  • Are you compiling C++ with optimizations enabled? If you're matching the entire log line, try using `regex_match` instead of `regex_search`. – Praetorian Oct 31 '18 at 19:02
  • Do you know any optimized regex library? @Shawn No, I am just compiling `g++ log.cpp` @Praetorian Okay, let me try. But still, if I had to use regex_search only, is there any way to perform at par with perl? – Shray Oct 31 '18 at 19:34
  • @Praetorian If the bottleneck is the regexp matching, optimizing his code is not likely to make too much difference. – Barmar Oct 31 '18 at 20:44
  • Your C++ code isn't valid. – NetMage Oct 31 '18 at 21:24
  • @NetMage with due respect sir, this is just a part of the code to make things crisp and clear, and the code is running fine. :D – Shray Nov 01 '18 at 06:19
  • Can you explain where the close brace for the `if` is in your sample code? What about the "statement" `count++` that doesn't end in a semicolon? Did you copy and paste it into a C++ compiler? – NetMage Nov 01 '18 at 21:06
  • Are you able to change the regex engine you're using? Something like [pcrecpp](https://www.pcre.org/original/doc/html/pcrecpp.html) which implements PCRE for C++ might make a difference. Keep in mind this type of data crunching is what Perl was built for, so matching Perl's speed vs making your C++ regex faster may be two different goals. Edit: this [stackoverflow question](https://stackoverflow.com/questions/41481811/why-pcre-regex-is-much-faster-than-c11-regex) may add some perspective. – interduo Nov 01 '18 at 22:12

2 Answers

Use a code generator tool like re2c or ragel to compile your regular expression into C/C++ code (which can be optimized by the compiler).
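
To give a rough idea of why generated matchers can be fast, here is a hand-written sketch of straight-line matching code for just the date prefix of the pattern, `\d{1,2}/[A-Za-z]{3}/\d{1,4}:`. This is not output of re2c or Ragel, only an illustration of the general idea; the helper names `take`, `take_char`, `match_date_at` and `contains_date` are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Consume between `min` and `max` characters accepted by `pred`, starting at
// `pos`. Returns false if fewer than `min` characters were accepted.
template <typename Pred>
bool take(const std::string& s, std::size_t& pos, int min, int max, Pred pred) {
    assert(min <= max);
    int n = 0;
    while (n < max && pos < s.size() && pred(s[pos])) {
        ++pos;
        ++n;
    }
    return n >= min;
}

// Consume exactly the character `c` at `pos`, if it is there.
bool take_char(const std::string& s, std::size_t& pos, char c) {
    if (pos < s.size() && s[pos] == c) {
        ++pos;
        return true;
    }
    return false;
}

// Anchored match of \d{1,2}/[A-Za-z]{3}/\d{1,4}: starting at `pos`.
// Greedy consumption is enough here because each separator ('/' or ':')
// is not a member of the character class that precedes it.
bool match_date_at(const std::string& s, std::size_t pos) {
    auto digit = [](char c) { return c >= '0' && c <= '9'; };
    auto alpha = [](char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    };
    return take(s, pos, 1, 2, digit) && take_char(s, pos, '/') &&
           take(s, pos, 3, 3, alpha) && take_char(s, pos, '/') &&
           take(s, pos, 1, 4, digit) && take_char(s, pos, ':');
}

// Unanchored search, like regex_search: try every start position.
bool contains_date(const std::string& s) {
    for (std::size_t pos = 0; pos < s.size(); ++pos) {
        if (match_date_at(s, pos)) return true;
    }
    return false;
}
```

Code of this shape has no pattern interpretation left at run time, so the compiler can inline and optimize it like any other loop. A generator would produce the equivalent for the whole pattern from a specification instead of by hand.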

Alternatively, Boost.Regex -- which was the basis for std::regex -- may be faster than your std::regex implementation.

Also, the bottleneck might be I/O rather than regular expressions; see the question "Why is reading lines from stdin much slower in C++ than Python?"

Kelvin Sherlock
  • Okay, I need to look up on re2c or ragel tool. Kindly look in my edit, I have shared the time taken to read file and time taken to read and use regex. But the point is, I don't think reading the log file is a bottleneck here. Thanks ! – Shray Nov 01 '18 at 06:26

With boost::regex the C++ code flies like a jet. std::regex is not optimised for performance.
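
A sketch of how small the switch can be, assuming Boost is installed: `boost::regex`, `boost::smatch` and `boost::regex_search` mirror their `std` counterparts, so a namespace alias covers the loop from the question. The macro `USE_BOOST_REGEX`, the alias `rx` and the function names are hypothetical choices for this example; without the macro the same code builds against std::regex.

```cpp
#ifdef USE_BOOST_REGEX  // hypothetical macro, e.g. passed as -DUSE_BOOST_REGEX
#include <boost/regex.hpp>
namespace rx = boost;
#else
#include <regex>
namespace rx = std;
#endif

#include <cassert>
#include <cstddef>
#include <istream>
#include <string>

// Same call shape as regex_search(line, match, re) in the question,
// routed through the alias so the engine can be swapped at build time.
bool line_matches(const std::string& line, const rx::regex& re) {
    rx::smatch match;
    return rx::regex_search(line, match, re);
}

// Count the lines of `in` that contain a match for `re`.
std::size_t count_matches(std::istream& in, const rx::regex& re) {
    assert(in.good() && "log stream should be open and readable");
    std::size_t n = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (line_matches(line, re)) ++n;
    }
    return n;
}
```

Depending on the Boost version, linking with `-lboost_regex` may also be needed. The default syntaxes of the two libraries are similar but not identical, so it is worth checking that the match count stays the same after switching.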

Shray
  • This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. – Johan Jul 04 '19 at 10:45