2

I am learning regex in c++ and java. So i did a performance test on c++11 regex and java regex with same expression and same no of inputs. Strangely java regex is faster than c++11 regex. Is there anything wrong in my code? Pls correct me

Java code:

import java.util.regex.*;

public class Main {
    private final static int MAX = 1_000_000;
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        Pattern p = Pattern.compile("^[\\w._]+@\\w+\\.[a-zA-Z]+$");
        for (int i = 0; i < MAX; i++) {
            p.matcher("abcd_ed123.t12y@haha.com").matches();
        }
        long end = System.currentTimeMillis();
        System.out.print(end-start);
    }
}

C++ code:

#include <iostream>
#include <Windows.h>
#include <regex>

using namespace std;

int main()
{
    long long start = GetTickCount64();
    regex pat("^[\\w._]+@\\w+\\.[a-zA-Z]+$");
    for (long i = 0; i < 1000000; i++) {
        regex_match("abcd_ed123.t12y@haha.com", pat);
    }
    long long end = GetTickCount64();
    cout << end - start;
    return 0;
}

Performance:

Java -> 1003ms
C++  -> 124360ms
Eddy
  • 67
  • 7
  • 3
    Obvious question: How did you compile the c++ code? Did you have optimizations on? –  May 30 '18 at 17:29
  • 3
    So you conclude that there's something wrong with your code because Java performs better than C++? – ernest_k May 30 '18 at 17:29
  • 2
    Please confirm you coumpiled your C++ code with optimizations turned on. Also, you're not measuring just the regex, you're also measuring the print statement. I would not consider this a valid test. – Borgleader May 30 '18 at 17:29
  • i think it is because `std::regex` aren't precompiled meening that the regex and the expression to check are parsed at run-time – Tyker May 30 '18 at 17:31
  • Try it without printing the result to get the true value of the regex runtimes. Also, is the regex compiled on the C++ side? – Nick May 30 '18 at 17:31
  • how did you measure? I have the feeling that the numbers you quote primariliy reflect the time your code needs to print something on the console and the time needed for the regex is just some noise on top of that – 463035818_is_not_an_ai May 30 '18 at 17:32
  • @user463035818 The "measuring code" is in the question... – Borgleader May 30 '18 at 17:33
  • @Borgleader ups, then my feeling was right ;) – 463035818_is_not_an_ai May 30 '18 at 17:34
  • i use visual studio17 for compiling c++ – Eddy May 30 '18 at 17:38
  • 1
    Did you time a Debug or Release build? There can be a significant performance difference between them (Release enables optimizations). – Blastfurnace May 30 '18 at 17:42
  • 1
    Right! In Release build I get 1672ms for VS2017. – Bo Persson May 30 '18 at 17:46
  • i think i have problem in my VS build settings. what should i change?? or can u tell me the command line compilation code. – Eddy May 30 '18 at 17:54
  • Downvoters it would be nice if u give a solution. – Eddy May 30 '18 at 17:54
  • In release build I get no output at all for GCC. In case it helps: here's the hotspot JVM assembly output [from `LD_PRELOAD=/home/sehe/Projects/stackoverflow/fcml-1.1.3/example/hsdis/.libs/libhsdis-amd64.so ./jre1.8.0_171/bin/java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Main 2>&1 > disasm.a`](http://stackoverflow-sehe.s3.amazonaws.com/fea76143-b712-4df9-97c3-4725b2f9e695/disasm.a.xz) – sehe May 30 '18 at 18:13
  • This is going to depend on the regex engine. The one in `GCC` for example is not the fastest. Try it with `PCRE` (pcrecpp) and it should be a great deal faster. – Galik May 30 '18 at 18:58
  • In my tests, `GCC` regex is 20% slower than the `Java` one but using `PCRE` regex is 3 times *faster* that the `Java` test. – Galik May 30 '18 at 19:05
  • The MSVC regex implementation is [notoriously slow and in need of an overhaul](https://twitter.com/StephanTLavavej/status/958129309130014720). A Better comparison would be boost::regex. – Mgetz May 30 '18 at 19:34
  • Related https://stackoverflow.com/a/14229152/332733 – Mgetz May 30 '18 at 19:45
  • And LLVM's libc++'s std::regex is even faster (2x) than boost regex in my tests. – 0xB00B Sep 26 '21 at 15:31

3 Answers3

16

Made the C++ sample portable:

#include <iostream>
#include <chrono>
#include <regex>

using C = std::chrono::high_resolution_clock;
using namespace std::chrono_literals;

int main()
{
    auto start = C::now();
    std::regex pat("^[\\w._]+@\\w+\\.[a-zA-Z]+$");
    for (long i = 0; i < 1000000; i++) {
        regex_match("abcd_ed123.t12y@haha.com", pat);
    }
    std::cout << (C::now() - start)/1.0ms;
}

On linux, and with clang++ -std=c++14 -march=native -O3 -o clang ./test.cpp I get 595.970 ms. See also Live On Wandbox

The java runs in 561 ms, on the same machine.

Update: Boost Regex runs much faster, see below comparative benchmark

Caveat: synthetic benchmarks like these are very prone to error: the compiler might sense that no observable side effects are done, and optimize the whole loop out, just to give an example.

More Fun: Adding Boost To The Mix

Using Boost 1.67 and Nonius Micro-Benchmarking Framework

enter image description here

We can see that Boost's Regex implementations are considerably faster.

See the detailed sample data interactive online: https://plot.ly/~sehe/25/

Code Used

#include <iostream>
#include <regex>
#include <boost/regex.hpp>
#include <boost/xpressive/xpressive_static.hpp>
#define NONIUS_RUNNER
#include <nonius/benchmark.h++>
#include <nonius/main.h++>

template <typename Re>
void test(Re const& re) {
    regex_match("abcd_ed123.t12y@haha.com", re);
}

static const std::regex std_normal("^[\\w._]+@\\w+\\.[a-zA-Z]+$");
static const std::regex std_optimized("^[\\w._]+@\\w+\\.[a-zA-Z]+$", std::regex::ECMAScript | std::regex::optimize);
static const boost::regex boost_normal("^[\\w._]+@\\w+\\.[a-zA-Z]+$");
static const boost::regex boost_optimized("^[\\w._]+@\\w+\\.[a-zA-Z]+$", static_cast<boost::regex::flag_type>(boost::regex::ECMAScript | boost::regex::optimize));

static const auto boost_xpressive = []{
    using namespace boost::xpressive;
    return cregex { bos >> +(_w | '.' | '_') >> '@' >> +_w >> '.' >> +alpha >> eos };
}();

NONIUS_BENCHMARK("std_normal",      [] { test(std_normal);      })
NONIUS_BENCHMARK("std_optimized",   [] { test(std_optimized);   })
NONIUS_BENCHMARK("boost_normal",    [] { test(boost_normal);    })
NONIUS_BENCHMARK("boost_optimized", [] { test(boost_optimized); })
NONIUS_BENCHMARK("boost_xpressive", [] { test(boost_xpressive); })

Note Here's the output of the Hotspot JVM JIT compiler:

This was generated using

LD_PRELOAD=/home/sehe/Projects/stackoverflow/fcml-1.1.3/example/hsdis/.libs/libhsdis-amd64.so ./jre1.8.0_171/bin/java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Main 2>&1 > disasm.a

sehe
  • 374,641
  • 47
  • 450
  • 633
  • 1
    Try passing [`std::regex::optimize`](http://en.cppreference.com/w/cpp/regex/syntax_option_type) to the constructor to force a FSA creation (in theory) – Mgetz May 30 '18 at 19:50
  • 1
    @Mgetz optimize makes the engine generate a DFA (deterministic finite automaton) instead of an NFA (non-deterministic finite automaton). They're both Finite State Automata, but a DFA takes more effort to generate (although it will run faster and cannot represent all regexes). – Stephen M. Webb May 30 '18 at 20:11
  • @Mgetz Or `optimize` does something else than we expect. Adding [clang+libc++](https://plot.ly/~sehe/19/). Also, for fun: comparative benchmark using GCC;libstdc++;Boost1.67: https://plot.ly/~sehe/21/ ([benchmark code](https://paste.ubuntu.com/p/FYn4nmfSZn/)) – sehe May 30 '18 at 20:19
  • Added a Boost Xpressive [static regex](https://www.boost.org/doc/libs/1_67_0/doc/html/xpressive/user_s_guide.html#xpressive.user_s_guide.creating_a_regex_object) into the mix. The Benchmark is still at only [<30 LoC](https://paste.ubuntu.com/p/hTJnC3WjFr/) and the results are still [clearly favouring Boost Regex](https://plot.ly/~sehe/23/) – sehe May 30 '18 at 20:53
  • 1
    @rustyx That was a good call. Adding `ECMAScript | optimized` made the optimized regexen slightly faster - making a bit more sense. [Updated the answer](https://stackoverflow.com/posts/50611867/revisions) (again). – sehe May 30 '18 at 21:07
  • 1
    Also the speed is implementation dependent, in my tests on arm64 clang on linux with full optimizations in release mode, I found that `libc++` regex implementation is about 16x faster than `libstdc++` one. And about 2x faster than boost version. – 0xB00B Sep 26 '21 at 15:25
5

Just to add to the already provided answers...

Yes, the C++11 std::regex is slightly slower (even in Release mode) than the Java one.

But PCRE2 with JIT is 3 times faster:

#include <iostream>
#include <chrono>
#define PCRE2_STATIC
#define PCRE2_CODE_UNIT_WIDTH 8
#include "pcre2.h"

using namespace std;
using namespace std::chrono;
using namespace std::chrono_literals;

int main()
{
    auto start = high_resolution_clock::now();
    int errn;
    PCRE2_SIZE erroffset;
    auto pattern = (PCRE2_SPTR8)"^[\\w._]+@\\w+\\.[a-zA-Z]+$";
    pcre2_code* re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0, &errn, &erroffset, nullptr);
    if (!re)
        cerr << "pcre2_compile failed\n";
    pcre2_match_data* match_data = pcre2_match_data_create_from_pattern(re, nullptr);
    for (long i = 0; i < 1000000; i++) {
        auto text = (PCRE2_SPTR8)"abcd_ed123.t12y@haha.com";
        int rc = pcre2_match(re, text, PCRE2_ZERO_TERMINATED, 0, 0, match_data, nullptr);
        if (rc <= 0)
            cerr << "pcre2_match failed\n";
    }
    auto end = high_resolution_clock::now();
    cout << (end - start) / 1ms << "\n";
    return 0;
}

Results:

  • PCRE2 v10.21: 139ms
  • Java: 440ms
rustyx
  • 80,671
  • 25
  • 200
  • 267
  • "JIT" implies PCRE does compile directly into machine instructions? From what I've seen regex usually compile into DFA, with the engine being highly optimized, but clearly ahead-of-time compiled code. +1 for the data point – sehe May 31 '18 at 21:16
  • 2
    @sehe yes, it compiles [directly](https://github.com/luvit/pcre2/tree/master/src/sljit) into machine instructions. Kind of hard to beat – rustyx May 31 '18 at 21:45
  • TIL. Awesome library. Nice addition! – sehe May 31 '18 at 21:46
2

As various commenters have pointed out it sounds like you are compiling your C++ code in debug mode which turns off many compiler optimizations and adds some extra diagnostic code to your program.

Since you are using Visual Studio 2017 look for the Solution Configuration drop-down and change it from Debug to Release.

screenshot of Visual Studio 2017

Frank Boyne
  • 4,400
  • 23
  • 30