8

My Questions is

Is there any regular expression engine that does Just-In-Time compiling during regex pattern parsing and use when matching / replacing the texts? Or where can I learn JIT for i386 or x64 architecture?

Why I need it

I was recently trying to benchmark Python’s built-in regex engine compared with normal C code with around 10MB of data.

I found that for a straightforward replacing (for example ab to zzz) it’s relatively fast: just 2 to 3 times slower than C.

But for [a-z]c it took around 5 to 8 times as much time as C.

And with grouping (e.g. ([a-z])(c) to AA\2\1BB ) it took 20 to 40 times as much time as C.

It’s not Just-In-Time compiling yet, but I think, if I could do Just-In-Time compiling, It could speed up a lot more.

PS: I use profiling for each regex pattern during compiling patterns, for example, profile 1 for simple one like ab, profile 2 for range [a-z]c, profile 3 with grouping ([a-z])(c), each profile has separate codes, so no extra cost needed when matching and replacing simple patterns.

Update 1

I have tried it with psyco, and it doesn’t improve the speed that much. May be because I am doing text replacing against big data, not looping many times.

If I am not wrong, Python’s re.sub is running it natively already I think, so pysco cannot improve the speed that much.

Update 2

I have tried with boost regex wrapped into python, but it’s even slower than Python’s regex, so it seems like the bottleneck is in Python’s string processing and Jan Goyvaerts has also pointed me to that in the answer.

Update

I’d like to convert regex pattern ab[a-z]c to machine code, like the following equivalent C code (*s points to 10MB long texts):

do{
    if(*s=='a' && s[1]=='b' && s[2]>='a' && s[2]<='z' && s[3]=='c') return 1;
}while(*s++);
return 0;

Any ideas?

Sebastian Simon
  • 18,263
  • 7
  • 55
  • 75
YOU
  • 120,166
  • 34
  • 186
  • 219
  • 2
    Just as an idea. If you use Boost.Regex what results you will have? –  Nov 20 '09 at 08:41
  • Ok, I will test with Boost too soon. thx – YOU Nov 20 '09 at 08:45
  • Are you trying to speed up regexes for a particular application, or are you trying to learn about JIT compilation? – int3 Nov 20 '09 at 09:11
  • >> Are you trying to speed up regexes for a particular application. Not really for now, I read Russ Cox's why regex are slower, http://swtch.com/~rsc/regexp/regexp1.html , and trying to figure out why its slow , and trying to test it can be improvable or not. If I could do with JIT, I can translate [a-z]c to machine codes real-time during compiling regex pattern, I think It could get very fast speed. just my thinking though. – YOU Nov 20 '09 at 09:16
  • 3
    eh... I believe that article is saying that popular implementations of regex have _algorithmic_ inefficiencies. Getting them compiled might improve things by a constant factor, but it won't fix the underlying problem. – int3 Nov 20 '09 at 09:21
  • right, that article says like that, but I am going bit far , how can I get really fast regex engine, something like that. So onething I comes to my mind is to use JIT. thx anyway. – YOU Nov 20 '09 at 09:26
  • 2
    @Mark: Do Not Comment On Your Own Question. Please UPDATE your question with all the additional answers in the comments. You own the question. You can clarify it. Please clarify the question and make the comments irrelevant. – S.Lott Nov 20 '09 at 11:07
  • @S.Lott: Thanks for pointing out. I will update it. – YOU Nov 20 '09 at 11:52
  • I have updated My Question to Update 2. – YOU Nov 20 '09 at 12:18
  • 1
    I got to this question through the close vote review queue. This has been flagged and close-voted as _off-topic_ with the reason that library and software recommendations are off-topic on this site. However, the description of that reason also recommends to _“describe the problem and what has been done so far to solve it”_. This has been done. This question does not exclusively boil down to a software recommendation request. Leaving open. – Sebastian Simon Sep 10 '15 at 03:47

9 Answers9

5

PCRE has a JIT compiler since 8.20. You can read about here: http://sljit.sourceforge.net/pcre.html

dark100
  • 171
  • 2
  • 1
3

The only regex engine that I know that can compile regular expressions into executable code is the one in .NET when you pass RegexOptions.Compiled. That causes the Regex class to emit MSIL which can then be JITted like any other .NET code.

Whether than makes the .NET regex engine faster than others is a totally different matter. When searching and replacing using relatively simple regular expressions on large data sets, string handling becomes far more important. .NET strings are immutable, so much will depend on how many times the string needs to be reallocated.

Hand-coding the operation will always be faster, because the code isn't equivalent. The regex code maintains certain information about the regex match and the capturing groups which your code does not. In most situations, the extra time you spend hand-coding the search-and-replace instead of using a regex isn't worth the effort, particularly if you factor in that switching to a different regex is trivial when your requirements change, while rewriting the search-and-replace using procedural code takes much more time.

In my experience, PCRE is one of the fastest regex engines around. It doesn't include a ready-made search-and-replace, however.

Jan Goyvaerts
  • 21,379
  • 7
  • 60
  • 72
2

I'm no Python expert, but you could give Psycho a try:

http://www.ibm.com/developerworks/library/l-psyco.html

http://psyco.sourceforge.net/

Bart Kiers
  • 166,582
  • 36
  • 299
  • 288
  • Thanks, I will try it with again. – YOU Nov 20 '09 at 08:51
  • You're welcome Mark. Note that it may very well be a bit trickier than simply adding a `psyco.jit()` at the top of your script: be sure to read the IBM tutorial. Best of luck! – Bart Kiers Nov 20 '09 at 08:54
  • @Bart - I see, I just tried with psyco.full(), ([a-z])(c) patterns become 25x previously 30x slower, on my current pc with 7.8M Characters. I am reading IBM tutorial now. thx. again. – YOU Nov 20 '09 at 09:03
2

So if I understand you correctly, you use a programming language that by default does not do just-in-time compiling and now you are looking for a regex library that does precisely that?

I think you should compile all of your python code to binary using e.g. Psyco

http://www.devshed.com/c/a/Python/How-Python-Runs-Programs/4/

also discussed here:

Is it feasible to compile Python to machine code?

and here:

Is it possible to compile Python natively (beyond pyc byte code)?

If these solutions either don't work or are still not fast enough and if you absolutely want to write the rest of your application in python, there is the boost python c++ library:

http://www.boost.org/doc/libs/1_41_0/libs/python/doc/index.html

The boost.python library allows full interoperability between python and c++. Then, you could use the boost.regex c++ regex matcher:

http://www.boost.org/doc/libs/1_41_0/libs/regex/doc/html/index.html

Community
  • 1
  • 1
Sebastian
  • 4,802
  • 23
  • 48
  • thanks, I will try to test with psyco enabled again. – YOU Nov 20 '09 at 08:49
  • >> you are looking for a regex library that does precisely that. yes, kind of, I just tested with psyco, its does not improve that much actually, my idea is that, to translate [a-z]c to machine codes that do similar thing like hard-coding in C. thx – YOU Nov 20 '09 at 09:19
  • 1
    I've added the reference to boost.python and boost.regex. – Sebastian Nov 20 '09 at 09:43
2

The regular expression engine in Firefox compiles some (not all!) regular expressions to machine code. I believe Safari and Chrome do too.

Jason Orendorff
  • 42,793
  • 6
  • 62
  • 96
2

I don't see it in your question, so I ask: Did you test with precompiled regular expressions e.g. "re.compile(pattern)" ??

Since compiled regexes should be faster. OK, it is not JIT, but most of the time you are fine with simply precompiled ones!

See here:

re.compile

Juergen
  • 12,378
  • 7
  • 39
  • 55
  • Oh ok, I didnt realize that doing re.compile is faster, let me test that. – YOU Nov 20 '09 at 09:05
  • Seems like re.compile didnt help any improvement for me, btw, I am doing against 7.8M Characters with re.sub("([a-z])(c)","AA\\2\\1BB", not like I am doing re.compile in Million Loops. thx anyway – YOU Nov 20 '09 at 09:12
  • Oh, I see your problem now more clearly. – Juergen Nov 20 '09 at 09:34
  • 1
    But you should do `re.compile` **before** the loop, then run the compiled regex within the loop. – Daniel Roseman Nov 20 '09 at 09:41
  • >> But you should do re.compile before the loop, Ah, mine is not inside the loop, I meant I am not Looping Regex, the test data is just big like ~8 M characters, thx – YOU Nov 20 '09 at 09:45
  • 1
    The re module compiles and caches string regexes, so pre-compiling the regex often doesn't change the speed of the matching. – Ned Batchelder Nov 20 '09 at 19:21
2

Another idea: When you have a library (in C) that is more optimal than the Python regex module or that does just-in-time compilation of Regexes, then you could write your own regex module for python that does just wrap your C-Library.

That of course is somewhat more work and only recommended when you really, really need the speed.

You could also try Cython (personally I did not use it yet, but it sounds rather good) to do the job of wrapping.

As much as I understand your problem now, the Python surrounding is not your problem (so I doubt whether psyco will help) -- also the preparation of the regex-run is not your problem, but the run itself must be top-speed. That of course depends on the library you use and how good it can handle large strings. I would think, that the standard python regex-lib is not optimized for such long strings and top-of-the-notch speed.

Juergen
  • 12,378
  • 7
  • 39
  • 55
  • Thanks for the hints, Actually, I already wrapped my C codes to python and testing python regex and my hardcorded C code as python extension. – YOU Nov 20 '09 at 10:00
1

I may be wrong, but I believe that Python's regex module is in C, so any suggestion to compile Python (like using Psycho) would not make much difference---what you're actually comparing is the performance of one C regex library (Python's) with another (whatever library you are using).

Tommy McGuire
  • 1,223
  • 13
  • 16
  • Yeah, I am also thinking like that too. Actual codes are in sre.c If I am not wrong. – YOU Nov 20 '09 at 18:23
1

Thompson had a paper published in the communications of the ACM in 1968 that described a working JIT compiler for regular expressions into IBM 7094 code. I don't know what language(s) he used; Fortran or LISP would be the obvious suspects, with LISP being especially suspect since it already had JIT compiling.

Christopher Creutzig
  • 8,656
  • 35
  • 45