My question is
Is there any regular expression engine that does just-in-time compilation of the regex pattern, for use when matching/replacing text? Or where can I learn JIT compilation for the i386 or x64 architecture?
Why I need it
I recently benchmarked Python's built-in regex engine against plain C code on around 10 MB of data.
I found that for a straightforward replacement (for example, ab to zzz) it is relatively fast: only 2 to 3 times slower than C.
But for [a-z]c it took around 5 to 8 times as long as C.
And with grouping (e.g. ([a-z])(c) replaced with AA\2\1BB) it took 20 to 40 times as long as C.
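A minimal sketch of the kind of benchmark described above (the synthetic text and pattern names are my own illustration, not the original 10 MB corpus or measurements):

```python
import re
import time

# Synthetic input for illustration; the original benchmark used ~10 MB of real text.
data = "abcxbcacdabc" * 100_000

patterns = [
    ("literal",  r"ab",         "zzz"),        # simple literal replacement
    ("range",    r"[a-z]c",     "zzz"),        # character-class pattern
    ("grouping", r"([a-z])(c)", r"AA\2\1BB"),  # groups plus backreferences
]

for name, pat, repl in patterns:
    rx = re.compile(pat)
    start = time.perf_counter()
    rx.sub(repl, data)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.3f}s")
```

In my experience the grouping case is the slowest because every match has to capture groups and expand backreferences in the replacement template.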
This is not just-in-time compilation yet, but I think that with JIT compilation it could be sped up a lot more.
PS: I profile each regex pattern while compiling it: for example, profile 1 for a simple pattern like ab, profile 2 for a range like [a-z]c, and profile 3 for grouping like ([a-z])(c). Each profile has its own separate code, so no extra cost is incurred when matching and replacing simple patterns.
Update 1
I have tried it with Psyco, and it doesn't improve the speed much, maybe because I am doing text replacement against big data rather than looping many times.
If I am not wrong, Python's re.sub already runs natively, so Psyco cannot improve the speed much.
Update 2
I have tried Boost.Regex wrapped into Python, but it's even slower than Python's regex engine, so it seems the bottleneck is Python's string processing; Jan Goyvaerts also pointed this out in his answer.
Update 3
I'd like to convert the regex pattern ab[a-z]c to machine code, like the following equivalent C code (s points to the 10 MB text):
do {
    if (s[0]=='a' && s[1]=='b' && s[2]>='a' && s[2]<='z' && s[3]=='c')
        return 1;
} while (*s++);
return 0;
Any ideas?