0

I have a string

aaaa bbbb cccc=v1

I want to capture cccc=v1 (the field_value pair, not the exact "cccc"). To improve the performance I used atomic groups so that it should not waste time backtracking if = not found

\b[^\s=]++=\w+

But what happen is even though it does not back track, it check each character in string like following

aaaa bbbb cccc=v1
^
aaaa bbbb cccc=v1
 ^
aaaa bbbb cccc=v1
  ^
aaaa bbbb cccc=v1
   ^

In this case, Can I skip the matching when the atomic group was not captured. something like following

aaaa bbbb cccc=v1
^
aaaa bbbb cccc=v1
     ^
aaaa bbbb cccc=v1
          ^

I think it definitely should improve the performance.

JohnyL
  • 6,894
  • 3
  • 22
  • 41
Jay Joshi
  • 1,402
  • 1
  • 13
  • 32
  • What environment are you in? – CertainPerformance Dec 22 '18 at 04:17
  • I don't see any way around checking each term to see if it might contain `=`, in which case it would qualify as a key value pair. – Tim Biegeleisen Dec 22 '18 at 04:19
  • It's a software in which I have to provide a regex. So.. I think it should be python based regex. – Jay Joshi Dec 22 '18 at 04:20
  • This is premature optimization. – dawg Dec 22 '18 at 04:20
  • @dawg, I am sorry I am not aware of regex much. I just though if I could skip the atomic group it should improve the performance. But if it's not possible I can totally understand. – Jay Joshi Dec 22 '18 at 04:22
  • I agree with @dawg, unless your inputs are enormous there's no need to optimize this (especially if you're using Python anyway). If you do need this to run quickly on large inputs, regex isn't the right tool. – robinsax Dec 22 '18 at 04:22
  • Believe me the input i am getting is too big. It took more than 60 seconds for a regex to run on 1 million text events. That's why regex matters a lot in the software. The software I am talking about is a log monitoring tool and uses regex to extract the fields. I hope I am making sense – Jay Joshi Dec 22 '18 at 04:24
  • I would do `(?<=[ ])(\w+=\w+)$` and call it a day. Lookbacks are quite efficient and are Perl and Python compatible. Anchors also increase speed. [DEMO](https://regex101.com/r/FXd0Ra/1) – dawg Dec 22 '18 at 04:25
  • Are the substrings that don't contain the `=`s always word characters, eg the `aaaa` and `bbbb` parts? – CertainPerformance Dec 22 '18 at 04:30
  • @CertainPerformance, Yes.!! That's the exact understanding.!! words which are not fields are always a word. like `aaaa` and `bbbb`. – Jay Joshi Dec 22 '18 at 04:32

2 Answers2

2

One option to match all of the characters in a non-= word at once would be to use (?:\w+ )* at the beginning of the pattern. (If an = word is not guaranteed, make this possessive, to prevent backtracking.) Then, use \K to forget about the text previously matched, and match the = word with [^\s=]++=\w+:

^(?:\w+ )*+\K[^\s=]+=\w+

https://regex101.com/r/RVogoh/5

Still, it's only a moderate improvement when the entire string to search is small - 63 steps, compared to the original

https://regex101.com/r/RVogoh/1/

which takes 90 steps. This implementation becomes significantly more efficient than the original character-by-character test only when there are many characters.

Note that \K is not supported in the re module - for that, you'll have to use pypi's regex module.

CertainPerformance
  • 356,069
  • 52
  • 309
  • 320
  • If it is a Python regex, `\K` and atomic groups are not supported. – dawg Dec 22 '18 at 04:43
  • Yep, see the last line - need to use the `regex` module – CertainPerformance Dec 22 '18 at 04:44
  • Enclosing `(?:\b\w+ )*` in an atomic group or making it possessive (i.e. `(?:\b\w+ )*+`) would be a wise step to take in order to prevent unnecessary backtracks on failures ([this](https://regex101.com/r/RVogoh/3) vs [this](https://regex101.com/r/RVogoh/4)). – revo Dec 22 '18 at 05:52
  • @revo Good point, thanks, and I can remove the `\b` from inside the repeated group too – CertainPerformance Dec 22 '18 at 06:28
  • You're welcome, right and you could remove possessive quantifier on `[^\s=]++` safely too as `=` is excluded from `\S`. – revo Dec 22 '18 at 06:37
0

Updated

I think you will find that most sensible regex's will be reasonably performant on find the key=value pair at the end of the line. (Even with long lines and partial matches.)

Here are some timings. I have used the cmpthese function from this SO post to compare relative timing:

import re 
import regex 

def f1():
    # re from my comment
    return re.findall(r'(?<=[ ])(\w+=\w+)$', txt, flags=re.M)

def f2():
    # the OP's regex
    return regex.findall(r'\b([^\s=]++=\w+)', txt, flags=re.M)

def f3():
    # alternate regex 
    return re.findall(r'(\w+=\w+)$', txt, flags=re.M)   

def f4():
    # CertainPerformance updated regex
    return regex.findall(r'^(?:\w+ )*+\K[^\s=]+=\w+', txt, flags=regex.M)   

def f5():
    return [line.split()[-1] for line in txt.splitlines() if re.match(r'^(\w+=\w+)$', line.split()[-1])]    

txt='''\
a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a bc d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d e=
aaaa bbbb cccc=v1
aaaa bbbb cccc
aaaa bbbb cccc=v1
'''*1000000


cmpthese([f1,f2,f3,f4,f5],c=3)  

This prints on Python 2 (slowest on top, fastest on the bottom):

   rate/sec    usec/pass     f2     f4     f1     f3     f5
f2        0 36721115.669     -- -27.2% -72.0% -72.0% -77.5%
f4        0 26715482.632  37.5%     -- -61.4% -61.5% -69.0%
f1        0 10300210.953 256.5% 159.4%     --  -0.0% -19.6%
f3        0 10296802.362 256.6% 159.5%   0.0%     -- -19.6%
f5        0  8280366.262 343.5% 222.6%  24.4%  24.4%     --

And Python 3:

   rate/sec    usec/pass     f2     f4     f3     f1     f5
f2        0 40880883.330     -- -42.3% -64.4% -70.3% -78.3%
f4        0 23592684.768  73.3%     -- -38.4% -48.6% -62.3%
f3        0 14544536.920 181.1%  62.2%     -- -16.6% -38.9%
f1        0 12131648.781 237.0%  94.5%  19.9%     -- -26.7%
f5        0  8888514.997 359.9% 165.4%  63.6%  36.5%     --

I believe the slowness of f2 and f4 are more likely from using the regex module vs the re module but the regex in those functions require using the regex module. The regex in f4 under an apples to apples comparison should be fast.

You can see that adding a look behind anchor slightly increases the speed vs the others using the re module. The regex module is more likely to be the culprit for f4 being slower than the others. In theory, that is a faster regex in Perl for example.


The comments and 'performance estimate' focus only the number of 'steps' in regex101. This is an incomplete picture of relative performance of different regex expressions. Regex101 also has a ms rating for the time necessary to complete the regex -- which is server land dependent. Certain regex steps are faster than others.

Consider the regex (?<=[ ]) In regex101, in this example, it takes 205 steps and ~2ms at the moment it was run.

Now consider the simpler regex of [ \t] It takes 83 steps but the same ~2ms to run.

Now consider the more complex regex of (\w+)\1\b While it is 405 steps, it takes almost 5x longer to run.

While the steps are an indicator of regex speed, not each steps takes the same time to execute. You also need to look at total execution time.

dawg
  • 98,345
  • 23
  • 131
  • 206
  • Regexes performance should be both benchmarked on desired and undesired input strings for a real reasoning. I'd like you to run the same benchmark on the input string from here https://regex101.com/r/RVogoh/6 (and update the regex from CertainPerformance) – revo Dec 22 '18 at 07:20
  • @revo: Post updated. Steps are an incomplete indicator of performance. While a loopback seems to have more 'steps' than an alternate, it is also fixed strings so faster steps vs slower steps like backtracking. **Not all steps, are equal...** – dawg Dec 22 '18 at 17:05
  • *it takes 205 steps and ~2ms...* if you refresh the page you may see 0ms. I personally don't care about these times and steps. Have you benchmarked `(?<=[ ])(\w+=\w+)$` with `regex` module? – revo Dec 22 '18 at 17:09
  • @revo: *Have you benchmarked (?<=[ ])(\w+=\w+)$ with regex module?* -- just did, similar results. – dawg Dec 22 '18 at 17:15