Updated
I think you will find that most sensible regexes perform reasonably well at finding the key=value pair at the end of the line (even with long lines and partial matches).
Here are some timings. I used the cmpthese function from this SO post to compare relative timing:
import re
import regex

def f1():
    # re from my comment
    return re.findall(r'(?<=[ ])(\w+=\w+)$', txt, flags=re.M)

def f2():
    # the OP's regex
    return regex.findall(r'\b([^\s=]++=\w+)', txt, flags=re.M)

def f3():
    # alternate regex
    return re.findall(r'(\w+=\w+)$', txt, flags=re.M)

def f4():
    # CertainPerformance updated regex
    return regex.findall(r'^(?:\w+ )*+\K[^\s=]+=\w+', txt, flags=regex.M)

def f5():
    # no regex search: split each line and validate only the last token
    return [line.split()[-1] for line in txt.splitlines() if re.match(r'^(\w+=\w+)$', line.split()[-1])]

txt='''\
a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a bc d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d a b c d e=
aaaa bbbb cccc=v1
aaaa bbbb cccc
aaaa bbbb cccc=v1
'''*1000000
cmpthese([f1,f2,f3,f4,f5],c=3)
This prints on Python 2 (slowest on top, fastest on the bottom):
    rate/sec     usec/pass      f2      f4      f1      f3      f5
f2         0  36721115.669      --  -27.2%  -72.0%  -72.0%  -77.5%
f4         0  26715482.632   37.5%      --  -61.4%  -61.5%  -69.0%
f1         0  10300210.953  256.5%  159.4%      --   -0.0%  -19.6%
f3         0  10296802.362  256.6%  159.5%    0.0%      --  -19.6%
f5         0   8280366.262  343.5%  222.6%   24.4%   24.4%      --
And Python 3:
    rate/sec     usec/pass      f2      f4      f3      f1      f5
f2         0  40880883.330      --  -42.3%  -64.4%  -70.3%  -78.3%
f4         0  23592684.768   73.3%      --  -38.4%  -48.6%  -62.3%
f3         0  14544536.920  181.1%   62.2%      --  -16.6%  -38.9%
f1         0  12131648.781  237.0%   94.5%   19.9%      --  -26.7%
f5         0   8888514.997  359.9%  165.4%   63.6%   36.5%      --
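The cmpthese helper itself is not reproduced in this answer. As a rough stand-in, assuming only the standard library's timeit, something like the following gives comparable relative timings. The names compare_timings and print_comparison are hypothetical (they are not the helper from the linked post), and the output is a simpler layout than the tables above.

```python
import timeit

def compare_timings(funcs, repeat=3):
    """Time zero-argument functions; return (name, seconds) pairs, slowest first."""
    results = []
    for func in funcs:
        # With number=1, each entry from timeit.repeat is one call's duration.
        # The minimum is usually the least noisy estimate.
        best = min(timeit.repeat(func, repeat=repeat, number=1))
        results.append((func.__name__, best))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results

def print_comparison(results):
    """Print usec/pass and the speedup of each entry relative to the slowest."""
    slowest = results[0][1]
    for name, seconds in results:
        faster = (slowest / seconds - 1.0) * 100.0 if seconds > 0 else float("inf")
        print("{:<12} {:>14.3f} usec/pass {:>8.1f}% faster than slowest".format(
            name, seconds * 1e6, faster))

# Tiny demonstration with two throwaway functions (not the regexes above).
def short_loop():
    return sum(range(1000))

def long_loop():
    return sum(range(100000))

print_comparison(compare_timings([short_loop, long_loop]))
```

With the functions from this answer, the call would be print_comparison(compare_timings([f1, f2, f3, f4, f5])).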
I believe the slowness of f2 and f4 is more likely due to using the regex module rather than the re module, but the patterns in those functions (possessive quantifiers and \K) require the regex module. In an apples-to-apples comparison, the regex in f4 should be fast.
You can see that, among the functions using the re module, adding a lookbehind anchor (f1 vs. f3) slightly increases the speed on Python 3 and makes no measurable difference on Python 2. The regex module is the more likely culprit for f4 being slower than the others. In theory, that is a faster regex in Perl, for example.
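One way to check the module-overhead hypothesis is to run a pattern that both modules accept through each of them on the same text. The sketch below does that with the f3 pattern. The helper name time_findall and the smaller sample text are assumptions for illustration, and the third-party regex module is skipped if it is not installed. No particular outcome is implied; results will vary by version and machine.

```python
import re
import timeit

def time_findall(module, pattern, text, repeat=3):
    """Best-of-`repeat` time in seconds for findall with the MULTILINE flag."""
    compiled = module.compile(pattern, module.M)
    return min(timeit.repeat(lambda: compiled.findall(text), repeat=repeat, number=1))

# Smaller stand-in for the benchmark text, so the sketch runs quickly.
sample = "aaaa bbbb cccc=v1\naaaa bbbb cccc\n" * 10000
pattern = r'(\w+=\w+)$'  # the f3 pattern, valid in both re and regex

engines = [re]
try:
    import regex  # third-party module; optional here
    engines.append(regex)
except ImportError:
    pass

for engine in engines:
    seconds = time_findall(engine, pattern, sample)
    print("{:<6} {:.6f} s".format(engine.__name__, seconds))
```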
The comments and 'performance estimate' focus only on the number of 'steps' in regex101. This is an incomplete picture of the relative performance of different regexes. Regex101 also reports a ms figure for the time needed to complete the regex, which depends on the server. Some regex steps are faster than others.
Consider the regex (?<=[ ]). In regex101, in this example, it takes 205 steps and ~2ms at the moment it was run.
Now consider the simpler regex [ \t]. It takes 83 steps but the same ~2ms to run.
Now consider the more complex regex (\w+)\1\b. While it is 405 steps, it takes almost 5x longer to run.
While the step count is an indicator of regex speed, not every step takes the same time to execute. You also need to look at total execution time.
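Total execution time can also be measured locally rather than read off regex101. Here is a minimal sketch that times the three patterns above with Python's re module. The helper name time_pattern and the sample text are made up for illustration, and the timings it prints are not comparable to regex101's step counts or ms figures.

```python
import re
import timeit

def time_pattern(pattern, text, repeat=5):
    """Return (number of matches, best-of-`repeat` seconds) for re.findall."""
    compiled = re.compile(pattern)
    count = len(compiled.findall(text))
    seconds = min(timeit.repeat(lambda: compiled.findall(text), repeat=repeat, number=1))
    return count, seconds

# Hypothetical sample text; substitute the text you actually care about.
sample = "aaaa bbbb cccc=v1 abab xyzxyz\n" * 2000

for pattern in [r'(?<=[ ])', r'[ \t]', r'(\w+)\1\b']:
    count, seconds = time_pattern(pattern, sample)
    print("{:<12} {:>7} matches {:>10.3f} ms".format(pattern, count, seconds * 1e3))
```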