Regex performance

Question

I am benchmarking different approaches to RegEx and seeing something I really don't understand. I am specifically comparing using the -match operator vs using the [regex]::Matches() accelerator. I started with

(Measure-Command {
    foreach ($i in 1..10000) {
        $path -match $testPattern
    }
}).TotalSeconds
(Measure-Command {
    foreach ($i in 1..10000) {
        [regex]::Matches($path, $testPattern)
    }
}).TotalSeconds

and -match is always very slightly faster. But it's also not apples to apples because I need to assign the [Regex] results to a variable to use it. So I added that

(Measure-Command {
    foreach ($i in 1..10000) {
        $path -match $testPattern
    }
}).TotalSeconds
(Measure-Command {
    foreach ($i in 1..10000) {
        $test = [regex]::Matches($path, $testPattern)
    }
}).TotalSeconds

And now [Regex] is consistently slightly faster, which makes no sense because I added to the workload with the variable assignment. The performance difference is ignorable, 1/100th of a second when doing 10,000 matches, but I wonder what is going on under the hood to make [Regex] faster when there is a variable assignment involved?

For what it's worth, without the variable assignment -match is faster, .05 seconds vs .03 seconds. With variable assignment [Regex] is faster by .03 seconds vs .02 seconds. So while it IS all negligible, adding the variable cuts [Regex] processing time more than in half, which is a (relatively) huge delta.

Have you looked at precompiled regexes (e.g. ```$regex = [regex]::new($pattern, "Compiled, IgnoreCase, CultureInvariant"); $result = $regex.Matches($text)```)? You can get a *significant* speedup in tight loops - testing locally here with a pattern over 1,000,000 iterations gives 3.07 seconds for your first example, but 0.0048 seconds with a compiled regex! — mclayton, Sep 05 '20 at 09:58
@mclayton Interesting. I tried a variation on what you did and I am not seeing such a big improvement. And since I will likely never see a single pattern used more than perhaps a few hundred times it doesn't matter massively. But still perhaps worth playing with. — Gordon, Sep 06 '20 at 17:59
I've added some more details in an answer below. In short, my test was flawed, but once corrected it still shows an improvement, although much more modest than I claimed! — mclayton, Sep 06 '20 at 19:41

Sage Pourpre · Accepted Answer · 2020-09-05T20:31:59.747

The two tests produce different output. The accelerator outputs a lot more text.

Even though that output is not displayed when the code is wrapped in the Measure-Command cmdlet, producing it is still part of the measured time.

Output of $path -match $testPattern

$true

Output of [regex]::Matches($path, $testPattern)


Groups   : {0}
Success  : True
Name     : 0
Captures : {0}
Index    : 0
Length   : 0
Value    :

Emitting output is slow. In your second example, you take care of the accelerator output by assigning it to a variable, so nothing is emitted. That's why it is significantly faster.
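To make the difference in output concrete, here is a minimal sketch that inspects what each expression returns. The values of $path and $testPattern are hypothetical, since the question does not show them.

```powershell
# Hypothetical sample values; the question does not show $path or $testPattern.
$path        = 'C:\Temp\example.txt'
$testPattern = '\.txt$'

# -match returns a single Boolean.
$matchOutput = $path -match $testPattern
$matchOutput.GetType().FullName      # System.Boolean

# [regex]::Matches() returns a MatchCollection holding full Match objects.
$regexOutput = [regex]::Matches($path, $testPattern)
$regexOutput.GetType().FullName      # System.Text.RegularExpressions.MatchCollection
$regexOutput[0].GetType().FullName   # System.Text.RegularExpressions.Match
```

When the result is not captured, each of these objects is sent to the output stream on every iteration, which is likely where the extra time goes.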

You can see the difference without assignment by voiding the outputs. If you do that, you'll see the accelerator is consistently slightly faster.


(Measure-Command {
    foreach ($i in 1..10000) {
        [void]($path -match $testPattern)
    }
}).TotalSeconds
(Measure-Command {
    foreach ($i in 1..10000) {
        [void]([regex]::Matches($path, $testPattern))
    }
}).TotalSeconds

Additional note

[void] is generally more efficient than Command | Out-Null. The pipeline is slower but memory efficient.
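If you want to check this on your own machine, a rough sketch along these lines compares three common ways of discarding output. The sample values and the $timings table are assumptions made for illustration, and the actual numbers will vary with the PowerShell version and hardware.

```powershell
# Hypothetical sample values for illustration only.
$path        = 'C:\Temp\example.txt'
$testPattern = '\.txt$'
$count       = 10000

$timings = [ordered]@{}

# Cast the result to [void].
$timings['[void]'] = (Measure-Command {
    foreach ($i in 1..$count) { [void]($path -match $testPattern) }
}).TotalSeconds

# Assign the result to $null.
$timings['$null ='] = (Measure-Command {
    foreach ($i in 1..$count) { $null = $path -match $testPattern }
}).TotalSeconds

# Pipe the result to Out-Null.
$timings['| Out-Null'] = (Measure-Command {
    foreach ($i in 1..$count) { $path -match $testPattern | Out-Null }
}).TotalSeconds

# Show the elapsed seconds for each approach.
$timings
```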

Interesting. So, because I was comparing either assigning that larger output to a variable (faster) vs invisibly sending it to the console (slower) I got varying results. By sending both results to `[Void]` we get a consistent test. Good to know for the next thing I feel like testing. — Gordon, Sep 05 '20 at 08:56

score 3 · Answer 2 · answered Sep 06 '20 at 19:39

This isn't an answer to the direct question asked, but it's an expansion on the performance of pre-compiled regexes that I mentioned in comments...

First, here's my local performance benchmark for the original code in the question for comparison (with some borrowed text and patterns):

$text    = "foo" * 1e6;
$pattern = "f?(o)";
$count   = 1000000;

# example 1
(Measure-Command {
    foreach ($i in 1..$count) {
        $text -match $pattern
    }
}).TotalSeconds
# 8.010825

# example 2
(Measure-Command {
    foreach ($i in 1..$count) {
        $result = [regex]::Matches($text, $pattern)
    }
}).TotalSeconds
# 6.8186813

And then using a pre-compiled regex, which according to Compilation and Reuse in Regular Expressions compiles the pattern to MSIL code rather than the default "sequence of internal instructions" - whatever that actually means :-).

$text    = "foo" * 1e6;
$pattern = "f?(o)";
$count   = 1000000;

# example 3
$regex = [regex]::new($pattern, "Compiled");
(Measure-Command {
    foreach ($i in 1..$count) {
       $result = $regex.Matches($text)
    }
}).TotalSeconds
# 5.8794981

# example 4
(Measure-Command {
    $regex = [regex]::new($pattern, "Compiled");
    foreach ($i in 1..$count) {
       $result = $regex.Matches($text)
    }
}).TotalSeconds
# 3.6616832

# example 5
# see https://github.com/PowerShell/PowerShell/issues/8976
(Measure-Command {
    & {
        $regex = [regex]::new($pattern, "Compiled");
        foreach ($i in 1..$count) {
            $result = $regex.Matches($text);
        }
    }
}).TotalSeconds
# 1.5474028

Note that Example 3 has a performance overhead of finding / resolving the $regex variable from inside each iteration because it's defined outside the Measure-Command's -Expression scriptblock - see https://github.com/PowerShell/PowerShell/issues/8976 for details.

Example 5 defines the variable inside a nested scriptblock and so is a lot faster. I'm not sure why Example 4 sits in between the two in performance, but it's useful to note there's a definite difference :-)

Also, as an aside, in my comments above, my original version of Example 5 didn't have the &, which meant I was timing the effort required to define the scriptblock, not execute it, so my numbers were way off. In practice, the performance increase is a lot less than my comment suggested, but it's still a decent improvement if you're executing millions of matches in a tight loop...
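The effect of the missing & can be shown with a small sketch. The $state hashtable is a hypothetical counter added only to show whether the inner scriptblock ever runs.

```powershell
# Hypothetical counter used only to observe whether the inner body executes.
$state = @{ Runs = 0 }

# No call operator: the inner braces only create a scriptblock object,
# so the body never runs and the timing covers just the definition.
$null = Measure-Command { { $state.Runs++ } }
$state.Runs   # 0

# With the call operator the inner scriptblock is actually executed.
$null = Measure-Command { & { $state.Runs++ } }
$state.Runs   # 1
```

A quick check like this before benchmarking helps confirm that the timed code is really being executed.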

Woah, super interesting. If I understand that right, PowerShell compiles the nested scriptblock (I can't find a source for that) based on the loop size and line count (20 or fewer). — Sage Pourpre, Sep 06 '20 at 21:19
For example 3 & 4, I thought maybe the fact it was defined outside of the scope in example 3 might be what was hitting the performance (this is after all the only difference between both) but I get identical (practically speaking) results between the 2 on my end. — Sage Pourpre, Sep 06 '20 at 21:21
@SagePourpre - it's not PowerShell compiling anything - it's behaviour built into the .Net Regex class. If you call the constructor with the ```RegexOptions.Compiled``` flag (e.g. https://learn.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.-ctor?view=netcore-3.1#System_Text_RegularExpressions_Regex__ctor_System_String_System_Text_RegularExpressions_RegexOptions_) it emits an assembly that does the actual matching work. — mclayton, Sep 07 '20 at 09:05
I understood that. #4 & #5 both have the `[regex]::new($pattern, "Compiled");` part. Yet, #5 is significantly more performant. I was theorizing that what I was witnessing was the JIT compiler in action. I have seen this a couple of times in unofficial sources: the JIT compiler might compile a PS script shorter than 300 lines when a loop running more than 16 times is detected. My thought was that this mechanism might be the reason why #5 was faster (while both #4 & #5 have the same loop, possibly only the scriptblock was evaluated for JIT compilation). — Sage Pourpre, Sep 07 '20 at 10:30
Maybe that makes no sense... I am unable to find official sources for it. I saw that a couple of times in random places, including Powershell Conference Volume. Do you have a different explanation for why #5 is 2 times faster? If so, I'd be curious to hear about it. — Sage Pourpre, Sep 07 '20 at 10:32
@SagePourpre - see https://stackoverflow.com/a/54776311/3156906 - "you can speed up side-effect-free expressions by simply invoking them via & { ... }" — mclayton, Sep 07 '20 at 10:48
