190

What is going on behind the scenes when you mark a regular expression as one to be compiled? How does this compare/is different from a cached regular expression?

Using this information, how do you determine when the cost of computation is negligible compared to the performance increase?

Chad Birch
  • 73,098
  • 23
  • 151
  • 149
Bob
  • 97,670
  • 29
  • 122
  • 130
  • good resource on Regex best practices: https://learn.microsoft.com/en-us/dotnet/standard/base-types/best-practices – CAD bloke Mar 20 '20 at 19:56

5 Answers5

333

RegexOptions.Compiled instructs the regular expression engine to compile the regular expression expression into IL using lightweight code generation (LCG). This compilation happens during the construction of the object and heavily slows it down. In turn, matches using the regular expression are faster.

If you do not specify this flag, your regular expression is considered "interpreted".

Take this example:

public static void TimeAction(string description, int times, Action func)
{
    // warmup
    func();

    var watch = new Stopwatch();
    watch.Start();
    for (int i = 0; i < times; i++)
    {
        func();
    }
    watch.Stop();
    Console.Write(description);
    Console.WriteLine(" Time Elapsed {0} ms", watch.ElapsedMilliseconds);
}

static void Main(string[] args)
{
    var simple = "^\\d+$";
    var medium = @"^((to|from)\W)?(?<url>http://[\w\.:]+)/questions/(?<questionId>\d+)(/(\w|-)*)?(/(?<answerId>\d+))?";
    var complex = @"^(([^<>()[\]\\.,;:\s@""]+"
      + @"(\.[^<>()[\]\\.,;:\s@""]+)*)|("".+""))@"
      + @"((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
      + @"\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+"
      + @"[a-zA-Z]{2,}))$";


    string[] numbers = new string[] {"1","two", "8378373", "38737", "3873783z"};
    string[] emails = new string[] { "sam@sam.com", "sss@s", "sjg@ddd.com.au.au", "onelongemail@oneverylongemail.com" };

    foreach (var item in new[] {
        new {Pattern = simple, Matches = numbers, Name = "Simple number match"},
        new {Pattern = medium, Matches = emails, Name = "Simple email match"},
        new {Pattern = complex, Matches = emails, Name = "Complex email match"}
    })
    {
        int i = 0;
        Regex regex;

        TimeAction(item.Name + " interpreted uncached single match (x1000)", 1000, () =>
        {
            regex = new Regex(item.Pattern);
            regex.Match(item.Matches[i++ % item.Matches.Length]);
        });

        i = 0;
        TimeAction(item.Name + " compiled uncached single match (x1000)", 1000, () =>
        {
            regex = new Regex(item.Pattern, RegexOptions.Compiled);
            regex.Match(item.Matches[i++ % item.Matches.Length]);
        });

        regex = new Regex(item.Pattern);
        i = 0;
        TimeAction(item.Name + " prepared interpreted match (x1000000)", 1000000, () =>
        {
            regex.Match(item.Matches[i++ % item.Matches.Length]);
        });

        regex = new Regex(item.Pattern, RegexOptions.Compiled);
        i = 0;
        TimeAction(item.Name + " prepared compiled match (x1000000)", 1000000, () =>
        {
            regex.Match(item.Matches[i++ % item.Matches.Length]);
        });

    }
}

It performs 4 tests on 3 different regular expressions. First it tests a single once off match (compiled vs non compiled). Second it tests repeat matches that reuse the same regular expression.

The results on my machine (compiled in release, no debugger attached)

1000 single matches (construct Regex, Match and dispose)

Type        | Platform | Trivial Number | Simple Email Check | Ext Email Check
------------------------------------------------------------------------------
Interpreted | x86      |    4 ms        |    26 ms           |    31 ms
Interpreted | x64      |    5 ms        |    29 ms           |    35 ms
Compiled    | x86      |  913 ms        |  3775 ms           |  4487 ms
Compiled    | x64      | 3300 ms        | 21985 ms           | 22793 ms

1,000,000 matches - reusing the Regex object

Type        | Platform | Trivial Number | Simple Email Check | Ext Email Check
------------------------------------------------------------------------------
Interpreted | x86      |  422 ms        |   461 ms           |  2122 ms
Interpreted | x64      |  436 ms        |   463 ms           |  2167 ms
Compiled    | x86      |  279 ms        |   166 ms           |  1268 ms
Compiled    | x64      |  281 ms        |   176 ms           |  1180 ms

These results show that compiled regular expressions can be up to 60% faster for cases where you reuse the Regex object. However in some cases can be over 3 orders of magnitude slower to construct.

It also shows that the x64 version of .NET can be 5 to 6 times slower when it comes to compilation of regular expressions.


The recommendation would be to use the compiled version in cases where either

  1. You do not care about object initialization cost and need the extra performance boost. (note we are talking fractions of a millisecond here)
  2. You care a little bit about initialization cost, but are reusing the Regex object so many times that it will compensate for it during your application life cycle.

Spanner in the works, the Regex cache

The regular expression engine contains an LRU cache which holds the last 15 regular expressions that were tested using the static methods on the Regex class.

For example: Regex.Replace, Regex.Match etc.. all use the Regex cache.

The size of the cache can be increased by setting Regex.CacheSize. It accepts changes in size any time during your application's life cycle.

New regular expressions are only cached by the static helpers on the Regex class. If you construct your objects the cache is checked (for reuse and bumped), however, the regular expression you construct is not appended to the cache.

This cache is a trivial LRU cache, it is implemented using a simple double linked list. If you happen to increase it to 5000, and use 5000 different calls on the static helpers, every regular expression construction will crawl the 5000 entries to see if it has previously been cached. There is a lock around the check, so the check can decrease parallelism and introduce thread blocking.

The number is set quite low to protect yourself from cases like this, though in some cases you may have no choice but to increase it.

My strong recommendation would be never pass the RegexOptions.Compiled option to a static helper.

For example:

// WARNING: bad code
Regex.IsMatch("10000", @"\\d+", RegexOptions.Compiled)

The reason being that you are heavily risking a miss on the LRU cache which will trigger a super expensive compile. Additionally, you have no idea what the libraries you depend on are doing, so have little ability to control or predict the best possible size of the cache.

See also: BCL team blog


Note : this is relevant for .NET 2.0 and .NET 4.0. There are some expected changes in 4.5 that may cause this to be revised.

Quirin F. Schroll
  • 1,302
  • 1
  • 11
  • 25
Sam Saffron
  • 128,308
  • 78
  • 326
  • 506
  • 12
    Great answer. For my own purposes I often use `Compiled` in website code where I'm actually storing a static (application-wide) `Regex` object. So the `Regex` only has to be constructed once when IIS starts the application, and then is reused thousands of times. This works well as long as the application doesn't restart frequently. – Steve Wortham Mar 17 '12 at 19:53
  • 2
    W00! This information helped me speed my process up from 8-13 hours, to ~30 minutes. Thank you! – Robert Christ Jul 25 '14 at 20:46
  • 3
    Great answer Sam, can you update about what has changed in version > 4.5 ? (I know you changed your stack a while back...) – gdoron Feb 08 '17 at 12:40
  • @gdoronissupportingMonica There has been some Regex performance improvements on NET 5.0. I saw a blog post for this. You can check it out [here](https://devblogs.microsoft.com/dotnet/regex-performance-improvements-in-net-5/) – kapozade May 12 '20 at 09:34
  • As a performance architect, I strongly disagree with the ending assertion you've made here. The mechanical sympathy of what's happening on your machine when you do this test seems to be lost. What you're talking about is completely saturating core(s) in doing so. If, in theory, I could dedicate an entire CPU to just doing regexes then I'd 100% agree. Since this is never the case, I strongly disagree. Always compile regular expressions or suffer the performance consequences. – Jeff Fischer Mar 31 '23 at 22:43
50

This entry in the BCL Team Blog gives a nice overview: "Regular Expression performance".

In short, there are three types of regex (each executing faster than the previous one):

  1. interpreted

    fast to create on the fly, slow to execute

  2. compiled (the one you seem to ask about)

    slower to create on the fly, fast to execute (good for execution in loops)

  3. pre-compiled

    create at compile time of your app (no run-time creation penalty), fast to execute

So, if you intend to execute the regex only once, or in a non-performance-critical section of your app (i.e. user input validation), you are fine with option 1.

If you intend to run the regex in a loop (i.e. line-by-line parsing of file), you should go with option 2.

If you have many regexes that will never change for your app and are used intensely, you could go with option 3.

KyleMit
  • 30,350
  • 66
  • 462
  • 664
Tomalak
  • 332,285
  • 67
  • 532
  • 628
  • 1
    [Number 3 could be made easy through a custom roslyn `CompileModule`](https://github.com/dotnet/corefx/issues/340#issuecomment-68572335). Damn, I need to take a deeper look into the new plattform. – Christian Gollhardt Oct 13 '17 at 18:43
  • You have not made clear that your statement "if you intend to execute the regex only once, or in a non-performance-critical section of your app (i.e. user input validation)" is predicated on low throuhput scenarios. You should have said "If you have very low throughput and if you intend to execute...". If it's high throughput, you never want to go with option 1. – Jeff Fischer Mar 31 '23 at 22:48
9

It should be noted that the performance of regular expressions since .NET 2.0 has been improved with an MRU cache of uncompiled regular expressions. The Regex library code no longer reinterprets the same un-compiled regular expression every time.

So there is potentially a bigger performance penalty with a compiled and on the fly regular expression. In addition to slower load times, the system also uses more memory to compile the regular expression to opcodes.

Essentially, the current advice is either do not compile a regular expression, or compile them in advance to a separate assembly.

Ref: BCL Team Blog Regular Expression performance [David Gutierrez]

KyleMit
  • 30,350
  • 66
  • 462
  • 664
Robert Paulson
  • 17,603
  • 5
  • 34
  • 53
3

This does not answer the question but I recommend doing this:

[GeneratedRegex($@"MyPatter")]
public partial Regex Regex_SomeRegex();

That way you get best of both worlds. It will be fast at initialization because its created at compile time. And it will also be fast when using it.

xhafan
  • 2,140
  • 1
  • 26
  • 26
Tono Nam
  • 34,064
  • 78
  • 298
  • 470
  • 1
    More background infos: https://devblogs.microsoft.com/dotnet/regular-expression-improvements-in-dotnet-7/#source-generation – n0099 Feb 16 '23 at 22:26
  • GeneratedRegex looks like the way to go from .NET 7, even Resharper recommends refactoring Regex code into this. – xhafan May 04 '23 at 10:47
2

Here's some further reading

  1. Base Class Library Team on compiled regex

  2. Coding Horror, referencing #1 with some good points on the tradeoffs

KyleMit
  • 30,350
  • 66
  • 462
  • 664
Rob McCready
  • 1,909
  • 1
  • 19
  • 20