18

Every time I have to do simple containment or replacement operations on strings, where the term that I'm searching for is a fixed value, I find that if I take my sample input and do some profiling on it, using a compiled regular expression is nearly* always faster than using the equivalent method from the String class.

I've tried comparing a variety of methods ( hs is the "haystack" to search, ndl is the "needle" to search for, repl is the replacement value. regex is always created with the RegexOptions.Compiled option ):

  • hs.Replace( ndl, repl ) vs regex.Replace( hs, repl )
  • hs.Contains( ndl ) vs regex.IsMatch( hs )

I've found quite a few discussions focusing on which of the two techniques are faster (1, 2, 3, and loads of others), but those discussions always seem to focus on:

  1. Use the string version for simple operations and regex for complex operations (which, from a raw performance perspective, doesn't even seem to be necessarily a good idea), or
  2. Run a test and compare the two ( and for equivalent tests, the regex version seems to always perform better ).

I don't understand how this can possibly be the case: how does the regex engine compare any two strings for substring matches faster than the equivalent string version? This seems to hold true for search spaces that are very small or very large, or search terms that are small or large, or whether the search term occurs early or late in the search space.

So, why are regular expressions faster?


* In fact, the only case I've managed to show that the string version is faster than a compiled regex is when searching an empty string! Any other case, from single character strings to very long strings are processed faster by a compiled regex than the equivalent string method.


Update: Added a clause to clarify that I'm looking at cases where the search term is known at compile time. For dynamic or one-time operations, the overhead of compiling the regular expression will tend to skew the results in favor of the string methods.

Community
  • 1
  • 1
Chris Phillips
  • 11,607
  • 3
  • 34
  • 45
  • I believe your better question might be "If regex is faster, why don't the String methods just use regex?". Anyway, my only guess would be something involving either that 1. the regex compiling takes some additional time you're not accounting for, or 2. regex compiling/processing eats more memory or something, and you're getting a space-time tradeoff. – zebediah49 Sep 14 '12 at 16:54
  • @zebediah49 I think you're right that my question leads into that question - but I'm really interested in the why or how regexes are performing better. – Chris Phillips Sep 14 '12 at 16:56
  • Where are the close votes coming from? – Chris Phillips Sep 14 '12 at 17:30
  • 1
    @chris, the close votes are coming from the pharisees and scribes. The true vision of go_...uh, I mean Joel ... was that the rules would be used the make the site more useful, not less. – FastAl Sep 14 '12 at 17:55
  • @zebediah49: I think Chris' edit answers that: string methods must work equally well for dynamic one-time operations, where the overhead of creating a Boyer Moore table and compiling it would be much more expensive than a simple linear search – Niki Sep 14 '12 at 18:14

3 Answers3

14

I don't understand how this can possibly be the case: how does the regex engine compare any two strings for substring matches faster than the equivalent string version?

I can think of two reasons:

  1. The regex is using some smart algorithm like Boyer Moore (O(M/N)) while the simple string operation simply compares the needle to each position in the haystack (O(N*M)).
  2. They're not really doing the same thing. For example, one might do culture-invariant matching while the other does culture-dependent matching, which might make a performance difference.
Erre Efe
  • 15,387
  • 10
  • 45
  • 77
Niki
  • 15,662
  • 5
  • 48
  • 74
6

As the Base Class Library team wrote:

In [the case of RegexOptions.Compiled], we first do the work to parse into opcodes. Then we also do more work to turn those opcodes into actual IL using Reflection.Emit. As you can imagine, this mode trades increased startup time for quicker runtime: in practice, compilation takes about an order of magnitude longer to startup, but yields 30% better runtime performance.

But, you're overloking one important thing: Pattern is fixed. Be aware that this isn't always the case. You can't change it at runtime! There will be cases in which flexibility will go down for more than the 30% of the performance gain.

Erre Efe
  • 15,387
  • 10
  • 45
  • 77
  • That's a great link, thanks! I'm aware that the pattern is fixed -- I updated the question after zebediah49's comment to try and make that more clear. – Chris Phillips Sep 14 '12 at 17:26
1

From my own experience, whenever I've benchmarked regex solutions vs custom parsers with simple string operations, the regex solution is a bit slower. That's because they often suffer from backtracking.

But out of curiosity I wrote a little test to see if what you say really is true in the simplest of examples.

Here's my test...

    private static Regex RegexSearch = new Regex("Audi A4", RegexOptions.Compiled);

    static void Main(string[] args)
    {            
        // warm up the JIT compiler
        FoundWithRegexIsMatch();
        FoundWithStringContains();
        FoundWithStringIndexOf();
        // done warming up

        int iterations = 100;
        var sw = new Stopwatch();

        sw.Restart();
        for (int i = 0; i < iterations; i++)
        {
            FoundWithRegexIsMatch();
        }
        sw.Stop();
        Console.WriteLine("Regex.IsMatch: " + sw.ElapsedMilliseconds + " ms");


        sw.Restart();
        for (int i = 0; i < iterations; i++)
        {
            FoundWithStringIndexOf();
        }
        sw.Stop();
        Console.WriteLine("String.IndexOf: " + sw.ElapsedMilliseconds + " ms");


        sw.Restart();
        for (int i = 0; i < iterations; i++)
        {
            FoundWithStringContains();
        }
        sw.Stop();
        Console.WriteLine("String.Contains: " + sw.ElapsedMilliseconds + " ms");
    }

    private static bool FoundWithStringContains()
    {
        return Resource2.csv.Contains("Audi A4");
    }

    private static bool FoundWithStringIndexOf()
    {
        return Resource2.csv.IndexOf("Audi A4") >= 0;
    }

    private static bool FoundWithRegexIsMatch()
    {
        return RegexSearch.IsMatch(Resource2.csv);
    }

I'm testing against a CSV file I have which is about 1.2 MB. And here are the results:

  • Regex.IsMatch: 172 ms
  • String.IndexOf: 172 ms
  • String.Contains: 185 ms

Indeed you are correct. When you're not doing anything fancy in the regular expression, the regular expression is a bit faster than a String.Contains operation. However, I found that there's a tiny bit of overhead in String.Contains and calling String.IndexOf is faster and actually matches the time of Regex.IsMatch to the millisecond.

These identical timings are suspicious. I wonder if during compilation it's determined that this doesn't actually need to go through the Regex engine (since there are no regex-specific instructions in the pattern I used above). Instead it may be simplified so that Regex.IsMatch makes the same call to FindNLSString from kernel32.dll as IndexOf does. That's just a guess.

Community
  • 1
  • 1
Steve Wortham
  • 21,740
  • 5
  • 68
  • 90
  • Interesting. I didn't actually test IndexOf with my recent batch of tests -- it's not the (semantically) intuitive method for these types of operations. – Chris Phillips Sep 14 '12 at 20:39
  • @ChrisPhillips - I agree. It looks like the main thing making `Contains` slower is its use of `StringComparison.Ordinal`. You can add that to the `IndexOf` example I wrote and get a similar time. – Steve Wortham Sep 14 '12 at 21:01
  • @ChrisPhillips - I did a few more tests and found more situations where `Regex.IsMatch` is faster than even a call to `String.IndexOf`. So I don't think my theory could be correct. – Steve Wortham Sep 17 '12 at 22:08
  • Your regex isn't really a regex but just a string. Something like "[Aa]udi A\d" should find the same thing but in a more regex-y way – Sten Petrov Apr 21 '17 at 18:04