13

I need to build a Regex (.NET syntax) to determine if a string ends with a specific value. Specifically I need to test whether a file has a specific extension (or set of extensions).

The code I'm trying to fix was using:

.*\.(png|jpg|gif)$

which is hideously slow for failed matches in my scenario (presumably due to the backtracking.

Simply removing the .* (which is fine since the API only tests for matches and doesn't extract anything) at the beginning makes the regex much more efficient.

It still feels like it is pretty inefficient. Am I missing something obvious here?

Unfortunately, I don't control the API in question so I need a regex to do this even though I wouldn't normally consider regex to be the right tool for the job.

I also did some tests using the RegexOptions.RightToLeft and found that I could squeeze a little more performance out of my test case with ^.*\.(png|jpg|gif)$, but I can't find a way to specify the RightToLeft option within the string of the regex itself so I don't think I can use it.

Bruno Reis
  • 37,201
  • 11
  • 119
  • 156
  • 1
    Of what magnitude is the slowness of `\.(png|jpg|gif)$`, because I think that that is the simplest you will get with regular expressions. The rest is up to the implementation. – ase Jan 17 '10 at 16:04
  • 2
    Why not just loop through the array of the valid extensions and use [`.EndsWith`](http://msdn.microsoft.com/en-us/library/system.string.endswith(VS.71).aspx)? Simple string comparison is much faster than a regex engine. – kennytm Jan 17 '10 at 16:10
  • How fast does it need to be? In Ruby, one of the slowest languages around, I can compare 348,000 paths per second. I'd pick the simpler regexp not because it's faster, but because it's easier to read and understand. – Wayne Conrad Jan 17 '10 at 16:30
  • @Wayne: It's easier to read and understand *only if* you're forever locked to a few extensions. What if later you want to support `bmp`? `tif`? `tiff`? `apng`? Or to correct that `jpeg` is actually also a widely used JPEG extension? Your regex soon grow into a 1-liner monster that's hard to `diff`, my array is still maintainable and understandable. – kennytm Jan 17 '10 at 19:04
  • @Kenny, good point. In Ruby, regular expressions can be built out of strings quite easily. That certainly colors my perception of what's easy and what's hard. starting with an array of extensions, one can get to a regular expression as simply as this: Regexp.new('.' + extensions.join('|') + '$'). If a regexp can be made out of a string in C#, then the same technique (different syntax) can be used. – Wayne Conrad Jan 17 '10 at 21:56
  • I notice that the second regex in your question starts with the `^` anchor while the first one doesn't. Have you in fact tried it both ways? I would expect the anchored version to fail much more quickly when no match is possible. – Alan Moore Jan 18 '10 at 02:29

5 Answers5

16

I don't have access to C# so I can't try this... but you should be able to avoid too much backtracking by forcing the engine to find the end of the string first, then matching the extensions:

$(?<=\.(gif|png|jpg))

I'm not sure of the effect the look-behind has on performance, though.

Blixt
  • 49,547
  • 13
  • 120
  • 153
  • From my preliminary testing, this is about 2x faster than the approach I was using and roughly on par with the RightToLeft matching I was doing in my test app. I guess that makes sense because it is essentially manually writing the expression as a RightToLeft – NowYouHaveTwoProblems Jan 17 '10 at 20:03
  • Okay, good to know! I'm pretty sure some regular expression engines do optimizations in the compilation stage when they see the `$` anchor, making sure to start checking at the end of the string, but obviously the .NET regular expression engine does not. – Blixt Jan 18 '10 at 09:58
  • 8
    I noticed someone downvoted my answer, but didn't care to explain why... That kind of makes the downvote pointless, doesn't it? Please always explain why you downvote something. – Blixt Jan 18 '10 at 14:33
7

Really, you could also just drop Regex altogether, and use String.EndsWidth, with the following :

var extensions = new String[] { ".png", ".jpg", ".gif" };
extensions.Any(ext => "something".EndsWith(ext));

I usually have the feeling that it ends up being faster to use simple string functions for cases like this rather than trying to find a clever way to use an efficient regex, in terms of runtime and/or development time, unless you are comfortable with and know what is efficient in terms of Regex.

Dynami Le Savard
  • 4,986
  • 2
  • 26
  • 22
4

Make it look specifically for a period instead of any character preceding the extension:

\.(png|jpg|gif)$

This will make it safer (won't match x.xgif) and it will not have to do any backtracking at all until it found a period (as opposed to backtracking on every character).

Guffa
  • 687,336
  • 108
  • 737
  • 1,005
2

If you can change the code, why can't you use something else? You don't control the API, right, but you are changing it anyway. This I really don't understand.

Anyway, why not simply:

var AcceptedExtensions = new List<string>() { "txt", "html", "htm" };
var extension = filename.Substring(filename.LastIndexOf(".") + 1).ToLower();
return AcceptedExtensions.Contains(extension);

The IEnumerable AcceptedExtensions would be loaded from some config, the same way you load your jpg|gif|.... Or it would be a constant, whatever. You just don't need to recreate it each time you are going to use it (I doubt that this would be a bottleneck though).

Bruno Reis
  • 37,201
  • 11
  • 119
  • 156
  • 3
    Yeah that's the obvious way to do it, but I'm pretty sure he's stuck with an API that takes a string that is handled as a regular expression. – Blixt Jan 17 '10 at 16:20
  • True, and I would love to be able to use a string-based approach rather than regex, but I don't have that option in this case. – NowYouHaveTwoProblems Jan 17 '10 at 19:59
  • Why do you say you don't have the option? One option is the code above. Why can't you use it? – Bruno Reis Jan 17 '10 at 20:22
  • 2
    @Bruno Reis: Imagine he has to use an API like: `processFiles()` and the processFiles() fuction is supplied to him in a binary dll. – slebetman Jan 18 '10 at 02:04
  • @slebetman, this seems the only possible answer to me. But it is still bizarre. If we knew more precisely what the restrictions are, maybe a better solution could emerge! – Bruno Reis Jan 18 '10 at 02:07
  • 2
    @Bruno Reis: We know precisely what the restrictions are. He said it in his post: `"I don't control the API in question so I need a regex to do this even though I wouldn't normally consider regex to be the right tool for the job"`. That means he needs to call a function that accepts a string (why not a regexp object I have no idea) and interprets that string as a regexp. He cannot change the source of that function. He's asking for a good regexp to pass to that function. – slebetman Jan 18 '10 at 02:37
1

You probably don't need a regular expression for this... but going with the original question:

Make sure you're using RegexOptions.Compiled to pre-compile the regular expression and then reuse your RegEx object. This avoids setting up the RegEx every time you use it, this will speed things up a lot.

SoapBox
  • 20,457
  • 3
  • 51
  • 87