
I need a regex that will select only those URL strings NOT ending with specific extensions like .png or .css.

I tested the following:

1) this one using negative lookbehind:

(?<!\.png|\.css)$

https://regex101.com/r/tW4fO5/1

2) this other one using negative lookahead:

^(?!.*[.]png|.*[.]css$).*$

https://regex101.com/r/qZ7vA4/1

Both seem to work fine, but regex101 reports that #1 (negative lookbehind) is processed in 436 steps (see the link), while #2 (negative lookahead) is processed in 173 steps.

So my question is: what does that mean? Is it going to have an impact on performance?

And lastly, are the two regex really functionally equivalent?

EDIT: SOLUTION SUMMARY

Just to wrap things up, here is the solution for the full list of string endings to be excluded by the regex (a typical scenario is a web server setup where static resources are served by Apache while dynamic content is served by a different engine; in my case, php-fpm).

Two options are possible with PCRE regex:

1) negative lookbehind

$(?<!\.(?:ico|gif|jpg|png|css|rss|xml|htm|pdf|zip|txt|ttf)$|\.(?:js|gz)$|\.(?:html|woff)$)

https://regex101.com/r/eU9fI6/1

Notice that I used several OR-ed alternatives inside the lookbehind, because PCRE requires each alternative of a lookbehind to be fixed-width (i.e. you cannot mix extensions of different lengths within one alternative). This makes this option slightly more complex to write. Moreover, in my opinion, it lowers its performance.

2) negative lookahead

^(?!.*[.](?:js|ico|gif|jpg|png|css|rss|xml|htm|html|pdf|zip|gz|txt|ttf|woff)$).*$

https://regex101.com/r/dP7uD9/1

The lookahead is slightly faster than the lookbehind. This is a test result from making 1 million iterations:

time lookbehind = 18.469825983047 secs
time lookahead = 14.316685199738 secs

If I did not have the issue of variable-length patterns, I would pick the lookbehind since it looks more compact. Either one is good anyway. In the end, I went with the lookahead:

<LocationMatch "^(?!.*[.](?:js|ico|gif|jpg|png|css|rss|xml|htm|html|pdf|zip|gz|txt|ttf|woff)$).*$">
    SetHandler "proxy:unix:/var/run/php5-fpm.sock|fcgi://www/srv/www/gioplet/web/public/index.php"
</LocationMatch>
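
As a rough cross-check of the two options above, here is a minimal Python sketch. Python's re module is an assumption on my part (the timings above came from PHP/PCRE), and it is stricter than PCRE: every alternative inside one lookbehind must have the same width, so the lookbehind is split into one assertion per extension length. The sample URLs and variable names are made up for illustration.

```python
import re

EXTENSIONS = ["js", "ico", "gif", "jpg", "png", "css", "rss", "xml", "htm",
              "html", "pdf", "zip", "gz", "txt", "ttf", "woff"]

# Option 2: a single anchored negative lookahead; extension lengths may differ.
lookahead = re.compile(r"^(?!.*[.](?:%s)$).*$" % "|".join(EXTENSIONS))

# Option 1, adapted for Python: group extensions by length and chain one
# negative lookbehind per length right after the end-of-string anchor.
by_length = {}
for ext in EXTENSIONS:
    by_length.setdefault(len(ext), []).append(ext)
lookbehind = re.compile(
    "$" + "".join(r"(?<!\.(?:%s))" % "|".join(exts) for exts in by_length.values())
)

# Mixing widths inside one lookbehind is rejected by Python's re module.
try:
    re.compile(r"$(?<!\.(?:js|html))")
    mixed_width_error = None
except re.error as exc:
    mixed_width_error = str(exc)

# Hypothetical sample URLs, not taken from the original test data.
urls = [
    "http://gioplet/articles",
    "http://gioplet/index.php",
    "http://gioplet/img/logo.png",
    "http://gioplet/js/app.js",
    "http://gioplet/index.html",
    "http://gioplet/fonts/a.woff",
    "http://gioplet/foojs",  # ends in "js" but has no dot before it
]

for url in urls:
    ahead = lookahead.search(url) is not None
    behind = lookbehind.search(url) is not None
    print(f"{url}: lookahead={ahead} lookbehind={behind}")

print("mixed-width lookbehind:", mixed_width_error)
```

Both patterns should agree on every URL; the last one is there to check that the dot before the extension is really required.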
Laurel
Timido
  • in what regex flavor, platform and data set did you run your tests? – Ricardo Aug 27 '19 at 23:20
  • I tested using a simple loop with php (PREG) on Linux. At the time I was using php 5. Data set was a bunch of urls (strings) with those different endings – Timido Aug 29 '19 at 04:39

2 Answers


Is it going to have an impact on performance?

In most cases, the more steps a regex needs to find a match, the slower it is. However, it also depends on the platform where you will later use the regex: the step count that regex101.com reports for one engine does not carry over to another. For example, if you test a regex intended for .NET at regex101.com and a lazy dot matching pattern fails there with catastrophic backtracking on a long text, that does not mean the same will happen in .NET.

Are the two regex really functionally equivalent?

No, they aren't. (?<!\.png|\.css)$ matches the end of a line that is not preceded by .png or .css. ^(?!.*[.]png|.*[.]css$).*$ matches only lines that neither contain .png anywhere nor end with .css, because the $ applies to the second alternative only. To make them "equivalent" (that is, to make sure that lines ending with .png or .css are not matched), use

^(?!.*[.](?:png|css)$).*$
         ^^^^^^^^^^^^

Make sure the $ is checked after both png and css in the negative lookahead.

There will still be the difference between the regexps: the first will just match the end of the line, and the second will match the whole line.
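
To make the difference concrete, here is a small Python sketch (Python's re module is my assumption; the question is about PCRE, though these three patterns behave the same way in both). The URLs are hypothetical; the one ending in .png.html is where the original lookahead disagrees with the other two patterns.

```python
import re

lookbehind = re.compile(r"(?<!\.png|\.css)$")                 # Pattern 1
lookahead_orig = re.compile(r"^(?!.*[.]png|.*[.]css$).*$")    # Pattern 2 as asked
lookahead_fixed = re.compile(r"^(?!.*[.](?:png|css)$).*$")    # corrected Pattern 2


def is_match(pattern, url):
    """Return True if the pattern matches somewhere in the URL."""
    return pattern.search(url) is not None


# Hypothetical sample URLs.
urls = [
    "http://gioplet/index.php",
    "http://gioplet/img/logo.png",
    "http://gioplet/css/main.css",
    "http://gioplet/logo.png.html",  # contains .png but does not end with it
]

results = {
    url: (
        is_match(lookbehind, url),
        is_match(lookahead_orig, url),
        is_match(lookahead_fixed, url),
    )
    for url in urls
}

for url, (behind, orig, fixed) in results.items():
    print(f"{url}: lookbehind={behind} original={orig} fixed={fixed}")

# What each pattern actually returns as the matched text:
print(repr(lookbehind.search(urls[0]).group()))       # empty string at the end
print(repr(lookahead_fixed.search(urls[0]).group()))  # the whole URL
```

The last two lines illustrate the remaining difference: the lookbehind pattern matches only the empty position at the end, while the lookahead pattern returns the whole line.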

Is there a way to speed up the lookbehind solution?

Note that the lookbehind in Pattern 1 is checked at each location inside the string, whereas the lookahead in Pattern 2 is checked only once, at the very beginning of the string. That is why an anchored lookahead solution will be faster under one condition: you cannot use a RightToLeft modifier, which is available in only a few regex flavors (e.g. .NET).

The $(?<!\.(?:png|css)$) lookbehind solution is faster than Pattern 1 because the lookbehind is checked just once, after the end of the string/line is reached. Still, it takes a few more steps than the lookahead, because a lookbehind is costlier to execute than a lookahead.

To really find out which solution is fastest, you need to set up performance tests in your environment.
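
A minimal sketch of such a test, assuming Python's re and timeit modules rather than your actual target engine (the numbers will differ in PCRE or Apache, so treat this only as a template). The sample URLs, the `run` helper and the iteration counts are arbitrary choices for illustration.

```python
import re
import timeit

patterns = {
    "lookbehind, unanchored": re.compile(r"(?<!\.png|\.css)$"),
    "lookbehind after $": re.compile(r"$(?<!\.(?:png|css)$)"),
    "lookahead, anchored": re.compile(r"^(?!.*[.](?:png|css)$).*$"),
}

# Hypothetical sample data: two URLs that should match, two that should not.
urls = [
    "http://gioplet/articles",
    "http://gioplet/img/logo.png",
    "http://gioplet/index.php",
    "http://gioplet/css/main.css",
]


def run(pattern):
    """Test every sample URL against one precompiled pattern."""
    return [pattern.search(url) is not None for url in urls]


timings = {}
for name, pattern in patterns.items():
    # Take the best of three runs to reduce noise from other processes.
    timings[name] = min(timeit.repeat(lambda: run(pattern), number=20000, repeat=3))
    print(f"{name}: {timings[name]:.3f} s")
```

Precompiling the patterns keeps compilation cost out of the measurement, so only the matching itself is timed.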

Wiktor Stribiżew
  • Thank you for helping. I'm studying your suggestion to make them equivalent. As for the platform I will use them on, it is in the VirtualHost section of the Apache configuration. – Timido Feb 18 '16 at 10:18
  • If you just need to *test*, check if the pattern matches a string, both will work. If you need to actually *get* the string, use the 2nd pattern. – Wiktor Stribiżew Feb 18 '16 at 10:28
  • Great, thank you, I perfectly understand my mistake in the #2 (lookahead). I only need to _test_ I guess I should choose the lookahead one since it seems to be faster (less steps) at least theoretically -- still studying the link suggested by vks – Timido Feb 18 '16 at 10:32
  • He linked to the question discussing *atomic groups*. You are not using `(?>...)` atomic groups. Yes, lookarounds in PCRE and most other flavors are atomic, but I know from experience that even if the number of steps does not directly correlate with performance, the regex taking 1000 steps is most probably slower than the regex taking 10 steps to complete (oh yes, it is possible to write 2 regexps matching the same text but with different productivity :() – Wiktor Stribiżew Feb 18 '16 at 10:38
  • The real difference is that the lookbehind in Pattern 1 is checked *at each location* inside the string. The lookahead in the second pattern is only checked *once*, at the very beginning. That is why an anchored lookahead solution will be faster UNDER one condition - if you cannot use a RightToLeft modifier that is only available in few regex flavors (e.g. .NET). Not available in Apache Virtual Host, I guess. and just FYI, check [`$(?<!\.(?:png|css)$)`](https://regex101.com/r/hX4kR8/2) lookbehind solution. See the difference? The lookbehind is also checked just once. Still takes more steps. – Wiktor Stribiżew Feb 18 '16 at 10:54
  • The last lookbehind version you are suggesting (`$(?<!\.(?:png|css)$)`) takes 3 times less steps so it's definitely faster (tested). It is actually comparable to the lookahead (if not even slightly faster). Unfortunately, it is not good to use as I'm discovering right now that lookbehind requires fixed-width pattern. That is, I cannot use it to exclude the following list of extensions: `js|ico|gif|jpg|png|css|rss|xml|htm|html|pdf|zip|gz|txt` – Timido Feb 18 '16 at 13:57
  • You still can use several lookbehinds in an alternation. It won't be that efficient, but it still is a workaround. Try `$(?:(?<!\.(?:png|css)$)|(?<!js$))`.... – Wiktor Stribiżew Feb 18 '16 at 14:24
  • Several lookbehinds in OR, nice one. It should be `$(?<!\.(?:png|css)$|(?:js)$)` though. – Timido Feb 18 '16 at 15:20
  • Quite possible, I cannot test what I type on a mobile. You got the gist however :-) – Wiktor Stribiżew Feb 18 '16 at 15:35

The second one (the lookahead) is faster. Remember that the number of steps is not the correct way to measure performance. See the Stack Overflow question: atomic-groups-clarity.

I tested in Python using timeit. The script is:

import timeit

# Lookbehind version (outer raw string keeps the regex backslashes intact)
s1 = r"""
import re
re.findall(r"^.*(?<!\.png|\.css)$", x, re.M)"""

# Lookahead version
s2 = """
import re
re.findall(r"^(?!.*[.]png$|.*[.]css$).*$", x, re.M)"""

# Sample data shared by both runs: four URLs, one per line
setup = 'x="""http://gioplet/articles\nhttp://gioplet/img/logo.png\nhttp://gioplet/index.php\nhttp://gioplet/css/main.css"""'

print(timeit.timeit(s1, number=1000000, setup=setup))

print(timeit.timeit(s2, number=1000000, setup=setup))

Output:

8.72536265265
7.09159428305
Graham P Heath
vks