I have several proxy rule files for Squid, and all contain rules like:

acl blacklisted dstdom_regex ^.*facebook\.* ^.*youtube\.* ^.*games.yahoo.com\.*

The patterns are matched against the destination domain name: dstdom_regex is the Squid ACL type for regular-expression matching on the destination (server) domain.

The objective is to block some websites, but I can't tell which method was intended: blocking by domain name, by keywords in the domain name, or something else.

Let's expand/describe the pattern:

^.*stackexchange\.*     The whole pattern
^                       String beginning
 .*                     Match anything (greedy quantifier, I presume)
   stackexchange        Keyword to match
                \.*     Zero or more literal dots (.)

Totally legitimate matches:

  • stackexchange.com: The Stack Exchange website.
  • stackoverflow.stackexchange: The imaginary Stack Exchange gTLD.

But these possible matches make it seem more like a keyword block:

  • stackexchange
  • stackexchanger
  • notstackexchange
  • not-stackexchange
  • some-website.stackexchange
  • some-website.stackexchange-tld

And the pattern seems to contain a bug: thanks to the \.* at the end, it also accepts strings like the following, most of which are invalid and never occur naturally:

  • stackexchange.
  • stackexchange...
  • stackexchange..........
  • stackexchange.......com
  • stackexchange.com
  • stackexchangecom
  • you get the idea.

Anything containing stackexchange, even if separated by dots from everything else, is still a valid match.
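The lists above can be checked with a quick script. This is only a sketch: it uses Python's re module as a stand-in for Squid's regex library, and it assumes Squid searches for the pattern anywhere in the host name (which re.search mimics). The example.com entry is my own addition as a negative case.

```python
import re

# The rule's pattern, with "stackexchange" as the keyword (as in the question).
PATTERN = re.compile(r"^.*stackexchange\.*")

CANDIDATES = [
    "stackexchange.com",
    "stackoverflow.stackexchange",
    "notstackexchange",
    "some-website.stackexchange-tld",
    "stackexchange.......com",
    "stackexchangecom",
    "example.com",  # negative case, not from the lists above
]


def matches(host: str) -> bool:
    # re.search looks for the pattern anywhere in the string.
    return PATTERN.search(host) is not None


results = {host: matches(host) for host in CANDIDATES}
for host, hit in results.items():
    print(f"{host:35} {'match' if hit else 'no match'}")
```

Every candidate that contains the keyword is accepted; only example.com is rejected.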


So now, the question itself:

This all means that this is simply a match for stackexchange! (I'm assuming the original author didn't intend to match infinite dots.)

So why not just use the pattern stackexchange? Wouldn't it be faster and give the same results, except for the "bug" (\.*)?

I.e., isn't ^.*stackexchange equivalent to stackexchange?
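To make the question concrete, here is a small comparison of the original pattern and the bare keyword. It is a sketch under the same assumptions: Python's re.search stands in for Squid's matcher, the match is an unanchored search, and host names contain no newline characters. The sample host names are arbitrary.

```python
import re

FULL = re.compile(r"^.*stackexchange\.*")  # pattern as found in the rule files
BARE = re.compile(r"stackexchange")        # proposed simplification

SAMPLES = [
    "stackexchange.com",
    "meta.stackexchange.com",
    "notstackexchange",
    "stackexchange...",
    "stackexchangecom",
    "example.com",
    "stack.exchange.com",
]


def same_verdict(host: str) -> bool:
    # Compare only the yes/no outcome, which is all an ACL needs.
    return (FULL.search(host) is not None) == (BARE.search(host) is not None)


all_equal = all(same_verdict(host) for host in SAMPLES)
print("same verdict on every sample:", all_equal)
```

On these samples the two patterns always agree: ^.* can absorb any prefix and \.* can match zero dots, so neither one changes whether a match is found.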


Edit: Just to clarify, I didn't write those proxy rule files.

g4v3

1 Answer

I don't understand why you use \.* to match all the following dots.

However, to work around your problem, you can try this:

  • ^[^\.]*\.stackexchange\.*

[^\.]* matches any run of characters other than a dot; the \. that follows then matches the dot itself.
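As a rough check of what this pattern accepts, here is a sketch using Python's re module in place of Squid's regex library, so treat the results as indicative only. The host names are made up for illustration.

```python
import re

# Suggested pattern, with "stackexchange" as the keyword.
SUBDOMAIN = re.compile(r"^[^\.]*\.stackexchange\.*")


def matches(host: str) -> bool:
    return SUBDOMAIN.search(host) is not None


HOSTS = [
    "meta.stackexchange.com",   # one label, a dot, then the keyword
    "stackexchange.com",        # no label and dot before the keyword
    "notstackexchange.com",     # keyword not preceded by a dot
    "a.b.stackexchange.com",    # two labels before the keyword
    "meta.stackexchanger.com",  # keyword followed by extra characters
]

verdicts = {host: matches(host) for host in HOSTS}
for host, hit in verdicts.items():
    print(f"{host:28} {'match' if hit else 'no match'}")
```

The pattern requires exactly one dot-free label before .stackexchange, so the bare domain and deeper subdomains are not matched, while anything may still follow the keyword.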

edit : formatting

arhr
  • Just to clarify, I didn't write those proxy rule files. I don't understand why the `\.*` is used there either! – g4v3 Jul 07 '16 at 13:08
  • OK, so essentially you just want to test whether the current URL belongs to a certain domain: `^[^\.]*\.stackexchange.*` might be better then. – arhr Jul 07 '16 at 13:10
  • I only know these rules are intended to block some websites, but I don't know whether whoever wrote them meant to block keywords rather than domains. – g4v3 Jul 07 '16 at 13:13
  • Are you sure about the `\ ` in `[^\.]`? Within character classes a `.` doesn't need to be escaped since it has no special meaning. It makes no sense to use it as a wildcard inside a character class. – g4v3 Jul 07 '16 at 21:34
  • `[^\.]` means: any character except `\.`, which is the literal `.` – arhr Jul 08 '16 at 15:18
  • Are you sure? Usually you don't need to escape the `.` inside a character class, but I don't know about Squid. Take a look at this: http://stackoverflow.com/a/19976308 – g4v3 Jul 08 '16 at 17:13