4

I have this regex:

regex = /(Si.ges[a-zA-Z\W]*avec\W*fonction\W*m.moires)/i

And when I use it on some, but not all, texts e.g. this one:

text = "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\",Keyless Access avec alarme : Système de verrouillage/déverrouillage et de démarrage sans clé,Park Assist: Système d'assistance au stationnement en créneauet et en bataille,Rear Assist: Caméra de recul avec visualisation de la zone situ"

like so: text.match(regex), then ruby just runs in what seems like an infinite loop - but why? And is there anyway to guard against this, e.g. by having ruby throw an exception instead - without using the Timeout as it is a known issue when using it with Sidekiq (https://github.com/mperham/sidekiq/wiki/Problems-and-Troubleshooting#add-timeouts-to-everything)

ruby version: 2.7.2

Niels Kristian
  • 8,661
  • 11
  • 59
  • 117
  • 1
    Looks like an issue with the positive character class, and using the negated version will solve it, `/Si.ges[^\d_]*avec\W*fonction\W*m.moires/i` – Wiktor Stribiżew Feb 10 '22 at 12:23
  • 2
    There is no match for `(Si.ges[a-zA-Z\W]*avec\W*fonction\W*m.moires)` – The fourth bird Feb 10 '22 at 12:37
  • 2
    @Thefourthbird Sure, that is why there is a problem. – Wiktor Stribiżew Feb 10 '22 at 12:41
  • 2
    It would be interesting to condense this into an [mcve], with an emphasis on the word "minimal". What is the shortest/simplest string and regex you can find to replicate the issue? – Tom Lord Feb 10 '22 at 12:56
  • 1
    Is it an infinite loop or is it just a very long loop? – Jörg W Mittag Feb 10 '22 at 13:02
  • 2
    @WiktorStribiżew Ah, catastrophic backtracking probably, but I would not expect that. Do you know why that is? – The fourth bird Feb 10 '22 at 13:04
  • 2
    @Thefourthbird It does not look like catastrophic backtracking (at least in its usual form). The `i` modifier and `\W` in the character class seems to cause the issue. [Here](https://bugs.ruby-lang.org/issues/4044), I believe, we can find something to explain this. Probably, related to "*mutiple Case Unfold definitions in unicode.c*". – Wiktor Stribiżew Feb 10 '22 at 13:13
  • 1
    So, the regex is surely not defined as it was intended to work and I can solve that with the colleague of mine who wrote it - but - what really concerns me here is I would not have expected that you could end up in a death spiral like this without getting at least an error, just because you write a poor regex and used it on a particular string. It’s kind of scary I would say... – Niels Kristian Feb 10 '22 at 13:26
  • @WiktorStribiżew: interesting link, but all the bugs related in this page seems to be solved in more recent Ruby versions (at least since 2.5 on which I tested them). I think the fourth bird theory may be correct: `a.*b.*c` is clearly a pathological pattern. – Casimir et Hippolyte Feb 10 '22 at 15:16
  • @CasimiretHippolyte `a.*b.*c` is pathological as `.` matches `a`, `b` and `c`, but `a\W*b\W*c` should not be pathological as `\W` does not match these letters. At any rate, removing `i` flag solves the whole problem. – Wiktor Stribiżew Feb 10 '22 at 15:19
  • @WiktorStribiżew: I read too quicly, I thought it was `a[\Wa-z]*b[\Wa-z]*c`, so something very near from `a.*b.*c` but it isn't the case. After several tests with only `/Si.gesX*avec\W*fonction/i` (with X the character class), the problem occurs when `\W` is with `a-z` (whatever the way you find to hide the fact there is the full `a-z` range), but the problem disappears if you remove a letter from the range. Same problem with `\W\p{L}`. Even with a succeeding pattern, the result is very slow. – Casimir et Hippolyte Feb 10 '22 at 16:32
  • Yeah, I came to the same conclusion - the point is - is this a bug in ruby's regex implementation, because it does not seem to be the same problem with e.g. running the regex match in C++... – Niels Kristian Feb 11 '22 at 08:55
  • I didn't go through all the comments, but could it have something to do with this section here? [a-zA-Z\W] \W means [^a-zA-Z0-9_] so you may see how it could generate a conflict. – Odd One Out Feb 14 '22 at 19:15

1 Answers1

1

Built-in character classes are more table-driven.
Given that, Negative built-in ones like \W, \S etc...
are difficult for engines to merge into a positive character class.

In this case, there are some obvious bugs because as you've said, it doesn't time out on
some target strings.

In fact, [a-xzA-XZ\W] works given the sample string. It times out when Y is included anywhere
but just for that particular string.

Let's see if we can determine if this is a bug or not.

First, some tests:

Test - Fail [a-zA-Z\W]

https://rextester.com/FHUQG84843

# Test - Fail  [a-zA-Z\W]
puts "Hello World!";
regex = /(Si.ges[a-zA-Z\W]*avec\W*fonction\W*m.moires)/ui;
text = "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\",Keyless Access avec alarme : Système de verrouillage/déverrouillage et de démarrage sans clé,Park Assist: Système d'assistance au stationnement en créneauet et en bataille,Rear Assist: Caméra de recul avec visualisation de la zone situ";
res = text.match(regex);
puts "Done";

Test - Pass [a-xzA-XZ\W]

https://rextester.com/RPV28606

Test - Pass [a-zA-Z\P{Word}]

https://rextester.com/DAMW9069


Conclusion: Report this as a BUG.
IMO this is a BUG with their built-in class \W which is engine defined,
since \P{Word} is a Unicode property defined function, not a range.
And we see that [a-zA-Z\P{Word}] works just fine.
Use \P{Word} inside classes as a temporary workaround.

In reality when modern-day engines were first designed, the logic of what
a negative class was [^] each item is AND NOT which when combined with a positive
class where each item is ORed results in errors in scope.
Perl had class errors still a short time ago.

sln
  • 2,071
  • 1
  • 3
  • 11