4

I have a regexp:

import re

regexp = re.compile(r'^(?P<parts>(?:[\w-]+/?)+)/$')

It matches a string like foo/bar/baz/ and put the foo/bar/baz in a group named parts (the /? combined with the /$ support this).

This works perfectly fine, until you match a string that doesn't end in a slash. Then, it gets slower at a seemingly-exponential rate with each new char you add to the string you're matching.

Example

# This is instant (trailing slash)
regexp.match('this-will-take-no-time-at-all/')

# This is slow
regexp.match('this-takes-about-5-seconds')

# This will not finish
regexp.match('this-probably-will-not-finish-until-the-day-star-turns-black')

I'm trying to understand why this specific recursion issue only happens when the /$ (trailing slash) isn't in the string (i.e., a non-match). Could you help me understand the control flow of the underlying algorithm in both the trailing slash and the non trailing slash cases?

Note

I'm not looking for a solution for my desired pattern. I'm trying to understand the specific regexp.

orokusaki
  • 55,146
  • 59
  • 179
  • 257
  • 1
    paste your pattern and example to [regex101.com](https://regex101.com/#python), switch to the debugger and uncheck the internal optimizations - then have a look :) the engine tries around with all possible combinations of both +-repetitions, trying to find a match somehow. – Sebastian Proske Mar 27 '16 at 14:07
  • @SebastianProske thanks for the suggestion, but regex101.com simply told me instantly in the debugger that the pattern simply didn't match, which means that their engine (JS?) didn't use the same algorithm as Python (which would have never finished). – orokusaki Mar 27 '16 at 14:10
  • 3
    Again a pattern like `(A+)+`. It is catastrophic backtracking that causes the issue. I guess you need `^(?P[\w-]+(?:/[\w-]+)*)/$` – Wiktor Stribiżew Mar 27 '16 at 14:10
  • @orokusaki you should [turn off the internal optimizations](http://imgur.com/rBZqnNW) and you see what happens. – Sebastian Proske Mar 27 '16 at 14:11
  • Similar situation: http://stackoverflow.com/questions/22650098/very-slow-regular-expression-search with several great explanations. – alecxe Mar 27 '16 at 14:33

2 Answers2

3

It is getting slow due to catastrophic backtracking in your regex:

You can fix catastrophic backtracking by using this regex:

^(?P<parts>(?:[\w-]+/)*[\w-]+)/$

As per the link above:

The solution to avoid catastrophic backtracking is simple. When nesting repetition operators, make absolutely sure that there is only one way to match the same match.

anubhava
  • 761,203
  • 64
  • 569
  • 643
  • 1
    It would be nice to know _why_ this fixes catastrophic backtracking. – Kenneth K. Mar 27 '16 at 14:20
  • 1
    @Kenneth K, it fixes the backtracking, because the new expression can only match his desired target _one way_, so it won't try to match a failing expression different ways. – Kyle A Mar 27 '16 at 14:24
  • The link I've provided has great explanation on **Catastrophic Backtracking** and also ways to avoid it. I am quoting from above link **The solution is simple. When nesting repetition operators, make absolutely sure that there is only one way to match the same match** – anubhava Mar 27 '16 at 14:26
  • 1
    According to _Always quote the most relevant part of an important link, in case the target site is unreachable or goes permanently offline._ you should actually do so ;) – Sebastian Proske Mar 27 '16 at 14:29
1

Wiktor Stribizew is correct. The issue is the question mark after the slash inside the repeating pattern. So, the pattern you gave:

'^(?P<parts>(?:[\w-]+/?)+)/$'

Says, look for one or more groups of one or more word characters or dashes possibly followed by a slash, then there should be a slash at the very end.

So, for a string like arm/, the inner grouping could be any of:

(arm)/
(ar)(m)/
(a)(rm)/
(a)(r)(m)/

As the string gets longer, if the regex fails to match on its first try, it will check more and more combinations to try to match. To avoid this the same match could be achieved with:

'^(?P<parts>(?:[\w-]+/)*[\w-]+)/$'

because this version can only match your target strings one way.

Kyle A
  • 928
  • 7
  • 17
  • The regex you suggest will not yield expected results for input like `abc-def/`. – Wiktor Stribiżew Mar 27 '16 at 15:12
  • @WiktorStribiżew is correct - my original pattern matches `foo/`, `foo/bar/`, `foo/bar/baz/`, etc. Your pattern only matches when there are at least 2 segments (i.e., only `foo/bar/` and `foo/bar/baz/` from the aforementioned three examples). – orokusaki Mar 27 '16 at 15:39
  • Corrected my regex by replacing a `+` with a `*`. @Wiktor Stribizew and @orokusaki are correct about my typo. – Kyle A Mar 27 '16 at 16:31