3

I'm trying to extract file paths (Windows/Ubuntu, relative/absolute) from a text document.

The regular expression code below is used check if a word is a file path or not.

It works for most of the cases but fails for one case, where it goes into an infinite loop. Any explanation for this?

import re
path_regex = re.compile(r'^([\.]*)([/]+)(((?![<>:"/\\|?*]).)+((?<![ .])(\\|/))?)*$' , re.I)
text = '/var/lib/jenkins/jobs/abcd-deploy-test-environment-oneccc/workspace/../workspace/abcd-deploy-test-environment.sh'
path_regex.search(text)
KernelPanic
  • 600
  • 8
  • 19
Hasmukh
  • 33
  • 5
  • 1
    What makes you think it goes into an infinite loop? Most likely it just takes an awfully long time. – kindall Feb 21 '17 at 17:55
  • 2
    This is simply *catastrophic backtracking*: a regex cannot go into an infinite loop (if it is implemented properly). – Willem Van Onsem Feb 21 '17 at 17:57
  • Infinite loop detectors are tricky to write. Maybe the OP is just using a heuristic, equating "awfully long time" with "infinite loop". – aghast Feb 21 '17 at 17:57
  • You have a construct like `(.+)*` in there, which results in catastrophic backtracking. `^\.*/+([^<>:"/\\|?*]+($|(?<![ .])[\\/]))+$` should work. – Aran-Fey Feb 21 '17 at 18:04
  • 1
    `((?![<>:"/\\|?*]).)+` is almost the same as `[^<>:"/\\|?*]+`. – Wiktor Stribiżew Feb 21 '17 at 18:06
  • i don't understand much about re implementation. i picked up the regex from here [link](http://stackoverflow.com/questions/24702677/regular-expression-to-match-a-valid-absolute-windows-directory-containing-spaces) , the regex was for windows pathnames , i did a mod to use it for linux pathnames . why re module does not raise an exception or something. – Hasmukh Feb 23 '17 at 04:06

2 Answers2

4

Indeed there is a problem.
You have overlayed subexpressions mixed with spurious quantifiers.

modified for required parts between slashes
It is easily fixed using this ^([\.]*)([/]+)((?:[^<>:"/\\|?*.\r\n]|\.(?![\\/]))[\\/]?)*$

The idea is to see just what your guarding against.
The guard is that you'd allow forward or back slash if not preceeded by a dot.

So, you have to include the dot in the exclusion class with the \ and /
then qualify them in a separate alternation.

If you do it this way, it will always pass.

 ^ 
 ( [\.]* )                     # (1)
 ( [/]+ )                      # (2)
 (                             # (3 start)
      (?:                           # Group start (required between slashes)
           [^<>:"/\\|?*.\r\n]            # Any character, but exclude these
        |                              # or,
           \.                            # The dot, if not followed by forward or back slash
           (?! [\\/] )
      )                             # Group end
      [\\/]?                        # Optional forward or back shash
 )*                            # (3 end)
 $
2

sln gave a good solution to your problem, so I'll try to explain what the problem is.

Welcome to the joys of catastrophic backtracking. The core of your problem is in (((?![<>:"/\\|?*]).)+((?<![ .])(\\|/))?)*. (Now that I've said that, all your problems are solved, right? Easy peasy.)

Assuming you're a bit like me and blinked blankly a couple of times the first time someone said "regex backtracking", we can work through your regex with the shorter input /path./. This is an invalid path according to your regex, but lets us (somewhat) easily walk through the problem.

^([\.]*)([/]+) matches the leading /. This works fine.

For the sake of readability here, I'm going to call the first half of the problematic capture group, ((?![<>:"/\\|?*]).)+, x+, and the second half, ((?<![ .])(\\|/))?, y?. The whole group is (x+y?).

How is (x+y?)*$ backtracking when matching path./?

  1. x+ matches path.
  2. y? matches (it matches 0 times, which is fine because of the ?)
  3. (x+y?) has now matched once
  4. (x+y?) repeats, and fails, since it does not match /. Thus, (x+y?)* has matched path.
  5. $ fails, since it does not match /.
  6. The regex engine backtracks:
    • (x+y?)* can only backtrack into its first iteration, since it only had one iteration.
    • Within that iteration, y? can't backtrack, since it matched 0 times.
    • x+ backtracks, to match only path instead of path.
  7. x+ matches path
  8. y? matches
  9. (x+y?) has now matched once (path)
  10. (x+y?) repeats and matches again:
    • x+ matches .
    • y? matches
  11. (x+y?) repeats, and fails since it does not match /. Thus, (x+y?)* has matched path.
  12. $ fails, since it does not match /.
  13. The regex engine backtracks:
    • (x+y?)* can only backtrack into its first iteration, since in its second iteration x+ matched only once and y? matched 0 times.
    • y? in the first iteration matched 0 times, so it can't backtrack
    • x+ can backtrack to only matching pat
  14. Hopefully you get the idea, but (x+y?) matches twice: pat, h.; then on the next backtrack we have pat h ., and then pa th., and so on.

It takes 478 steps for the engine to determine that /path./ is not matched by your regex. Every additional character in that problematic capture group increases the number of backtracks by a lot, and after a certain point your regex implementation is just going to throw up its hands and give up. sln's solution takes only 49 steps.

The behaviour of a regex engine when backtracking is difficult both to explain and to grasp, especially when limited to Markdown, so I would recommend running your regex through a debugger to visualize what's going on.

KernelPanic
  • 600
  • 8
  • 19