0

I have a string that I run a regex against it but I get an error.

PHP:

<?php 
$str = 'modified:   apps/aaaaaa/bbbbbb/cccc
    modified:   apps/ami (new commits)
    modified:   apps/assess (new commits)
    modified:   apps/ees121 (new commits)
    modified:   apps/energy_bourse (new commits)
    modified:   apps/gis (new commits)
    modified:   apps/hse (new commits)
    modified:   apps/aa/aaa/a/bb/b/bb/bc/c22/s/df/s/
    modified:   apps/management (new commits)
    modified:   apps/payesh (new commits)
    modified:   external_apps (modified content)
    modified:   modules/esb_server (new commits)
    modified:   modules/formbuilder (new commits, modified content)
    modified:   modules/reporting (new commits)
    modified:   modules/safir (new commits)
    modified:   modules/workflow (new commits)
    modified:   vendor/raya_framework/client (new commits)
    modified:   vendor/raya_framework/core (new commits)';
preg_match_all("/modified:\s+((\w+\/?)+).*\)/", $str, $matches);
var_dump($matches);

This works fine but if I append a few characters to one of the lines, nothing's matched. E.g.:

modified:   apps/aaaaaa/bbbbbb/ccccfff

This seems to depend on word characters and not forward slashes. Why do some characters make a difference here? and what can I do?

revo
  • 47,783
  • 14
  • 74
  • 117
vatandoost
  • 2,799
  • 2
  • 12
  • 13
  • You have a few evil quantifiers that make this to run forever. It's a good reason for *Catastrophic Backtracking*. Try `modified:\s+((?>\w++\/?)+)[^\v)]*\)` instead. – revo Sep 10 '18 at 06:32
  • 2
    @revo: If it's tested then can you please add an answer so it may help to others as well – Bhavin Solanki Sep 10 '18 at 06:34

1 Answers1

2

Those extra characters make regex engine to reach backtracking steps limit:

var_dump(preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR); // will return `true`

Your regex is almost short and may seem right but in fact, it abuses quantifiers and that causes a Catastrophic Backtracking to happen on failures. When it fails to match a ) at the end of a sequence of \w+/? patterns it tries to backtrack into all previous sub-expressions in the hope of finding a ). But it never happens and nested quantified groups and tokens make this process like to run forever.

The solution is re-constructing your regex to consider this:

modified:\s+((?>\w+\/?)+).*\)

I just made second capturing group an atomic group. Atomic groups - as the name - wouldn't allow backtracks into the cluster. So if it fails to find a pattern after matching a \w+\/? it never backtracks into \w+\/? and this makes an early failure to happen.

A correct modification on this regex would be replacing .* with something more restrictive:

modified:\s+((?>\w+\/?)+)[^)\v]*\)

See live demo here

PHP code:

preg_match_all('~modified:\s+((?>\w+/?)+)[^)\v]*\)~', $str, $matches);
revo
  • 47,783
  • 14
  • 74
  • 117
  • 1
    Thanks for the really interesting link about catastrophic backtracking. – Nick Sep 10 '18 at 08:19
  • Your regex doesn't capture all the groups, it skips the ones which follow a line with no `(new commits)` or similar – Nick Sep 10 '18 at 08:33
  • @Nick Because it needs to find a `)` at the same line. Please see the OP's regex. – revo Sep 10 '18 at 08:35
  • @Nick There is [this answer](https://stackoverflow.com/a/50511784/1020526) too that provides some helpful hints while writing a regex you may want to take a look. – revo Sep 10 '18 at 08:35
  • @Nick The regex you used in your link isn't mine. You made second part optional just copy / paste the one in this answer. But the problem is with double quotes. Remove them and use single quotes. I'll edit my answer too to reflect this. – revo Sep 10 '18 at 08:54
  • Oops. that's embarrassing, I had been playing with yours and stuffed it up. I'll delete that comment... – Nick Sep 10 '18 at 08:56
  • No problem at all. – revo Sep 10 '18 at 09:08