Is there a way in the PHP regex functions to get all possible matches of a regex even if those matches overlap?

For example, getting all 3-digit substrings with '/[\d]{3}/'.

You might expect to get:

"123456" => ['123', '234', '345', '456']

But preg_match_all() only returns:

['123', '456']

This is because it begins searching again after the matched substring (as noted in the documentation):

"After the first match is found, the subsequent searches are continued on from end of the last match.".
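
A minimal sketch of that behaviour, assuming the sample string from above. Printing the offsets shows that each search resumes where the previous match ended, so the overlapping candidates are never tried:

```php
<?php
// Sample input from the question.
$str = '123456';

// Plain pattern: every match consumes three characters,
// so the second search starts at offset 3 and the third at offset 6.
preg_match_all('/\d{3}/', $str, $matches, PREG_OFFSET_CAPTURE);

// With PREG_OFFSET_CAPTURE each entry is a [text, offset] pair.
foreach ($matches[0] as [$text, $offset]) {
    echo "$text at offset $offset\n";
}
```

This prints "123 at offset 0" and "456 at offset 3"; '234' and '345' start inside text that has already been consumed.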

Is there a way around this without writing a custom parser?

Jagu

3 Answers

Look-ahead assertions to the rescue!

preg_match_all('/(?=(\d{3}))/', $str, $matches);
print_r($matches[1]);

It basically captures whatever the look-ahead assertion is matching. Since the assertion is zero width, $matches[0] will only contain empty strings, but $matches[1] will contain the expected captured patterns.
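
To make the zero-width behaviour concrete, here is a small sketch that wraps the same pattern in a helper. The function name `overlapping_matches` and its `$width` parameter are hypothetical, chosen only for illustration; they are not part of PHP.

```php
<?php
/**
 * Return every run of $width digits in $str, including overlapping runs.
 * Hypothetical helper for illustration only.
 */
function overlapping_matches(string $str, int $width = 3): array
{
    // The lookahead consumes nothing, so the engine moves forward one
    // character per attempt; the inner group records what it saw.
    $pattern = '/(?=(\d{' . $width . '}))/';
    preg_match_all($pattern, $str, $matches);

    // $matches[0] holds the empty zero-width matches,
    // $matches[1] holds the captured substrings.
    return $matches[1];
}

print_r(overlapping_matches('123456'));
```

For '123456' this returns 123, 234, 345 and 456, in that order.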

Ja͢ck
  • Ah, I was on the right track. Your answer is certainly an improvement on mine, though. Nice work :) – maček Mar 17 '14 at 21:22
  • Thank you. This worked wonderfully. For interest's sake: in my case I was trying to match 11-digit Australian Business Numbers (ABNs) that may contain optional spaces and dashes. My final working regex, using your solution (and help from maček), was '/(?=(\b(\d[\s-]*){10}\d\b))/'. The results are then passed through a second function that uses a checksum to ensure they are valid ABNs. – Jagu Mar 18 '14 at 02:21

This may not be ideal, but at least it's something.

It looks like you could use a positive lookahead and PREG_OFFSET_CAPTURE to get the string offsets of every position where a 3-digit number starts, then cut the substrings out yourself:

$str = "123456";

preg_match_all("/\d(?=\d{2})/", $str, $matches, PREG_OFFSET_CAPTURE);

$numbers = array_map(function($m) use($str){
  return substr($str, $m[1], 3);
}, $matches[0]);

print_r($numbers);

Output

Array
(
    [0] => 123
    [1] => 234
    [2] => 345
    [3] => 456
)
maček
  • That is an ingenious solution. Unfortunately it won't work in my case (there are other complications I didn't explain in the question), so I'll leave the question open a little longer in case someone has another solution. But thank you! I'll give you the point if nothing comes up. – Jagu Mar 17 '14 at 12:43

With \K inside a lookbehind:

preg_match_all('~(?<=\K..).~', '123456', $m);
print_r($m[0]);

Only one character is consumed (the third); the first two are not, since they sit inside a lookbehind, which is a zero-width assertion. But \K resets the start of the reported match, so the first two characters are returned along with the third.

Note: you can't put all three characters in the lookbehind and write (?<=\K...), because in that case the regex engine would stay at the same position in the string forever.

Casimir et Hippolyte
  • Thanks for this answer, really nice one! It works in PHP, funnily when trying it in regex101 the tool throws an [error](https://regex101.com/r/pYPR8g/1): *`\K` This token can not be used in a lookbehind* – bobble bubble Jun 04 '22 at 10:42
  • @bobblebubble: yes, the reason is that pcre2 has an extra compilation option: PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK (BSK: BackSlash K). This option is activated in PHP, but not in regex101. – Casimir et Hippolyte Jun 06 '22 at 11:29
  • Amazing how you found that; I just read about it myself. Kind of funny what they called this option! – bobble bubble Jun 06 '22 at 11:47
  • @bobblebubble: The source: http://pcre.org/pcre2.txt (search: "Extra compile options") – Casimir et Hippolyte Jun 06 '22 at 11:58