My script is simple:

<?php
$str = "mem: 9 334 23423343 3433434";

$num_matches = preg_match_all("/^mem:(\s+\d+)+$/", $str, $matches);
if (!$num_matches) {
        throw new Exception("no match");
}

echo "$num_matches matches\n";
var_dump($matches);

I expected the group (\s+\d+)+ to capture every number in $str, but the output shows only the last one:

1 matches
array(2) {
  [0] =>
  array(1) {
    [0] =>
    string(27) "mem: 9 334 23423343 3433434"
  }
  [1] =>
  array(1) {
    [0] =>
    string(8) " 3433434"
  }
}

As you can see, $matches[1] contains only the last \s+\d+ occurrence in $str. I expected it to contain all of the matches: 9, 334, 23423343, 3433434.

Is there some way to alter my pattern so that it returns all of these numbers for a string that may contain an arbitrary number of numbers? Am I correct in thinking this is incorrect behavior by preg_match_all? Should I report it to the PHP devs?

EDIT: according to the docs, the default flag, PREG_PATTERN_ORDER, does the following:

Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on.

S. Imp
  • "mem:" is part of your match string, and so something is only a match if it includes that. You only have one string that includes "mem:" – Patrick Q Feb 08 '19 at 21:01
  • @PatrickQ I had the wrong output in my post so I edited it. Yes "mem" is part of the *whole* pattern but it's not part of the parenthetical. $matches[1] should contain all the matches of the first parenthetical -- it only has the last match. – S. Imp Feb 08 '19 at 21:10
  • This question might help: [Split camelCase word into words with php preg_match](https://stackoverflow.com/a/4519809/4362965) – ttvd94 Aug 10 '21 at 12:03

1 Answer

PCRE keeps only the last occurrence matched by a repeated capturing group, so this behavior is expected. To return each number as a separate match, use the \G anchor as follows:

(?:^mem:|\G(?!^))\s+\K\d+

Regex breakdown:

  • (?: Start of non-capturing group
    • ^mem: Match mem: at beginning of input string
    • | Or
    • \G(?!^) Continue from where the previous match ended, but not at the beginning of the string
  • ) End of non-capturing group
  • \s+\K Match one or more whitespace characters, then reset the match start so they are left out of the result
  • \d+ Match digits

PHP code:

preg_match_all("~(?:^mem:|\G(?!^))\s+\K\d+~", $str, $matches);
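
To see the pattern at work on the string from the question, here is a minimal runnable sketch. The function name extract_mem_numbers is a hypothetical helper introduced only for illustration; the pattern itself is the one given above.

```php
<?php
// Hypothetical helper (not from the original answer) wrapping the \G pattern.
function extract_mem_numbers(string $line): array
{
    // ^mem:    anchors the first match at the start of the string
    // \G(?!^)  lets each later match begin where the previous one ended
    // \s+\K    consumes the whitespace, then drops it from the reported match
    // \d+      the number that ends up in $matches[0]
    $pattern = '~(?:^mem:|\G(?!^))\s+\K\d+~';
    preg_match_all($pattern, $line, $matches);

    // There is no capturing group, so every number is a full match.
    return $matches[0];
}

$str = "mem: 9 334 23423343 3433434";
$numbers = extract_mem_numbers($str);

echo count($numbers), " matches\n"; // 4 matches
print_r($numbers);                  // 9, 334, 23423343, 3433434 as strings
```

Note that, unlike the pattern in the question, this one has no trailing $ anchor, so it collects the leading run of numbers without checking that nothing else follows them.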
revo
  • that is some serious regex kung fu there. PHP's preg functions are certainly derived from PCRE, do you have any documentation link for the statement "PCRE stores the last occurrence in a repeating capturing group" ? That would be icing on the cake. – S. Imp Feb 08 '19 at 21:27
  • Read [here](https://pcre.org/pcre.txt): *If a capturing subpattern is matched repeatedly, it is the last portion of the string that it matched that is returned.* – revo Feb 08 '19 at 21:36
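
The quoted rule can be checked with a short sketch. The sample input below is an assumption chosen for illustration, not the string from the question; the pattern is the one from the question.

```php
<?php
// Sample input chosen for illustration; it is not from the question.
$sample = "mem: 1 22 333";

// The group (\s+\d+) is matched three times within a single overall match.
preg_match('/^mem:(\s+\d+)+$/', $sample, $m);

var_dump($m[0]); // string(13) "mem: 1 22 333"
var_dump($m[1]); // string(4) " 333"
```

Only the final repetition survives in $m[1], which is the same behavior preg_match_all showed in the question.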