I want to grab all IDs (integers) from several URLs within a text. These URLs could look like these:

http://url.tld/index.php/p1
http://url.tld/p2#abc
http://url.tld/index.php/Page/3-xxx
http://url.tld/Page/4

For this, I've built two regexes (the URLs are enclosed in [url] BBCode tags):

\[url\](http\://url\.tld/index\.php/p(\d+).*?)\[/url\]
\[url\](http\://url\.tld(?:/index\.php)?/Page/(\d+).*?)\[/url\]

If I run preg_match_all with each regex separately, I get an array that looks like this (which is correct):

array(3) {
  [0]=>
  array(2) {
    [0]=>
    string(62) "[url]http://url.tld/index.php/Page/6-fdgfh/[/url]"
    [1]=>
    string(50) "[url]http://url.tld/Page/7[/url]"
  }
  [1]=>
  array(2) {
    [0]=>
    string(51) "http://url.tld/index.php/Page/6-fdgfh/"
    [1]=>
    string(39) "http://url.tld/Page/7"
  }
  [2]=>
  array(2) {
    [0]=>
    string(1) "6"
    [1]=>
    string(1) "7"
  }
}

But if I combine both regexes with a pipe:

\[url\](http\://url\.tld/index\.php/p(\d+).*?|http\://url\.tld(?:/index\.php)?/Page/(\d+).*?)\[/url\]

it builds an array like this (which is wrong):

array(4) {
  [0]=>
  array(3) {
    [0]=>
    string(71) "[url]http://url.tld/index.php/p9-abc#hashtag[/url]"
    [1]=>
    string(62) "[url]http://url.tld/index.php/Page/6-fdgfh/[/url]"
    [2]=>
    string(50) "[url]http://url.tld/Page/7[/url]"
  }
  [1]=>
  array(3) {
    [0]=>
    string(60) "http://url.tld/index.php/p9-abc#hashtag"
    [1]=>
    string(51) "http://url.tld/index.php/Page/6-fdgfh/"
    [2]=>
    string(39) "http://url.tld/Page/7"
  }
  [2]=>
  array(3) {
    [0]=>
    string(1) "9"
    [1]=>
    string(0) ""
    [2]=>
    string(0) ""
  }
  [3]=>
  array(3) {
    [0]=>
    string(0) ""
    [1]=>
    string(1) "6"
    [2]=>
    string(1) "7"
  }
}
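
The difference can be reproduced with a short script. This is a minimal sketch: the sample text is assembled from the URLs in the dump above, and the `~` delimiter is an arbitrary choice. Capturing groups are numbered by the position of their opening parenthesis, so the second `(\d+)` becomes group 3 rather than sharing group 2.

```php
<?php
// Sample text built from the URLs shown in the dump above.
$text = '[url]http://url.tld/index.php/p9-abc#hashtag[/url] '
      . '[url]http://url.tld/index.php/Page/6-fdgfh/[/url] '
      . '[url]http://url.tld/Page/7[/url]';

// The combined pattern, wrapped in "~" delimiters.
$pattern = '~\[url\](http\://url\.tld/index\.php/p(\d+).*?'
         . '|http\://url\.tld(?:/index\.php)?/Page/(\d+).*?)\[/url\]~';

preg_match_all($pattern, $text, $matches);

// Group 2 belongs to the first alternative, group 3 to the second,
// so each ID list has empty strings where the other alternative matched.
var_dump($matches[2]);
var_dump($matches[3]);
```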

====

So, my question is: How can I fix this? I need the array structure from the first example while using both regular expressions as a single one, because I need a consistent structure for a preg_replace_callback later.

Alan Moore
SGL
  • Would flattening the array help? (http://stackoverflow.com/questions/1319903/how-to-flatten-a-multidimensional-array - specifically, check out array_walk_recursive) – Amnon Mar 23 '14 at 01:24

1 Answer

I think you're looking for the Branch Reset group:

\[url]((?|http://url\.tld/index\.php/p(\d+).*?|http://url\.tld(?:/index\.php)?/Page/(\d+).*?))\[/url]

Or, for the line-noise-challenged among us (this layout needs the x modifier so that whitespace in the pattern is ignored):

\[url]
(
  (?|
    http://url\.tld/index\.php/p(\d+)[^[]*
  |
    http://url\.tld(?:/index\.php)?/Page/(\d+)[^[]*
  )
)
\[/url]

This captures the numbers in group #2, no matter which part of the regex matched it. The whole URL is still captured in group #1.
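
A minimal sketch of the one-line pattern in use with preg_match_all. The sample text and the `~` delimiter are assumptions made for illustration; only the pattern itself comes from above.

```php
<?php
// Made-up sample text containing both URL forms.
$text = '[url]http://url.tld/index.php/p9-abc#hashtag[/url] '
      . '[url]http://url.tld/index.php/Page/6-fdgfh/[/url] '
      . '[url]http://url.tld/Page/7[/url]';

// (?| ... ) is a branch-reset group: group numbering restarts in each
// alternative, so both (\d+) groups are numbered 2.
$pattern = '~\[url]((?|http://url\.tld/index\.php/p(\d+).*?'
         . '|http://url\.tld(?:/index\.php)?/Page/(\d+).*?))\[/url]~';

preg_match_all($pattern, $text, $matches);

// $matches[0]: full [url]...[/url] matches
// $matches[1]: the URLs
// $matches[2]: the IDs, whichever alternative matched
var_dump($matches[2]);
```

The result has the same three-level layout as the array produced by each single regex, so a callback can read the ID from index 2 in every case.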

Alan Moore
  • This works great. Thank you very much :) Just one more question: what if I add a third pattern? Would this still work, or would it need further changes? – SGL Mar 23 '14 at 06:47
  • No problem, just add another pipe and paste the new regex after it. Make sure you do this inside the branch-reset group, just before the closing `)`. And make sure the new regex has only one capturing group. – Alan Moore Mar 23 '14 at 13:36
  • What would happen if it had more than one capturing group? Don't get me wrong, I just want to know before I have to ask again :D – SGL Mar 23 '14 at 15:16
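
To illustrate the last two comments, here is a hedged sketch. The third URL form (`/Thread/<id>`), the sample text, and the `[id=...]` replacement are all hypothetical, chosen only to show the mechanics. The toy pattern at the end shows what an extra capturing group does inside a branch-reset group: numbering restarts in every alternative, so an alternative with two groups fills groups 1 and 2 while an alternative with one group fills only group 1.

```php
<?php
// Hypothetical third URL form (/Thread/<id>), added inside the
// branch-reset group with exactly one capturing group of its own.
$pattern = '~\[url]((?|'
         . 'http://url\.tld/index\.php/p(\d+)[^[]*'
         . '|http://url\.tld(?:/index\.php)?/Page/(\d+)[^[]*'
         . '|http://url\.tld/Thread/(\d+)[^[]*'
         . '))\[/url]~';

$text = 'See [url]http://url.tld/Page/4[/url] and '
      . '[url]http://url.tld/Thread/12-foo[/url].';

// The callback can rely on $m[2] holding the ID for every alternative.
$result = preg_replace_callback($pattern, function (array $m): string {
    return '[id=' . $m[2] . ']';
}, $text);

echo $result, "\n";

// Toy pattern: the first alternative has two groups, the second has one.
preg_match('~(?|a(\d)(\d)|b(\d))~', 'a12', $two);
preg_match('~(?|a(\d)(\d)|b(\d))~', 'b7', $one);

var_dump(isset($two[2])); // group 2 is set when the two-group branch matched
var_dump(isset($one[2])); // group 2 is absent when the one-group branch matched
```

So an alternative with extra groups does not break the pattern, but the meaning of each group index then depends on which alternative matched, which is why keeping one capturing group per alternative gives the callback a stable structure.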