20

Update/Note:

I think what I'm probably looking for is to get the captures of a group in PHP.

Referenced: PCRE regular expressions using named pattern subroutines.

(Read carefully:)


I have a string that contains a variable number of segments (simplified):

$subject = 'AA BB DD '; // could be 'AA BB DD CC EE ' as well

I would like now to match the segments and return them via the matches array:

$pattern = '/^(([a-z]+) )+$/i';
$result = preg_match_all($pattern, $subject, $matches);

This will only return the last match for the capture group 2: DD.
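
A minimal sketch of that behaviour, using only the $subject and $pattern from above (the comments describe what ends up in $matches; nothing else is assumed):

```php
<?php
$subject = 'AA BB DD ';
$pattern = '/^(([a-z]+) )+$/i';

// The anchored pattern matches the whole subject exactly once.
$result = preg_match_all($pattern, $subject, $matches);

// $matches[0] holds the full match. $matches[1] and $matches[2] each
// hold a single entry: the capture of the final iteration of the group.
print_r($matches[1]); // one entry: "DD " (with the trailing space)
print_r($matches[2]); // one entry: "DD"
```

The captures for AA and BB are overwritten while the group repeats, so they never reach the matches array.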

Is there a way that I can retrieve all subpattern captures (AA, BB, DD) with one regex execution? Isn't preg_match_all suitable for this?

This question is a generalization.

Both the $subject and the $pattern are simplified. Naturally, a plain list like AA, BB, ... is much easier to extract with other functions (e.g. explode) or with a variation of the $pattern.

But I'm specifically asking how to return all of the subgroup matches with the preg_...-family of functions.

For a real-life case, imagine multiple (nested) levels, each with a variable number of subpattern matches.

Example

This is an example in pseudo code to describe a bit of the background. Imagine the following:

Regular definitions of tokens:

   CHARS := [a-z]+
   PUNCT := [.,!?]
   WS := [ ]

$subject gets tokenized based on these. The tokenization is stored in an array of tokens (type, offset, ...).

That array is then transformed into a string, containing one character per token:

   CHARS -> "c"
   PUNCT -> "p"
   WS -> "s"

This makes it possible to run regular expressions over tokens (rather than character classes etc.) on the token string, where each string index corresponds to one token. E.g.

   regex: (cs)?cp

to express one or two groups of chars followed by a punctuation token.
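
The tokenization step could be sketched roughly as follows. The helper name tokenize(), the shape of the token array, the group names c/p/s and the i flag are assumptions made for illustration; they are not taken from the real code.

```php
<?php
// Hypothetical helper: split $subject into tokens of type c (CHARS),
// p (PUNCT) or s (WS) and remember the offset of each token.
function tokenize(string $subject): array
{
    // The i flag is an assumption so that upper-case subjects work too.
    // Characters that fit none of the three classes are silently skipped.
    $pattern = '/(?<c>[a-z]+)|(?<p>[.,!?])|(?<s>[ ])/i';
    preg_match_all($pattern, $subject, $sets, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $tokens = [];
    foreach ($sets as $set) {
        foreach (['c', 'p', 's'] as $type) {
            // Groups that did not take part are absent or carry offset -1.
            if (isset($set[$type]) && $set[$type][1] !== -1) {
                $tokens[] = [
                    'type'   => $type,
                    'text'   => $set[$type][0],
                    'offset' => $set[$type][1],
                ];
                break;
            }
        }
    }
    return $tokens;
}

$tokens = tokenize('Hello world!');

// One character per token, so string index N refers to $tokens[N].
$tokenString = implode('', array_column($tokens, 'type'));

// Run the token-level regex from above against the token string.
$matched = preg_match('/(cs)?cp/', $tokenString);
```

Because every token is exactly one character in the token string, an offset reported by a token-level match can be used directly as an index into $tokens.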

As I can now express self-defined tokens as regex, the next step was to build the grammar. This is only an example, in a sort of ABNF style:

   words = word | (word space)+ word
   word = CHARS+
   space = WS
   punctuation = PUNCT

If I now compile the grammar for words into a (token) regex, I would naturally like to have all subgroup matches of each word.

  words = (CHARS+) | ( (CHARS+) WS )+ (CHARS+)    # words resolved to tokens
  words = (c+)|((c+)s)+(c+)                       # words resolved to regex

I could code up to this point. Then I ran into the problem that the subgroup matches only contained their last match.
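
The problem can be reproduced on a token string directly. This is only a sketch: the ^ and $ anchors, the outer (?: ) and the capture group around the trailing word are additions made for the illustration, and PREG_OFFSET_CAPTURE is used so that the reported token index is visible.

```php
<?php
// Token string for three words separated by spaces, e.g. "aa bb cc".
$tokenString = 'cscsc';

// The words rule as a regex; anchors and the outer (?: ) are additions.
$pattern = '/^(?:(c+)|((c+)s)+(c+))$/';

preg_match($pattern, $tokenString, $m, PREG_OFFSET_CAPTURE);

// Group 3 is the word inside the repeated ((c+)s) group. The group ran
// twice (token index 0 and token index 2), but only the second run is kept.
[$text, $index] = $m[3];

// Group 4 is the trailing word at token index 4.
[$lastText, $lastIndex] = $m[4];
```

The word at token index 0 was matched as well, but its capture is no longer available once the match has finished.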

So I have the option either to create an automaton for the grammar on my own (which I would like to avoid, to keep the grammar expressions generic) or to somehow make preg_match work for me so I can spare myself that.

That's basically all. Probably now it's understandable why I simplified the question.



hakre
  • If you're generalising your question so much that alternative though correct answers can be given, your question isn't that valuable. Don't simplify if you don't want the simplified answers. -1. – Berry Langerak Jun 16 '11 at 12:04
  • 1
    I'm looking for an answer on a specific topic. I don't see why simplification should be bad to make this visible, albeit I see that a certain level of abstractness can be a burden. – hakre Jun 16 '11 at 12:10
  • 1
    Well, obviously, because you want an answer on a subgroup, while your example doesn't include the need for a subgroup. The example is flawed. – Berry Langerak Jun 16 '11 at 12:24
  • @Berry Langerak: There is always some loss in simplification. You find a more detailed example added now. – hakre Jun 16 '11 at 12:55
  • Just stumbled over: `J (PCRE_INFO_JCHANGED)` - The `(?J)` internal option setting changes the local `PCRE_DUPNAMES` option. Allow duplicate names for subpatterns which might not solve this here but is generally interesting: http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php – hakre Aug 30 '11 at 20:53
  • Could `preg_split` be extrapolated? [Split string by delimiter, but not if it is escaped](http://stackoverflow.com/q/6243778/367456). – hakre Nov 28 '11 at 22:54
  • An answer, http://stackoverflow.com/a/8198121/367456, to the question http://stackoverflow.com/q/8197469/367456 – hakre Nov 28 '11 at 23:12
  • Another related question is: [Collapse and Capture a Repeating Pattern in a Single Regex Expression](http://stackoverflow.com/q/15268504/367456) - It got some attention lately. – hakre May 12 '13 at 09:42

8 Answers

4

Similar thread: Get repeated matches with preg_match_all()

Check the chosen answer there. Mine might be useful too, so I will duplicate it here:

From http://www.php.net/manual/en/regexp.reference.repetition.php :

When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration.

Personally, I gave up and am going to do this in two steps.
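
A rough sketch of what such a two-step approach could look like for the simplified subject from the question. The variable names and the exact patterns are assumptions for illustration, not a tested solution for the nested case.

```php
<?php
$subject = 'AA BB DD ';

// Step 1: validate the whole subject with an anchored pattern.
$isValid = preg_match('/^(?:[a-z]+ )+$/i', $subject) === 1;

// Step 2: only then collect every segment with an unanchored pattern.
$segments = [];
if ($isValid) {
    preg_match_all('/([a-z]+) /i', $subject, $matches);
    $segments = $matches[1];
}

print_r($segments); // three entries: AA, BB, DD
```

The price is a second pass over the subject, and for nested groups one extra pass per nesting level.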

EDIT:

I see that in the other thread someone claimed that a lookbehind method is able to do it.

user109764
3

Try this:

preg_match_all("'[^ ]+'i",$text,$n);

$n[0] will contain an array of all non-space character groups in the text.

Edit: with subgroups:

preg_match_all("'([^ ]+)'i",$text,$n);

Now $n[1] will contain the subgroup matches, that are exactly the same as $n[0]. This is pointless actually.

Edit2: nested subgroups example:

$test = "Hello I'm Joe! Hi I'm Jane!";
preg_match_all("/(H(ello|i)) I'm (.*?)!/i",$test,$n);

And the result:

Array
(
    [0] => Array
        (
            [0] => Hello I'm Joe!
            [1] => Hi I'm Jane!
        )

    [1] => Array
        (
            [0] => Hello
            [1] => Hi
        )

    [2] => Array
        (
            [0] => ello
            [1] => i
        )

    [3] => Array
        (
            [0] => Joe
            [1] => Jane
        )

)
aorcsik
  • I'm interested in the matches of a variant number of subgroup matches. Your regex does not have any subgroups. – hakre Jun 16 '11 at 11:52
  • Well then I don't understand your question. There is no need for subgroups for the matching you asked for. – aorcsik Jun 16 '11 at 11:55
  • It's not only you who doesn't understand the question. It's the question that is completely wrong, because hakre can't explain himself. -1 for the question – dynamic Jun 16 '11 at 11:56
  • I've added a little more info to make visible that it has a certain level of abstraction / generalization. – hakre Jun 16 '11 at 12:00
2

Is there a way that I can retrieve all matches (AA, BB, DD) with one regex execution? Isn't preg_match_all suitable for this?

Your current regex seems to be for a preg_match() call. Try this instead:

$pattern = '/[a-z]+/i';
$result = preg_match_all($pattern, $subject, $matches);

Per comments, the ruby regex I mentioned:

sentence = %r{
(?<subject>   cat   | dog        ){0}
(?<verb>      eats  | drinks     ){0}
(?<object>    water | bones      ){0}
(?<adjective> big   | smelly     ){0}
(?<opt_adj>   (\g<adjective>\s)? ){0}
The\s\g<opt_adj>\g<subject>\s\g<verb>\s\g<opt_adj>\g<object>
}x

md = sentence.match("The cat drinks water");
md = sentence.match("The big dog eats smelly bones");

But I think you'll need a lexer/parser/tokenizer to do the same kind of thing in PHP. :-|

Denis de Bernardy
  • Please read the longer example at the end. I'm really looking into subgroup pattern matching over a full match, which spares me writing a parser for the groups and repetitions of the BNF grammar. Therefore I need all of the (sub) matches while consuming the whole subject. `preg_match_all` will always return only the last match from its subpatterns when those can repeat. – hakre Jun 16 '11 at 18:42
  • I think what you're trying to do is achievable with named groups and a recursive regex, but I'm not sure that PHP supports the latter. You might be able to manage it in ruby, though. – Denis de Bernardy Jun 16 '11 at 18:52
  • I'll chew on it a bit this evening. – Denis de Bernardy Jun 16 '11 at 19:03
  • Btw, what's wrong with the idea of doing: `$pattern = '/regex1|regex2/'` in my above suggestion? You'd arguably need to test each one for punctuation, but at least they'll be split properly and the individual word/punct groups will be extracted, no? – Denis de Bernardy Jun 16 '11 at 19:08
  • No because it's grammar: There is at least one group per word and there is the semantics of the words together to form the next word of the grammar. So it's stacked. And it's with optional repetition inside these stacks. So if I only could grab the data of the matches, this would be perfect. However it's returning only the last backreference. would be cool to have a stack of backreference even after regex execution. – hakre Jun 16 '11 at 19:27
  • Last question... Have you looked into PHP-based lexers and tokenizers? I ask, because it may be that what you're trying to parse [won't necessarily be achievable](http://en.wikipedia.org/wiki/Chomsky_hierarchy) using regular expressions. – Denis de Bernardy Jun 16 '11 at 19:30
  • Yes I did but I'm always open to suggestions. I experimented with the pear, the lemon and the java one. As for chomsky: I have the code to validate already the whole value and it works great. My problem is the slicing, so that actually I come one step ahead from tokens into the elements of the grammar. – hakre Jun 16 '11 at 19:33
  • Yeah, the thing is, I'm suspicious that you'll be able to manage this using regular expressions. I could arguably post the regex from p.135 of "Programming Ruby 1.9", but I'm a) suspicious they work in PHP (in fact, nearly certain they don't, due to the recursive regex flavor) and b) still suffer from not matching all of the individual tokens. (The syntax is, basically `/?cat|dog)meows|barks)The\s\g\s\g/` with a recursive twist to it.) – Denis de Bernardy Jun 16 '11 at 19:37
  • (I've added the above-mentioned regex, for information.) – Denis de Bernardy Jun 16 '11 at 19:43
  • The issue is, the capturing problem is still around, I think. I'm pretty sure that replacing the `(\g<adjective>\s)?` with `(\g<adjective>\s)+` would yield an issue similar to that which you're getting with `preg_match_all()`. – Denis de Bernardy Jun 16 '11 at 20:14
  • That said, my previous comment prompted a thought. Why not match and capture `([a-z]+ )+` and explode() the result? – Denis de Bernardy Jun 16 '11 at 20:16
  • I do not understand what you mean. What should explode do? if this gets somewhere deeper, explode is linear. – hakre Jun 16 '11 at 20:18
  • It was a silly comment. I considered it further and it would be equivalent to calling `explode(' ', $str)` directly. :-( – Denis de Bernardy Jun 17 '11 at 11:26
  • PHP also allows definitions: http://stackoverflow.com/questions/2583472/regex-to-validate-json – useless Jun 04 '14 at 13:41
  • @useless: since a very recent version, yes. – Denis de Bernardy Jun 04 '14 at 15:00
  • I would not know what you consider as recent, but php 5.2 has been around for around 8 years (since 2006), I am sure 5.2 supports it and am almost sure that any php 5.0 also does. – useless Jun 10 '14 at 00:30
1

You can't extract the subpatterns because the way you wrote your regex returns only one match (using ^ and $ at the same time, and + on the main pattern).

If you write it this way, you'll see that your subgroups are correctly there:

$pattern = '/(([a-z]+) )/i';

(this still has an unnecessary set of parentheses, I just left it there for illustration)
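
If the whole subject still has to be consumed, one variation on the unanchored pattern is to prefix it with \G and compare lengths afterwards. This is a swapped-in technique sketched under assumptions, not something the pattern above does by itself; the variable names are illustrative.

```php
<?php
$subject = 'AA BB DD ';

// \G only matches where the previous match ended (or at the start of the
// subject for the first match), so matches cannot skip unexpected input.
preg_match_all('/\G([a-z]+) /i', $subject, $matches);

// The subject is fully consumed only if the matches add up to its length.
$consumed = strlen(implode('', $matches[0]));
$isComplete = $consumed === strlen($subject);

print_r($matches[1]); // three entries: AA, BB, DD
```

For a subject such as 'AA BB DD #', the same three captures are returned but $isComplete is false, which takes over the job the ^ and $ anchors did in the original pattern.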

kapa
  • Is it possible to make the expression always consume the whole subject? – hakre Jun 16 '11 at 12:05
  • @hakre My regex? Yes, it will. It will return all the patterns that match the rule. Actually `'/([a-z]+) /i'` should be enough. – kapa Jun 16 '11 at 12:09
  • When I add a `#` to the end of the subject, it does return matches although it does not consume the whole `$subject`. I had added start and end markers to my pattern because I wanted to stretch it over the full contents of `$subject`. – hakre Jun 16 '11 at 12:12
  • @hakre What do you want to happen when a `#` is added at the end of string exactly? Your pattern consumes the whole string, the `#` will just not be matched. If you need it to be matched, you need a different regex. Please explain what do you exactly want. – kapa Jun 16 '11 at 12:17
  • Hmm, so you do not see a way to use `^` and `$` within the pattern? I was building a parser that transforms an ABNF into regex and I want to preserve the matching of subgroups, but the grammar needs to always match all words in sentences and groups - as a whole. – hakre Jun 16 '11 at 12:18
  • @hakre Nope. Then you will match the whole string (which is not your goal). I could help if you clarified what you exactly want to happen. – kapa Jun 16 '11 at 12:20
  • I want to match the whole string, but I want to get all subpattern matches as well - perhaps it's not possible with preg_match_all. That's just what I would like to know. – hakre Jun 16 '11 at 12:21
  • 1
    @hakre Possibly you could match the whole string with `preg_match()`, and if it is fine, run the `preg_match_all()` to extract the values. – kapa Jun 16 '11 at 12:22
  • @bazmegakapa: I added an example for some background info. – hakre Jun 16 '11 at 12:52
0

Yes, you're right: the solution is to use preg_match_all. preg_match_all finds every match in the subject, so don't use the start anchor ^ and the end anchor $; that way preg_match_all puts all found matches into an array.

Each new pair of parentheses adds a new array holding that group's matches.

Use ? for optional matches.

You can separate different groups of patterns with parentheses () so that each group is found and added to its own array (which lets you count matches or categorize each match in the returned array).

Clarification required

Let me try to understand your question, so that my answer matches what you ask.

  1. Is your $subject not a good example of what you're looking for?

  2. Would you like the preg_match search to split what you provided in $subject into 4 categories: words, characters, punctuation and white space? And what about numbers?

  3. Would you also like the returned matches to have the offsets of the matches specified?

Does $subject = 'aa.bb cc.dd EE FFF,GG'; better fit a real-life example?

I will take the basic example in your $subject and make it work to give you exactly what you asked for.

So can you edit your $subject so that it better fits all the cases that you want to match?

Original '/^(([a-z]+) )+$/i';

Keep me posted, you can test your regexes here http://www.spaweditor.com/scripts/regex/index.php

Partial answer

/([a-z])([a-z]+)/i

AA BB DD CD

Array
(
    [0] => Array
        (
            [0] => AA
            [1] => BB
            [2] => DD
            [3] => CD
        )

    [1] => Array
        (
            [0] => A
            [1] => B
            [2] => D
            [3] => C
        )

    [2] => Array
        (
            [0] => A
            [1] => B
            [2] => D
            [3] => D
        )

)
GuruJR
  • 1
    No that is not the solution. Your example can not even validate that the whole string matches the regex, you've just shifted the problem onto a subset of the string instead of the whole string. Also where are the subgroups and all their matches/captures? – hakre Oct 07 '12 at 07:38
  • I want to run preg_match_all and want to get all subgroup captures, not only the last ones. – hakre Oct 07 '12 at 16:24
  • @hakre There are 2 1/2 types of subgroups, because your regex is flawed. All proper answers will be wrong; we don't know what kind of results you want. Give us an example of the result array you want. – GuruJR Oct 13 '12 at 08:11
  • 1
    `((a)(b)){2}` => return the *two* outer group matches, and return the *two* inner group matches, each of which then exists *two* times, for example. This example could be a subgroup as well, not only the whole pattern. AFAIK this is not possible with PHP's regex engine in one go. – hakre Oct 13 '12 at 08:29
  • I should put the example I give in the question into code so that its abstract character gets a more "hands-on" representation. That should help, maybe. – hakre Oct 13 '12 at 08:41
  • preg_match_all matches repeatedly, so don't use the start anchor `^` and the end anchor `$`. With your regex as it is, it will only give you a submatch from the single match that covers everything, which is the last DD_ – GuruJR Oct 13 '12 at 08:58
0

Edit

I didn't realize what you had originally asked for. Here is the new solution:

$result = preg_match_all('/[a-z]+/i', $subject, $matches);
$resultArr = ($result) ? $matches[0] : array();
moteutsch
  • That regex does not have any subgroups. I was looking for matches of subgroups specifically. – hakre Jun 16 '11 at 11:53
0

How about:

$str = 'AA BB CC';
$arr = preg_split('/\s+/', $str);
print_r($arr);

output:

(
    [0] => AA
    [1] => BB
    [2] => CC
)
Toto
0

I may have misunderstood what you're describing. Are you just looking for a pattern for groups of letters with whitespace between?

// any subject containing words:
$subject = 'AfdfdfdA BdfdfdB DdD'; 
$subject = 'AA BB CC';
$subject = 'Af df dfdA Bdf dfdB DdD';

$pattern = '/(([a-z]+)\s)+[a-z]+/i';

$result = preg_match_all($pattern, $subject, $matches);
print_r($matches);
echo "<br/>";
print_r($matches[0]);  // this matches $subject
echo "<br/>".$result;
questioner