preg_match search

Question

<?php
$content = "
{php
    {php 1 php}
    {php 2 php}
    {php 3 php}
php}";

How can I get 4 strings?

First:

{php 1 php}
{php 2 php}
{php 3 php}

Second:

Third:

Fourth:

Thou shalt not parse HTML using regular expressions. Thou shalt use a [DOM Parser](http://stackoverflow.com/questions/3577641) instead — Pekka, Nov 20 '10 at 18:16
@Isis: Your changes do not change the problem. Regular expressions are only capable of parsing regular languages. — jwueller, Nov 20 '10 at 18:27
If you are insistent on doing that with preg_match(), then you will most likely need to use a recursive regular expression. See: http://php.net/manual/en/regexp.reference.recursive.php — Orbling, Nov 20 '10 at 18:28
So you are not looking to parse HTML, but something entirely different now? Any chance you can convert the data into HTML/XML? Because then you could use a DOM parser — Pekka, Nov 20 '10 at 18:42
@Isis: Could you please use more descriptive titles? [Your previous questions](http://stackoverflow.com/users/263957/isis) all seem to have more or less the same meaningless titles. — Gumbo, Nov 20 '10 at 18:58
@elusive: This is completely irrelevant. Nobody uses REGULAR in that sense. Regular expressions haven’t been REGULAR since Ken Thompson put backrefs into them 40 years ago. Instead of REGULAR, they are useful, practical, and powerful. `/(.)\1/` is a regex that is *ipso facto* not a REGULAR language. **BIG DEAL!** Nobody uses REGULAR regular expressions any more. – tchrist, Nov 20 '10 at 20:47
@Orbling: That’s true. Nothing wrong with recursive regular expressions. They work fabulously! — tchrist, Nov 20 '10 at 20:48
@tchrist Indeed I use them fairly often for just such examples as the above, beats building a parser for a simple case. Only the regexp that form them are quite complex and baffle a lot of people, so need commenting well in the code. — Orbling, Nov 20 '10 at 21:24
@Orbling: I used recursion in [this answer](http://stackoverflow.com/questions/4218552/regular-expression-to-match-12345/4219645#4219645) for finding numbers w/descending digits. I’m on bit of a crusade to get people to use all their software engineering skills on regexes just as they would any other code: whitespace for grouping, indentation, cognitive chunking; comments; problem decomposition&topdown programming w/ [grammatical regexes](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491); & alphabetic names for in-regex subs. — tchrist, Nov 20 '10 at 21:37
@Pekka: Sometimes that advice is 100.00000000% right on the money, but often it is not. It isn’t fair to people to just parrot some short refrain as though it were a divine commandment. There should be more resources here that better explain the tradeoffs in a reasoned discussion. It isn’t kind to just tell people what to do without telling them the whys and wherefores. No single answer fits all situations. – tchrist, Nov 20 '10 at 21:45
@tchrist look at my very first link. It explains the tradeoffs in a reasoned discussion. If you want to start a reasoned discussion for every one of the "I won't explain what I'm doing, but I want to parse HTML with regular expressions" questions on SO, be my guest. Also, look at the very first revision of the question, in which the OP is using HTML as the example. The use case is *exactly* what a DOM parser was built for. Usually people are just too lazy to use one, and would rather have somebody build a regex for them because it requires less effort — Pekka, Nov 20 '10 at 21:52
@tchrist looking at the OP's question history, though, I'm ready to concede that this may not be the case here - he seems to be building something bigger and more complex — Pekka, Nov 20 '10 at 22:02
@Pekka I recant: your link wasn’t the silly one I thought. True, questions often fail to convey the full circumstances. When you say people are too lazy to use a parser… there’s good-lazy and bad-lazy. I’d call avoiders bad-lazy, since they (may) make more work for themselves, not less, by avoiding parsers. For my real work, I always use them, since I parse tens of thousands of random HTML pages every week. Sometimes I use regexes on my own boiler-plate HTML: it’s faster to write + safe cause it’s constrained as the myriad alien pages can never be. I’m also a lot “regexier” than many querents. — tchrist, Nov 20 '10 at 22:12
@tchrist yeah, and there's nothing bad with using regular expressions on such restricted HTML. Looking at the OP's history, it might even be that he really, really *needs* regexes (he's working on a templating engine of some sort). But that *must* be expressly mentioned in the question - the vast majority of people coming on SO asking for an HTML parsing regex *really* need a DOM parser. It's sometimes a fight to get them to understand that, and people answering here are tired of that fight, which leads to unfriendly (and sometimes unfair) responses – Pekka, Nov 20 '10 at 22:17

score 4 · Accepted Answer · answered Nov 20 '10 at 22:13

While you could easily parse such input with a simple counter, it is possible to use a recursive regex to get what you want. A simple (?) regex to validate the input would be:

^({php\s*(\d+|(?1)+)\s*php}\s*)$

(?1) is a recursive match: it tries to match the first group again, which is another {php ... php} token. There is also a capturing group between the two phps to capture their content.
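
As a rough illustration of the validation pattern, here is a minimal sketch in PHP. It assumes the question's input with surrounding whitespace trimmed (the pattern is anchored with ^ and $, so the leading newline would otherwise stop it from matching), plus one made-up unbalanced string; the variable names are mine, not from the answer.

```php
<?php
// Validation pattern from the answer. Single quotes keep the backslashes literal.
$pattern = '/^({php\s*(\d+|(?1)+)\s*php}\s*)$/';

// The question's input, trimmed so that ^ can match at the first "{php".
$content = trim("
{php
    {php 1 php}
    {php 2 php}
    {php 3 php}
php}");

// Hypothetical broken input: the outer token is never closed.
$unbalanced = "{php {php 1 php} php";

// preg_match() returns 1 on a match and 0 on no match.
$isBalanced   = preg_match($pattern, $content) === 1;
$isUnbalanced = preg_match($pattern, $unbalanced) === 1;

var_dump($isBalanced);
var_dump($isUnbalanced);
```

This only tells you whether the whole string is one balanced token; it does not collect the nested pieces.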

In your case you want to capture overlapping results (in fact, even results contained within other results). This is even less pretty, but still possible, using a look-ahead. Look-around can have capturing groups, so the pattern would be:

(?=({php\s*(\d+|(?1)+)\s*php}\s*))

The result has two extra captured groups: blank results for the look-around, and the whole token including the outer {php ... php}. If you use PREG_PATTERN_ORDER, your expected results will be in the third position ([2]):

[2] => Array
(
    [0] => {php 1 php}
           {php 2 php}
           {php 3 php}
    [1] => 1
    [2] => 2
    [3] => 3
)
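
For reference, a sketch of how the look-ahead pattern might be run from PHP with preg_match_all(). The variable names and the trim() step are my own additions: the recursive group also swallows trailing whitespace, so the raw captures can end in a newline.

```php
<?php
$content = "
{php
    {php 1 php}
    {php 2 php}
    {php 3 php}
php}";

// Look-ahead version of the pattern: each match is zero-width, so the scan
// continues character by character and nested tokens are found as well.
$pattern = '/(?=({php\s*(\d+|(?1)+)\s*php}\s*))/';

$count = preg_match_all($pattern, $content, $matches, PREG_PATTERN_ORDER);

// $matches[0]: empty strings (the look-ahead consumes nothing)
// $matches[1]: whole tokens, including the surrounding {php ... php}
// $matches[2]: the content between the delimiters
$inner = array_map('trim', $matches[2]);

print_r($inner);
```

PREG_PATTERN_ORDER is the default for preg_match_all(), so the flag is spelled out here only to match the text.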

Here's a slightly more complex example: http://ideone.com/sWWrT

Now, the mandatory word of caution. As I said earlier, this is much more readable and maintainable with a simple depth counter. You don't really need a regex here, beyond recreational use.
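
One way the counter idea could look, as a sketch only: the function name collectPhpBlocks() is hypothetical, and it uses a stack of offsets rather than a bare counter so that tokens on every nesting level can be collected (the stack size plays the role of the depth). It assumes PHP 7.1 or later for the type hints and array destructuring.

```php
<?php
// Hypothetical helper, not from the answer: returns the inner text of every
// balanced {php ... php} token, outermost first.
function collectPhpBlocks(string $content): array
{
    // Locate every "{php" and "php}" marker together with its byte offset.
    preg_match_all('/\{php|php\}/', $content, $m, PREG_OFFSET_CAPTURE);

    $stack  = [];  // offsets just past each still-open "{php"
    $blocks = [];  // inner text keyed by the offset where it starts
    foreach ($m[0] as [$token, $offset]) {
        if ($token === '{php') {
            $stack[] = $offset + strlen($token);
        } elseif ($stack) {
            $start = array_pop($stack);
            $blocks[$start] = trim(substr($content, $start, $offset - $start));
        }
        // A "php}" seen while the stack is empty is unbalanced and is skipped.
    }

    ksort($blocks);  // order by opening position
    return array_values($blocks);
}

$result = collectPhpBlocks("{php {php 1 php} {php 2 php} php}");
print_r($result);
```

A regex is still used here, but only as a tokenizer for the two markers; the nesting logic lives in ordinary code.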

Kobi

“recreational regexes”—I *like* that! Nice work and thank you. Speaking of a depth counter, by chance are you familiar with [PCRE’s callout mechanism](http://linux.die.net/man/3/pcrecallout)? I wondered whether PHP made use of that somehow; would you know the answer? It corresponds (more or less) to Perl’s `(?{…})` regex code escapes. You could use a callout in the `COND` part of the conditional pattern, `(?(COND)YES_PATTERN|NO_PATTERN)`, to look at your depth counter. `(COND)` can also be a recursion-test like `(R)`, `(R1)`, `(R2)`, or `(R&NAME)`. That doesn’t require callout support. – tchrist Nov 20 '10 at 22:38
@tchrist - Thanks! I believe that's the same as the `e` flag many flavors offer. I'm not familiar with that because I never had a chance to use it - I know little to no PHP, Perl or Python. Also, I was wrong - a counter is good to take one balanced token, but not to collect them on all levels. Either way, I somehow view this as less fun with code blocks. – Kobi Nov 21 '10 at 09:41
@tchrist Did you ever find out if code callouts are available through the `preg_` interface to `PCRE`? – zx81 Jun 24 '14 at 22:14
@tchrist Mmm, trying it now, using `$regex = "~(\d+)(?(?{(strlen($1)==3)})cat)~";` against `000cat`, I get `preg_match(): Compilation failed: assertion expected after (?( at offset 8`... I don't see a callout option in the PCRE CMake file, but they may have turned it off anyway... Or that may have been unidiomatic. :) – zx81 Jun 24 '14 at 22:24

score 0 · Answer 2 · answered Nov 20 '10 at 20:38

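// preg_match_all() returns the number of matches; the matches go into the third argument.
preg_match_all("/({php (\d+) php})+/", $content, $regex);
$regex[0][0] == "{php 1 php}";
$regex[0][1] == "{php 2 php}";
$regex[0][2] == "{php 3 php}";
// The last entry holds the innermost group, (\d+).
end($regex)[0] == "1";
end($regex)[1] == "2";
end($regex)[2] == "3";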

Looking for something like this?

J V

I think he wants it to catch the nested cases too. – Orbling Nov 20 '10 at 21:24
@Orbling: Why don’t you show him the recursive regex? I’d do it but my native language is Perl, and I haven’t been able to figure out how to know which version of PCRE that a given php implementation is linked against. *(¿ʍouʞ λpoqλuɐ ƨəop)* PCRE has slightly but subtly different rules about head-vs-tail recursion in its regexes than Perl has, and I’d be afraid of doing it the wrong way. **THANKS!** – tchrist Nov 20 '10 at 21:48
@JV: It’s kinda hard to know what he wants, because he edited away the meat of the question because people were mean and jumped all over him. Maybe he’ll explain better, but I can’t say I’d be surprised if he’d been frightened away. Dunno why but I feel like a softy today; I feel bad for people who genuinely want to learn getting pushed away. – tchrist Nov 20 '10 at 21:51
@tchrist : Nobody got mean or aggressive, they told him to use an XML parser to parse HTML instead of regexps, which is entirely true. – Vincent Savard Nov 20 '10 at 22:16
@Vincent: I think you mean that it’s true that it’s (*usually*) far better to use a proper parser instead of attempting ad-hoc regexes. @Pekka’s voice-of-God Thou-Shalt-Not certainly comes across as hard to my perhaps underjaded eye. I *do* sympathize with the fatigue that long-timers must feel from having the same old naïve questions asked day in, day out. I really wish there were clear resources available demonstrating where it *does* make sense to use regexes on HTML or XML. Those cases *do* exist, even though I think they must be in the minority. – tchrist Nov 20 '10 at 22:24
@tchrist fair points. But if the thick irony in my voice-of-God isn't obvious (it might not be!), I need to work on my act! :) I'll think of something. – Pekka Nov 21 '10 at 00:49
@tchrist you may enjoy my reference question on the issue: http://stackoverflow.com/questions/3650125/how-to-parse-html-with-php-closed read past the discussion and go to Mario's answer – Pekka Nov 21 '10 at 00:52
@Pekka you made a valiant attempt. And @Mario’s answer was good, too. I think people just don’t really understand how difficult parsing HTML actually is. Did you know that there’s a huge difference between a by-the-rules parser and a useful one? That’s because of all the bad HTML out there. People just have no idea how hard it really is. If my [Oh Yes You Can](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491) posting winds up used as a demo of why you shouldn’t even try, I’d not mind at all. I wrote it to show the hardness. – tchrist Nov 21 '10 at 01:44
