Regex match HTML wrapped around HTML

Question

A while ago I asked here about matching the text between <code>..</code> tags in a string. It worked great until somebody put other HTML inside the <code> tags.

This is how I'm doing it so far:

preg_match_all("!<code>([^<]*)</code>!", $string, $return_array);

Could anybody improve this regular expression to solve my problem?
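
To make the failure concrete, here is a minimal reproduction. The sample `$string` is a made-up input, not taken from the original post; the pattern is the one shown above.

```php
<?php
// Hypothetical sample: the second block has HTML inside <code>.
$string = '<code>plain text</code> <code>has <b>markup</b> inside</code>';

preg_match_all("!<code>([^<]*)</code>!", $string, $return_array);

// [^<]* cannot cross a "<", so the inner <b> stops the match
// before </code> is reached and the second block is skipped.
print_r($return_array[1]);
```

Only the first block is captured, because the character class excludes the very character that starts any nested tag.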

Don't use regex to parse HTML. Period. Use a proper HTML parser. — cdhowie, Dec 20 '10 at 10:07
I'll be the first to say it: use a parser. You will never account for everything people can put there. If you allow HTML, no regexp will do. — naugtur, Dec 20 '10 at 10:08
@cdhowie and naugtur, some regex dialects allow for so-called balanced groups, which actually allow proper HTML or XML parsing with some effort. But the PHP flavor doesn't, so your comment is true in this case. — Lucero, Dec 20 '10 at 10:15
You can't parse [X]HTML with regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — viam0Zah, Dec 20 '10 at 10:18
possible duplicate of [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) — Gordon, Dec 20 '10 at 10:35

score 4 · Accepted Answer · answered Dec 20 '10 at 10:13

This is one case where I have to agree with the dreaded "regex is evil" meme. For straightforward extraction purposes, regular expressions are often suitable. But if you want to process malformed and/or nested HTML, they are not an option without significant fuss.

Hence I'd recommend using phpQuery or QueryPath for such occasions. It's also pretty simple:

print qp($html)->find("code")->text();
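
If installing a third-party library is not an option, roughly the same extraction can be sketched with PHP's bundled DOM extension. This is a swapped-in alternative, not the QueryPath call above, and the `$html` sample is a made-up input for illustration.

```php
<?php
// Hypothetical input: a <code> element that itself contains HTML.
$html = '<p>Intro</p><code>echo <b>"hi"</b>;</code><code>plain</code>';

// Collect libxml warnings internally instead of emitting them;
// the parser recovers from malformed markup where it can.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$texts = [];
foreach ($doc->getElementsByTagName('code') as $node) {
    // textContent concatenates all descendant text, dropping inner tags.
    $texts[] = $node->textContent;
}

print_r($texts);
```

Each entry in `$texts` holds the text of one `<code>` element with the nested markup removed. If the inner HTML itself is wanted instead, `$doc->saveHTML($child)` on each child node would be one way to get it.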

score 0 · Answer 2 · answered Dec 20 '10 at 10:08

Have you tried this?

preg_match_all("!<code>(.*?)</code>!", $string, $return_array);
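
A quick sketch of how the lazy pattern behaves may help before relying on it. The inputs below are made up for illustration. Note two limits that the pattern does not address: it needs the `s` modifier to span line breaks, and it still stops at the first `</code>`, so nested `<code>` elements would be cut short.

```php
<?php
// Hypothetical inputs, chosen to probe the lazy pattern.
$inline    = '<code>a <b>bold</b> word</code>';
$multiline = "<code>line one\nline two</code>";

// The lazy pattern as suggested: captures the inner markup, tags included.
preg_match_all('!<code>(.*?)</code>!', $inline, $m1);

// Without the "s" modifier, "." does not match a newline,
// so the multi-line block is not matched at all.
preg_match_all('!<code>(.*?)</code>!', $multiline, $m2);

// With "s", "." also matches newlines and the block is captured.
preg_match_all('!<code>(.*?)</code>!s', $multiline, $m3);

// strip_tags() removes the inner markup if only the text is wanted.
$plain = array_map('strip_tags', $m1[1]);

print_r([$m1[1], $m2[1], $m3[1], $plain]);
```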

Jamie McElwain
