I asked here a while ago about matching the text between <code>..</code> tags in a string, and it worked great until somebody put other HTML inside the <code> tags.

This is how I'm doing it so far:

preg_match_all("!<code>([^<]*)</code>!", $string, $return_array);

Could anybody improve this regular expression to solve my problem?

Tot Zam
tarnfeld
  • Don't use regex to parse HTML. Period. Use a proper HTML parser. – cdhowie Dec 20 '10 at 10:07
  • I'll be the first to say it: use a parser. You will never account for everything people can put there. If you allow HTML, no regexp will do. – naugtur Dec 20 '10 at 10:08
  • @cdhowie and naugtur, some regex dialects allow for so-called balanced groups, which actually allow proper HTML or XML parsing with some effort. But the PHP flavor doesn't, so your comment is true in this case. – Lucero Dec 20 '10 at 10:15
  • You can't parse [X]HTML with regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – viam0Zah Dec 20 '10 at 10:18
  • possible duplicate of [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Dec 20 '10 at 10:35

2 Answers

This is one case where I have to agree with the dreaded "regexes are evil" meme. For straightforward extraction, regular expressions are often suitable. But if you want to process malformed and/or nested HTML, they're not an option without significant fuss.

Hence I'd recommend using phpQuery or QueryPath for such occasions. It's also pretty simple:

print qp($html)->find("code")->text();
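
If pulling in a library isn't an option, the same idea can be sketched with PHP's built-in DOM extension instead of QueryPath. This is only a rough sketch under assumptions: the sample `$html` string and the `innerHtml()` helper are made up for illustration, and the input is assumed to be plain ASCII.

```php
<?php
// Hypothetical input: the first <code> element contains other markup.
$html = '<p>Intro</p><code>Use <b>bold</b> text</code><code>plain</code>';

// Collect parser warnings internally instead of printing them,
// since real-world HTML is often not perfectly valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Hypothetical helper: serialize the children of a node back to HTML.
function innerHtml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        $out .= $node->ownerDocument->saveHTML($child);
    }
    return $out;
}

$texts = [];  // text only, nested tags stripped
$markup = []; // inner HTML, nested tags kept
foreach ($doc->getElementsByTagName('code') as $node) {
    $texts[] = $node->textContent;
    $markup[] = innerHtml($node);
}

print_r($texts);
print_r($markup);
```

Here textContent drops the nested tags, while the helper keeps them, so you can pick whichever fits what you do with the matches afterwards.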
mario

Have you tried this?

preg_match_all("!<code>(.*?)</code>!", $string, $return_array);
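
To see how this differs from the pattern in the question, here is a small sketch on a made-up `$string` (the sample input and the `$old_array` name are assumptions). The `s` modifier is added so that `.` also matches line breaks inside the tags. It still won't cope with a <code> nested inside another <code>, which is where a parser is the safer route.

```php
<?php
// Hypothetical input: markup and a line break inside the first <code>.
$string = "<code>a <b>bold</b>\nline</code> and <code>second</code>";

// Pattern from the question: [^<]* stops at the first "<",
// so the first block (which contains <b>) is skipped.
preg_match_all('!<code>([^<]*)</code>!', $string, $old_array);

// Lazy .*? stops at the nearest </code>; the s modifier lets
// . match newlines too.
preg_match_all('!<code>(.*?)</code>!s', $string, $return_array);

print_r($old_array[1]);
print_r($return_array[1]);
```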