I'd like to preg_match all div ids following the p tag with name="groups". How do I write this expression in PHP? (The HTML is malformed, so I can't use XPath ...)

<p name="groups">
  <div id="55">fifty-five</div>
  <div id="65">sixty-five</div>
  <div id="75">seventy-five</div>
</p>

The ideal output would be something like:

  array
    55
    65
    75

  array
    fifty-five
    sixty-five
    seventy-five
dani
  • Please post the code that uses XPath and fails. – Tomalak Apr 24 '12 at 14:09
  • `The html is malformed so I can't use XPath` - and you think regex will play nicer? Think again... – DaveRandom Apr 24 '12 at 14:13
  • [don't use regex to parse html](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) – JKirchartz Apr 24 '12 at 14:14
  • Can't you just point me to a decent "match anything after this" resource? I know regex will work and not XPath - that is why I included it in the question ... – dani Apr 24 '12 at 14:14
  • The thought that regexes will work is self-deception in general case. There's always a relatively high possibility that small changes in the HTML will result in unpredictable behaviour. – Exander Apr 24 '12 at 14:23
  • @dani *"Can't you just point me to a decent "match anything after this" resource?"* - No. *" I know regex will work and not XPath"* - wrong again. You just have not tried enough to get XPath to work. – Tomalak Apr 24 '12 at 14:27
  • Following the p tag? No way. Maybe you meant children or descendants of the p tag? – goat Apr 24 '12 at 14:35
  • By the way, the DOM extension recovers from soup HTML pretty well. You can also run the HTML through the HTML Tidy extension first. – goat Apr 24 '12 at 14:36

1 Answer

While using regex to parse HTML in general is usually a bad idea, using it to match specific, predictable pieces of HTML for a limited job can be fine.

Here's the regex:

<p name="groups">(\s*<div id="([0-9]+)">([a-z\-]+)</div>)+\s*</p>
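
One caveat when using that pattern from PHP: a repeated capture group such as `(...)+` keeps only its last iteration, so a single `preg_match` call would report just `75` and `seventy-five`. A minimal sketch of one way around this, assuming the markup looks exactly like the sample in the question, is to isolate the `<p name="groups">` block first and then run `preg_match_all` over the divs inside it. The variable names (`$html`, `$block`, `$ids`, `$labels`) are illustrative only and not part of the original answer.

```php
<?php
// Sample input copied from the question.
$html = <<<HTML
<p name="groups">
  <div id="55">fifty-five</div>
  <div id="65">sixty-five</div>
  <div id="75">seventy-five</div>
</p>
HTML;

$ids = [];
$labels = [];

// Step 1: isolate everything between <p name="groups"> and the first </p>.
// The "s" modifier lets "." match newlines; ".*?" stops at the first </p>.
if (preg_match('~<p name="groups">(.*?)</p>~s', $html, $block)) {
    // Step 2: collect every div inside that block.
    // With the default PREG_PATTERN_ORDER, $m[1] holds all ids
    // and $m[2] holds all inner texts.
    preg_match_all('~<div id="([0-9]+)">([a-z\-]+)</div>~', $block[1], $m);
    $ids = $m[1];
    $labels = $m[2];
}

print_r($ids);    // the ids, as strings
print_r($labels); // the div texts
```

This stays as strict as the original pattern: any extra attribute, uppercase letter, or nested tag inside a div will cause that div to be skipped, so it is only suitable when the input format is known and stable.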
HappyTimeGopher