
How would I get content from HTML between h3 tags inside an element that has class pricebox? For example, the following string fragment

<!-- snip a lot of other html content -->
<div class="pricebox">
    <div class="misc_info">Some misc info</div>
    <h3>599.99</h3>
</div>
<!-- snip a lot of other html content -->

The catch is that 599.99 has to be the first match returned; that is, if the function call is

preg_match_all($regex,$string,$matches)

the 599.99 has to be in $matches[0][1] (because I use the same script to get numbers from dissimilar looking strings with different $regex - the script looks for the first match).

Dr.Kameleon
DMIL
  • Seriously? Again? [Parsing HTML with regular expressions](http://stackoverflow.com/a/1732454/1023815)? – Adam Zalcman Mar 23 '12 at 00:34
  • Try http://simplehtmldom.sourceforge.net/ for DOM manipulation. PHP has some awesome DOM manipulation support as well. Good programmers mostly do not recommend using regex for DOM parsing. – Khurram Ijaz Mar 23 '12 at 00:37
  • Well, the answer you point to sounds a bit hysterical. HTML is just a string, it's not magical, and I need to match something between the first pair of h3 tags (again just strings) that come up after a substring 'class="pricebox"'. – DMIL Mar 23 '12 at 00:43
  • Thanks Mian, that sounds useful but I need something that is independent of the actual PHP that's doing the parsing - I paste a regex into a CMS and the script uses that regex to get the data. – DMIL Mar 23 '12 at 00:50

1 Answer


Try using XPath; definitely NOT RegEx.

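Code:

$html = new DOMDocument();
// Real-world HTML is often malformed; @ suppresses the resulting parser warnings
@$html->loadHTMLFile('http://www.path.to/your_html_file_html');

$xpath = new DOMXPath($html);
// Select every <h3> that is a direct child of <div class="pricebox">
$nodes = $xpath->query("//div[@class='pricebox']/h3");

// Print the text of each matching node, one per line
foreach ($nodes as $node) {
    echo $node->nodeValue . "\n";
}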
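
If the HTML is already in a string (as with the `$string` passed to `preg_match_all` in the question), the same query can be run without fetching a file. This is only a minimal sketch: the heredoc fragment and the `$first` variable are made up for illustration, and it assumes the `pricebox` div carries exactly that one class.

```php
<?php
// Hypothetical in-memory fragment mirroring the example in the question
$string = <<<HTML
<div class="pricebox">
    <div class="misc_info">Some misc info</div>
    <h3>599.99</h3>
</div>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them
libxml_use_internal_errors(true);
$doc->loadHTML($string);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[@class='pricebox']/h3");

// Take only the first matching <h3>, or null when nothing matched
$first = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
echo $first, "\n"; // prints 599.99
```

Note that `@class='pricebox'` compares the whole attribute value, so a div with `class="pricebox sale"` would not match; in that case a `contains(...)` test on `@class` is the usual workaround.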
Dr.Kameleon
  • Thanks, I'll check it out. What I need is to be able to paste a matching pattern into a CMS and have the script handle it, without altering the script in any way for completely different strings. This looks promising. – DMIL Mar 23 '12 at 00:54
  • @DMIL For customisable query strings regarding HTML parsing, `XPath` is definitely the way to go... (and it's REALLY easy to understand; and much easier to handle than `RegEx`...) – Dr.Kameleon Mar 23 '12 at 00:56
  • But what if there's content between `<h3>` tags like `<h3>only $599.99</h3>`? How would I get that number with XPath? I can't use XPath and then regex because whatever pattern that gets the number needs to be entered in a text field in the CMS. I suppose I could have two fields, one for the XPath pattern, the other for a regex to clean up whatever XPath returns but... that's a pain in the ass too... – DMIL Mar 23 '12 at 01:21
  • @DMIL Well, what XPath does is simply to traverse a "branch" of the... HTML tree structure and fetch its value... e.g. `/html/body/div/p/div/h3`. Don't confuse it with RegEx. In your example, XPath would return `only $599.99`, and getting JUST the numeric value would be a whole different issue (that one, probably REQUIRING RegEx...). Seems like a pain in the ass? Probably. But, still it's simpler 'coz you'll be using the different coding techniques for what they were 'designed' for... ;-) – Dr.Kameleon Mar 23 '12 at 01:26
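
The two-field idea discussed in these comments (an XPath pattern to locate the node, then a regex to clean up its text) could look roughly like the sketch below. The function name `extractPrice`, its parameters, and the sample regex are assumptions for illustration, not part of the original script or CMS.

```php
<?php
// Hypothetical helper: both patterns arrive as plain strings,
// e.g. from two text fields in the CMS.
function extractPrice(string $html, string $xpathQuery, string $regex): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally, then restore the previous setting
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Step 1: XPath locates the node
    $nodes = (new DOMXPath($doc))->query($xpathQuery);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Step 2: the regex pulls the wanted part out of the first node's text
    if (preg_match($regex, $nodes->item(0)->nodeValue, $m) === 1) {
        return $m[0];
    }
    return null;
}

$html = '<div class="pricebox"><h3>only $599.99</h3></div>';
echo extractPrice($html, "//div[@class='pricebox']/h3", '/\d+(?:\.\d+)?/'), "\n"; // prints 599.99
```

The script itself stays unchanged for dissimilar pages; only the two stored patterns differ per source.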