
I am using a regular expression to extract the price on the right from the following HTML:

<p class="pricing ats-product-price"><em class="old_price">$99.99</em>$94.99</p>

Using preg_match_all in PHP:

preg_match_all('!<p class="pricing ats-product-price"><em class="old_price">.*?<\/em>(.*?)<\/p>!', $output, $prices);

Except, I noticed that sometimes the HTML doesn't include an old price. So sometimes the HTML looks like this:

<p class="pricing ats-product-price">$129.99</p>

It seems like my goal should be to extract the last price from the expression, or in other words the text that directly follows after the last question mark and before the </p>. This sort of expression is way out of my league though - hoping for some help here. Thanks.

Ben86
  • Don't parse HTML with regex ... just don't. – RomanPerekhrest Jan 31 '18 at 20:42
  • @RomanPerekhrest Any particular reason? I've tried using a couple of different options and I found it the quickest to develop with. What would you recommend using? – Ben86 Jan 31 '18 at 20:43
  • *The quickest* doesn't mean "the best". XML/HTML parsers are the only reliable way to handle XML/HTML data. – RomanPerekhrest Jan 31 '18 at 20:45
  • Make the old price optional: `(?:.*?<\/em>)?(.*?)<\/p>`. That way, the pattern consumes the old price if it is there, leaving just the _last_ price. – Jan 31 '18 at 21:56
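The optional-group idea from the last comment can be sketched as a complete pattern. This is a minimal sketch, not a tested production pattern: it assumes the prefix from the question's own expression, wraps the whole `<em class="old_price">…</em>` part in an optional non-capturing group, and uses a hypothetical `$output` holding both HTML variants from the question.

```php
<?php
// Hypothetical sample input: both variants shown in the question.
$output = <<<'HTML'
<p class="pricing ats-product-price"><em class="old_price">$99.99</em>$94.99</p>
<p class="pricing ats-product-price">$129.99</p>
HTML;

// (?: ... )? makes the old-price element optional; group 1 is the current price.
$pattern = '!<p class="pricing ats-product-price">(?:<em class="old_price">.*?</em>)?(.*?)</p>!';

preg_match_all($pattern, $output, $prices);

// $prices[1] holds the captured current prices, one per paragraph.
print_r($prices[1]);
```

This only works while the markup keeps exactly this shape (same attribute order, no line breaks inside the paragraph), which is the fragility the other comments warn about.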

1 Answer

Use a regular expression in combination with a parser:

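<?php

# nowdoc (quoted label), so the dollar signs are never treated as variables
$data = <<<'DATA'
    <p class="pricing ats-product-price">
        <em class="old_price">$99.99</em>
        $94.99
    </p>
    <p class="pricing ats-product-price">$129.99</p>
DATA;

# set up the dom
# (no LIBXML_HTML_NOIMPLIED: without a single root element, that flag
# makes libxml nest the second <p> inside the first)
$dom = new DOMDocument();
$dom->loadHTML($data, LIBXML_HTML_NODEFDTD);

# set up the xpath
$xpath = new DOMXPath($dom);

# last price before the end of the node's text; group 1 excludes trailing whitespace
$regex = '~(\$\d+[\d.]*)\b\s*\Z~';
foreach ($xpath->query("//p") as $line) {
    if (preg_match($regex, $line->nodeValue, $match)) {
        echo $match[1] . "\n";
    }
}

This yields

$94.99
$129.99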


The snippet sets up the DOM, queries it for p tags and searches for the last price within.
See a demo for the expression on regex101.com.
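To see what the expression does on its own, here is a small sketch that runs the answer's pattern against a few hypothetical strings standing in for `$line->nodeValue`. The sample strings are assumptions made for illustration; only the pattern comes from the answer.

```php
<?php
// The answer's pattern: a dollar amount, optional trailing whitespace, then end of string.
$regex = '~\$\d+[\d.]*\b\s*\Z~';

// Hypothetical node values (not taken from a real page).
$samples = [
    '$99.99 $94.99',      // old price followed by current price
    "  \$129.99\n  ",     // single price surrounded by whitespace
    'sold out',           // no price at all
];

$found = [];
foreach ($samples as $text) {
    if (preg_match($regex, $text, $match)) {
        // The match includes the trailing whitespace, so trim it.
        $found[] = trim($match[0]);
    }
}

print_r($found);
```

Because of the `\Z` anchor, an earlier price such as `$99.99` can never be the result: the match has to run up to the end of the string.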
Jan
  • I don't think it's necessary to check what the nodeValue looks like with a pattern. Here the main goal is to return the old price when it exists or the price when the old price doesn't exist. You can do it with an XPath query, for example: `//p[./@class[contains(.,"pricing") and contains(.,"ats-product-price")]]/em[contains(@class,"old_price")] | //p[./@class[contains(.,"pricing") and contains(.,"ats-product-price")]][not(./em)]` – Casimir et Hippolyte Jan 31 '18 at 21:13
  • @CasimiretHippolyte: I find mine slightly more readable, admittedly :) But you are right, one could do without a regular expression here. – Jan Jan 31 '18 at 21:14
  • Also, be careful with the option `LIBXML_HTML_NOIMPLIED`: when the document doesn't have a single root element, libxml silently uses the first element as the root and moves its closing tag to the end, so the elements that follow end up nested inside it. – Casimir et Hippolyte Jan 31 '18 at 21:20
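Following the two comments above, here is a sketch of an XPath-only variant. It is not the query from the comment: it targets the current price the question asks for, by taking the last non-blank text node that is a direct child of each price paragraph. It also leaves out `LIBXML_HTML_NOIMPLIED`, so libxml adds the `<html>`/`<body>` wrappers and the paragraphs stay siblings. The sample markup and variable names are assumptions for illustration.

```php
<?php
// Hypothetical sample input: both variants shown in the question.
$data = <<<'HTML'
<p class="pricing ats-product-price"><em class="old_price">$99.99</em>$94.99</p>
<p class="pricing ats-product-price">$129.99</p>
HTML;

$dom = new DOMDocument();
// Without LIBXML_HTML_NOIMPLIED the fragment gets implied <html><body> parents.
$dom->loadHTML($data, LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// text()             : text nodes that are direct children of the paragraph
// [normalize-space()]: drop whitespace-only text nodes
// [last()]           : keep the last remaining one per paragraph
$query = '//p[contains(@class, "ats-product-price")]/text()[normalize-space()][last()]';

$current = [];
foreach ($xpath->query($query) as $node) {
    $current[] = trim($node->nodeValue);
}

print_r($current);
```

The old price lives inside the `<em>` element, so its text is not a direct child of the paragraph and is skipped without any regular expression.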