I am trying to extract prices for a product from a webpage using a PHP script. The string in question consists of the following HTML:

<div class="pd_warranty col-xs-12 no-padding">
    <p class="selectWty txtLeft">Available Options</p>
    <div class="vspace clear"></div>

<div class="subProd col-xs-4 noPadLR">
    <a href="https://www.example.com/single” class="selected">
        <div class="col-xs-12 cellTable pad5All">
            <div class="col-xs-8 noPadLR cellTableCell">
                <p class="noMar txtLeft">Single</p>
                <p class="noMar txtLeft sml">$99.99</p>
            </div>
        </div>
    </a>
</div>

<div class="subProd col-xs-4 noPadLR">
    <a href="https://www.example.com/2pack” class="">
        <div class="col-xs-12 cellTable pad5All">
            <div class="col-xs-8 noPadLR cellTableCell">
                <p class="noMar txtLeft">2-PACK</p>
                <p class="noMar txtLeft sml">$159.99</p>
            </div>
        </div>
    </a>
</div>

<div class="subProd col-xs-4 noPadLR">
    <a href="https://www.example.com/4pack” class="">
        <div class="col-xs-12 cellTable pad5All">
            <div class="col-xs-8 noPadLR cellTableCell">
                <p class="noMar txtLeft">4-PACK</p>
                <p class="noMar txtLeft sml">$249.99</p>
            </div>
        </div>
    </a>
</div>

</div> 

Most products have three price groups: Single, 2-PACK, and 4-PACK.

Some pages may be missing the 2-PACK, the 4-PACK, or both.

I tried and failed to write a regex that extracts the info I need from a variable holding the above string. I want a PHP regex that captures each type (Single/2-PACK/4-PACK) together with its price into an array[type][price], so I can tell which types are present in the HTML and what each one costs.

Any help with the regex would be greatly appreciated.

Ryan A
  • Why regex? Why not another way? – apokryfos May 22 '18 at 12:33
  • Possible duplicate of [How do you parse and process HTML/XML in PHP?](https://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) – Justinas May 22 '18 at 12:34
  • Use a parser. Iterate over all divs with class `subProd` then in that enter the `a/div/div/`and pull both `p`s to get the price and descrip. Class `sml` seems to be the price. – user3783243 May 22 '18 at 12:43
  • [**THE PONY HE COMES**](https://stackoverflow.com/a/1732454/1954610) ... Do not use regex to parse HTML. – Tom Lord May 22 '18 at 12:49
  • Looks like no coding attempt. Use a parser. Start researching. SO is not a free coding service. Xpath is your friend. – mickmackusa May 22 '18 at 12:50
  • I tried XPath in Google Sheets after extracting the XPath from inspect element, and it returned no data. I spent over 2 hours playing with XPath before giving up and trying to do this in PHP. I am not asking for a coding service. I am asking for help with the regex. I could use strpos to do this as well, but I thought regex would be cleaner as long as the code stays the same on the page being scanned. – Ryan A May 22 '18 at 17:56
  • Geez, check out the disclaimer on Jakub's answer. Definitely don't use that. Who wants to go on in their career wondering if a program actually works from day to day. If you tried XPath then you were doing the right thing. 2 hours using the right tool without a result is time well spent -- don't surrender. Edit your question and remove all mention of regex, show your xpath attempt; we can help you. I am not the only one urging you to do the right thing, read the comments. If you think it is rude to change your question, you can always delete it and post a new one. – mickmackusa May 22 '18 at 20:39
  • Furthermore, your question is not for your eyes only. Thousands of future researchers will visit this StackOverflow page and be encouraged to mimic your technique. Don't you agree that it would be best to present best practices? – mickmackusa May 22 '18 at 20:41
  • This is no time to try to pop a pimple with boltcutters. Good results start from selecting the right tool. This is not bullying, this is professional encouragement. – mickmackusa May 22 '18 at 20:43
  • I'm not just one of these users that comments "use a parser" and leaves. I post solutions in regex and domdocument/xpath depending on what is better suited. Here to help (when I'm not at work). – mickmackusa May 22 '18 at 20:55

2 Answers

Note that parsing HTML with regular expressions is fragile and is likely to break whenever the page's markup changes. You will constantly have to compromise between matching too specifically and matching too loosely.

Here it is:

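// s: "." also matches newlines; i: case-insensitive; m has no effect here (no ^ or $ anchors).
$pattern = '#<div class="subProd.*?<p class="noMar[^>]+>(?P<product>[^<]+).*?<p class="noMar[^>]+>(?P<price>[^<]+)<#smi';
if (preg_match_all($pattern, $html, $matches)) {
    // Pair each captured product name with its captured price.
    $products = array_combine($matches['product'], $matches['price']);

    var_dump($products);
}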

Will dump:

array(3) {
   ["Single"]=> string(6) "$99.99"
   ["2-PACK"]=> string(7) "$159.99"
   ["4-PACK"]=> string(7) "$249.99"
}

Pattern explained:

  • # is the pattern delimiter.
  • <div class="subProd matches the string literally.
  • .*? matches any character any number of times, but is not greedy: it matches the shortest possible string before the next matching part of the pattern. The s modifier lets . match newlines as well.
  • <p class="noMar matches the string literally.
  • [^>]+> uses a negated character class. It matches one or more characters other than >, followed by a literal >.
  • (?P<product>[^<]+) is a named capture group (inside ()). It makes the match available under the product key in $matches later. It matches one or more characters other than <.
  • .*? matches any character, not greedy.
  • <p class="noMar is a literal string.
  • [^>]+> matches any characters other than >, up to and including >.
  • (?P<price>[^<]+)< matches any characters other than <, up to a <. The part before the < is captured in the price group.
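
As a rough sketch of how the pattern behaves when an option is missing, the snippet below runs it against a trimmed, hypothetical page that offers only Single and 2-PACK. The sample markup, the PREG_SET_ORDER loop, and the "not offered" label are illustrative assumptions, not part of the original code; the `??` operator needs PHP 7 or later.

```php
<?php
// Hypothetical sample: a trimmed page that offers only Single and 2-PACK.
$html = <<<'HTML'
<div class="subProd col-xs-4 noPadLR">
    <a href="https://www.example.com/single" class="selected">
        <p class="noMar txtLeft">Single</p>
        <p class="noMar txtLeft sml">$99.99</p>
    </a>
</div>
<div class="subProd col-xs-4 noPadLR">
    <a href="https://www.example.com/2pack" class="">
        <p class="noMar txtLeft">2-PACK</p>
        <p class="noMar txtLeft sml">$159.99</p>
    </a>
</div>
HTML;

$pattern = '#<div class="subProd.*?<p class="noMar[^>]+>(?P<product>[^<]+).*?<p class="noMar[^>]+>(?P<price>[^<]+)<#smi';

// Start from an empty array so a page without any match still yields an array.
$products = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // trim() guards against stray whitespace around the captured text.
        $products[trim($match['product'])] = trim($match['price']);
    }
}

// A missing option is simply an absent key, so test each type by name.
foreach (['Single', '2-PACK', '4-PACK'] as $type) {
    echo $type, ': ', $products[$type] ?? 'not offered', PHP_EOL;
}
```

Because each subProd block is matched on its own, an option that is not on the page never becomes a key, so isset() or ?? on the type name is enough to test for it.
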
Jakub Zalas

There will be many ways to customize the xpath and iterated node handling, but this does work on your sample string. You can refine this solution to be more or less strict depending on your needs.

(Jakub forced me to post this answer, since I don't want you to have to resort to regex.)

Code: (Demo)

$dom = new DOMDocument;
$dom->loadHTML(str_replace('”', '"', $html));  // normalize the quoting; extend as needed
$xpath = new DOMXPath($dom);
$result = [];  // initialize so var_export() still prints an array when nothing matches
//                        actually targeting this div ---------vvv
foreach ($xpath->evaluate("//div[contains(@class, 'subProd')]//div[contains(p/@class, 'noMar')]") as $div) {
    $type = $xpath->query("p[contains(@class, 'noMar') and not(contains(@class, 'sml'))]", $div)[0]->nodeValue;
    $price = $xpath->query("p[contains(@class, 'noMar') and contains(@class, 'sml')]", $div)[0]->nodeValue;
    $result[$type] = $price;
}
var_export($result);

Output:

array (
  'Single' => '$99.99',
  '2-PACK' => '$159.99',
  '4-PACK' => '$249.99',
)

To explain...

The input for the foreach() targets each div that has child p elements carrying the noMar class. For every qualifying div found in the HTML...

  • the type text is extracted from the p element whose class contains noMar but not sml
  • the price text is extracted from the p element whose class contains both noMar and sml

I am storing the extracted data as a one-dimensional associative array.
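
For pages where some options are absent, a slightly more defensive variant might look like the sketch below. It is only an illustration under stated assumptions: the sample markup is a hypothetical page with just the Single option, the null checks and the float conversion are additions of mine, and the conversion assumes US-style prices such as $1,249.99.

```php
<?php
// Hypothetical sample: only the Single option is present on this page.
$html = <<<'HTML'
<div class="pd_warranty col-xs-12 no-padding">
<div class="subProd col-xs-4 noPadLR">
    <a href="https://www.example.com/single" class="selected">
        <div class="col-xs-12 cellTable pad5All">
            <div class="col-xs-8 noPadLR cellTableCell">
                <p class="noMar txtLeft">Single</p>
                <p class="noMar txtLeft sml">$99.99</p>
            </div>
        </div>
    </a>
</div>
</div>
HTML;

libxml_use_internal_errors(true);  // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$result = [];
foreach ($xpath->query("//div[contains(@class, 'subProd')]//div[contains(p/@class, 'noMar')]") as $div) {
    // item(0) returns null when the node list is empty.
    $type  = $xpath->query("p[contains(@class, 'noMar') and not(contains(@class, 'sml'))]", $div)->item(0);
    $price = $xpath->query("p[contains(@class, 'noMar') and contains(@class, 'sml')]", $div)->item(0);
    if ($type === null || $price === null) {
        continue;  // skip blocks where either paragraph is missing
    }
    // Assumption: strip "$" and thousands separators, then cast to float.
    $result[trim($type->nodeValue)] = (float) str_replace(['$', ','], '', $price->nodeValue);
}

var_export($result);
```

With the prices stored as numbers, checking for an option is a plain isset($result['2-PACK']), and the values can be compared or summed directly.
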

mickmackusa
  • Thank you for your commitment to helping users on Stack Overflow. I wasn't aware XPath could be used in PHP. I was trying to use the IMPORTXML function in Google Sheets. That's why I asked the question about regex. I didn't understand one part of your answer. Could you explain what this does: $results[$node->childNodes[1]->nodeValue] = $node->childNodes[3]->nodeValue; – Ryan A May 24 '18 at 03:09
  • I am at work so I must be brief. Those are the non-whitespace "nodes". There are 5 total nodes in the target div. You want the 2nd and the 4th. The node keys are zero indexed. On the first iterated div, childnode[1] is `Single`. – mickmackusa May 24 '18 at 03:11
  • There are tricks to remove all needless whitespace nodes from the document (that would be best) but I couldn't get the techniques to work with my code / your sample ...in the time that I had. – mickmackusa May 24 '18 at 03:15
  • @RyanA see the newline and consecutive spaces represented in this demo: https://3v4l.org/UiShI – mickmackusa May 24 '18 at 03:23
  • @RyanA I've refined my answer a bit. Does this all make sense? – mickmackusa May 24 '18 at 22:11
  • Sorry been really busy with a newborn. Will get back to you next week when I get back to this. Thank you for your help. – Ryan A May 29 '18 at 13:56