I'm using the Simple HTML DOM library with PHP.

I'm also querying the HTML with JavaScript, mainly to isolate the selectors I need.

I first experiment in JavaScript (with querySelectorAll), and once a selector works I reuse it in PHP with Simple HTML DOM (with ->find).

Let's take the following page to illustrate the issue: Page from Amazon.fr

$html = file_get_html('PAGE URL FROM EXAMPLE ABOVE');
$products = $html->find('.s-item-container');
for ($z = 0; $z < count($products); $z++) {
    // Read the title text from each matching element inside this product.
    foreach ($products[$z]->find('.a-row .a-spacing-none .s-access-detail-page') as $titles) {
        $title = $titles->plaintext;
    }
}

This works perfectly: if I echo $title, I get the right title.

Now I want to capture prices with this:

$products[$z]->find('div > div.a-fixed-left-grid > div > div.a-fixed-left-grid-col.a-col-right > div:nth-child(2) > div.a-column.a-span7 > div.a-row.a-spacing-none > a > span.a-size-base.a-color-price.s-price.a-text-bold');

If I display the content of $products[$z], I can see the element I'm looking for in the HTML, but the find function doesn't return it.

I tried exactly the same selector in JavaScript with:

document.querySelectorAll('div > div.a-fixed-left-grid > div > div.a-fixed-left-grid-col.a-col-right > div:nth-child(2) > div.a-column.a-span7 > div.a-row.a-spacing-none > a > span.a-size-base.a-color-price.s-price.a-text-bold');

And it works fine, returning the matching nodes (screenshot omitted).

I don't understand why this happens. Normally, when a selector works in JS it works in PHP as well. This is the only case where it doesn't, and I can't see what's going wrong.

Do you have any idea?

Thanks! Laurent

As requested below, here is a dump (sorry, it's difficult to read):

<div class="s-item-container"><div class="a-fixed-left-grid"><div class="a-fixed-left-grid-inner" style="padding-left:218px"><div class="a-fixed-left-grid-col a-col-left" style="width:218px;margin-left:-218px;float:left;"><div class="a-row"><div aria-hidden="true" class="a-column a-span12 a-text-center"><a class="a-link-normal a-text-normal" href="https://www.amazon.fr/MSI-GEFORCE-GTX-1060-3GT/dp/B01KHWOB78/ref=sr_1_18?ie=UTF8&amp;qid=1533908456&amp;sr=8-18&amp;keywords=GTX+1060"><img src="https://images-eu.ssl-images-amazon.com/images/I/518zhXqNWcL._AC_US218_.jpg" srcset="https://images-eu.ssl-images-amazon.com/images/I/518zhXqNWcL._AC_US218_.jpg 1x, https://images-eu.ssl-images-amazon.com/images/I/518zhXqNWcL._AC_US327_FMwebp_QL65_.jpg 1.5x, https://images-eu.ssl-images-amazon.com/images/I/518zhXqNWcL._AC_US436_FMwebp_QL65_.jpg 2x, https://images-eu.ssl-images-amazon.com/images/I/518zhXqNWcL._AC_US500_FMwebp_QL65_.jpg 2.2935x" width="218" height="218" alt="MSI GEFORCE GTX 1060 3GT OC Carte Graphique , 3 GB" class="s-access-image cfMarker" data-search-image-load></a><div class="a-section a-spacing-none a-text-center"></div></div></div></div><div class="a-fixed-left-grid-col a-col-right" style="padding-left:2%;float:left;"><div class="a-row a-spacing-small"><div class="a-row a-spacing-none"><a class="a-link-normal s-access-detail-page  s-color-twister-title-link a-text-normal" title="MSI GEFORCE GTX 1060 3GT OC Carte Graphique , 3 GB" href="https://www.amazon.fr/MSI-GEFORCE-GTX-1060-3GT/dp/B01KHWOB78/ref=sr_1_18?ie=UTF8&amp;qid=1533908456&amp;sr=8-18&amp;keywords=GTX+1060"><h2 data-attribute="MSI GEFORCE GTX 1060 3GT OC Carte Graphique , 3 GB" data-max-rows="0" class="a-size-medium s-inline  s-access-title  a-text-normal">MSI GEFORCE GTX 1060 3GT OC Carte Graphique , 3 GB</h2></a></div><div class="a-row a-spacing-none"><span class="a-size-small a-color-secondary">de </span><span class="a-size-small a-color-secondary">MSI</span></div></div><div class="a-row"><div 
class="a-column a-span7"><div class="a-row a-spacing-none"><a class="a-link-normal a-text-normal" href="https://www.amazon.fr/MSI-GEFORCE-GTX-1060-3GT/dp/B01KHWOB78/ref=sr_1_18?ie=UTF8&amp;qid=1533908456&amp;sr=8-18&amp;keywords=GTX+1060"><span class="a-size-small a-color-secondary"></span><span class="a-size-base a-color-price s-price a-text-bold">EUR 237,36</span></a><span class="a-letter-space"></span><span class="a-size-small a-color-secondary">+ EUR 14,90 Livraison</span></div><div class="a-row a-spacing-mini"><div class="a-row a-spacing-none"><span class="a-size-small a-color-price">Plus que 4 ex. Commandez vite !</span></div></div><div class="a-row a-spacing-mini"><div class="a-row a-spacing-none"><div class="a-row a-spacing-mini"></div><span class="a-size-small a-color-secondary">Autres vendeurs sur Amazon</span></div><div class="a-row a-spacing-none"><a class="a-size-small a-link-normal a-text-normal" href="https://www.amazon.fr/gp/offer-listing/B01KHWOB78/ref=sr_1_18_olp?ie=UTF8&amp;qid=1533908456&amp;sr=8-18&amp;keywords=GTX+1060"><span class="a-color-secondary a-text-strike"></span><span class="a-size-base a-color-price a-text-bold">EUR 176,54</span><span class="a-letter-space"></span>(28 d’occasion & neufs)</a></div></div></div><div class="a-column a-span5 a-span-last"><div class="a-row a-spacing-mini"><span name="B01KHWOB78">      <span class="a-declarative" data-action="a-popover" data-a-popover="{&quot;max-width&quot;:&quot;700&quot;,&quot;closeButton&quot;:&quot;false&quot;,&quot;position&quot;:&quot;triggerBottom&quot;,&quot;url&quot;:&quot;/review/widgets/average-customer-review/popover/ref=acr_search__popover?ie=UTF8&amp;asin=B01KHWOB78&amp;contextId=search&amp;ref=acr_search__popover&quot;}"><a href="javascript:void(0)" class="a-popover-trigger a-declarative"><i class="a-icon a-icon-star a-star-4"><span class="a-icon-alt">4,2 étoiles sur 5</span></i><i class="a-icon a-icon-popover"></i></a></span></span>    <a class="a-size-small a-link-normal 
a-text-normal" href="https://www.amazon.fr/MSI-GEFORCE-GTX-1060-3GT/dp/B01KHWOB78/ref=sr_1_18?ie=UTF8&amp;qid=1533908456&amp;sr=8-18&amp;keywords=GTX+1060#customerReviews">18</a></div><div class="a-row a-spacing-mini"><span class="a-size-small a-color-secondary a-text-bold">Descriptions du produit</span><br><span class="a-size-small a-color-secondary">... La carte graphique MSI GeForce <em>GTX 1060</em> 3GT OC met la VR (R&eacute;alit ...</span></div></div></div></div></div></div></div>
Laurent
  • So what's the result of doing a `var_dump($products[$z]->find('REALLY LONG SELECTOR HERE'));`? – Patrick Q Aug 10 '18 at 14:36
  • Pedantic note - using `querySelector()` in a web page loaded by a browser is not "parsing" – Sean Bright Aug 10 '18 at 14:39
  • Maybe the library you use doesn't support the child selector `>`; see [this](http://simplehtmldom.sourceforge.net/manual.htm#section_find). Can you verify this? Maybe with a simple find on `div > div`. – kip Aug 10 '18 at 14:39
  • In the first example, there are child selectors and it works. – Laurent Aug 10 '18 at 14:40
  • @Laurent I'm referring to the `>` child selector, as you can see [here](https://www.w3.org/TR/CSS21/selector.html#child-selectors), not the [descendant selector](https://www.w3.org/TR/CSS21/selector.html#descendant-selectors). Please read the library doc linked in my previous comment. – kip Aug 10 '18 at 14:43
  • @PatrickQ I have added a dump in the initial post as it was too big to put here. As you can see the selector is there. Even if I try to select only .s-price for example, it doesn't return anything with ->find – Laurent Aug 10 '18 at 14:44
  • Kip is correct. Simple HTML DOM does not support the immediate child selector (`>`). See [this question](https://stackoverflow.com/questions/31977488/simple-html-dom-child-selectors-css) and also [this one](https://stackoverflow.com/questions/26338700/not-able-to-retrieve-direct-child-elements-using-simple-html-dom) – Patrick Q Aug 10 '18 at 14:46
  • [The regex they are using to break up the selector](https://sourceforge.net/p/simplehtmldom/code/HEAD/tree/trunk/simple_html_dom.php#l689) doesn't match `>` at all. – Sean Bright Aug 10 '18 at 14:51
  • @kip `div.product__sales-information > div > strong` works (on another page), which would mean `>` is supported, but somehow it doesn't work all the time – Laurent Aug 10 '18 at 14:54
  • @Laurent It may _appear_ to work, but that's all going to depend on what the source HTML is. The fact is, that selector _is not supported_. On the pages where it appears to work, I bet you get the same result if you just use a space instead of `>`. – Patrick Q Aug 10 '18 at 14:56
  • I prefer this [library](https://github.com/ThomasWeinert/FluentDOM), which supports CSS 2.1 and CSS3 selectors, though it may be more complicated and heavier. Try with a space, as @PatrickQ said. – kip Aug 10 '18 at 14:59
  • @PatrickQ I tried what you suggested (if I understood you correctly), replacing every `>` with a space, but the problem remains. I also replaced the long selector with a single class, and that didn't work either. It's as if something in this HTML prevents the parsing. – Laurent Aug 10 '18 at 15:03
  • You did not understand me well. My point is that when you say "which would mean `>` is supported but somehow it doesn't work all the time", you are incorrect. What you are actually seeing is the behavior that you get when using a space, but the source HTML probably only has _one_ child at that level, which produces the same behavior in both cases. I am not saying that replacing the `>` with spaces will give you the result _that you want_. I am saying that they will produce the _same_ results, which is _proof_ that the `>` selector _is not supported_. – Patrick Q Aug 10 '18 at 15:08
  • @PatrickQ Understood, thanks for the explanation. The other example was maybe pure luck. Does the hierarchy need to be strictly followed, or can I leave gaps in the list of children? So far I have followed it strictly, but it would make my life easier if I didn't have to. – Laurent Aug 10 '18 at 15:26
  • I suggest that you read the comments and answers in the questions that I linked above, as well as review other results that you get when you google something like "php simplehtmldom child selector". There's really nothing I can add that's not already mentioned in them. I also suggest you check out the parser library suggested above by Kip, as Simple HTML DOM just plain might not be the right tool to use for you. – Patrick Q Aug 10 '18 at 15:33
  • [This lib](https://github.com/monkeysuffrage/advanced_html_dom) is a replacement that supports all of those things – pguardiario Aug 10 '18 at 22:52
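
To illustrate the child versus descendant distinction discussed in the comments above, here is a minimal sketch. It uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, so it swaps in a different parser and says nothing about how Simple HTML DOM behaves internally. In XPath, `/` selects direct children and `//` selects descendants at any depth, which mirrors CSS `>` versus a space. The HTML fragment is a trimmed-down version of the dump in the question, and the `hasClass()` helper is hypothetical, introduced only for this example.

```php
<?php
// Trimmed-down fragment modelled on the dump in the question (assumption:
// the real page has more wrappers; only the price branch is kept here).
$fragment = <<<'HTML'
<div class="s-item-container">
  <div class="a-row">
    <div class="a-column a-span7">
      <div class="a-row a-spacing-none">
        <a class="a-link-normal a-text-normal" href="#">
          <span class="a-size-small a-color-secondary"></span>
          <span class="a-size-base a-color-price s-price a-text-bold">EUR 237,36</span>
        </a>
      </div>
    </div>
  </div>
</div>
HTML;

// Hypothetical helper: builds an XPath predicate that tests for one CSS class.
function hasClass(string $name): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($fragment);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$container = '//div[' . hasClass('s-item-container') . ']';

// Descendant match, like CSS ".s-item-container .s-price".
$descendant = $xpath->query($container . '//span[' . hasClass('s-price') . ']');

// Direct-child match, like CSS ".s-item-container > .s-price".
$child = $xpath->query($container . '/span[' . hasClass('s-price') . ']');

// Strict chain, like CSS "div.a-row.a-spacing-none > a > span.s-price".
$strict = $xpath->query(
    '//div[' . hasClass('a-row') . ' and ' . hasClass('a-spacing-none') . ']'
    . '/a/span[' . hasClass('s-price') . ']'
);

$prices = [];
foreach ($strict as $node) {
    $prices[] = trim($node->textContent);
}

echo $descendant->length, "\n"; // 1: the span is nested somewhere below the container
echo $child->length, "\n";      // 0: the span is not a direct child of the container
echo implode(', ', $prices), "\n";
```

The descendant query finds the price span while the direct-child query does not, which is the difference a parser has to implement to support `>`. Whether the strict chain matches on a real Amazon page depends on the HTML that PHP actually fetches, which may differ from what the browser renders.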
