I'm a bit of a noob to coding so sorry if this is a dumb question, but I'm trying to write a general purpose scraper for getting some product data using the "schema.org/Product" HTML microdata.

However, I ran into an issue when testing (on this page in particular, where the name was being set to "Electronics" from the Breadcrumbs schema), because some of the matched elements had ancestor elements with different itemtypes/schemas.

I first have this variable declared to check if the page has an element using the Product schema microdata.

var productMicrodata = document.querySelector('[itemscope][itemtype="https://schema.org/Product"], [itemscope][itemtype="http://schema.org/Product"]');

I then wanted to select all elements with the itemprop attribute, e.g.

productMicrodata.querySelectorAll('[itemprop]');

The issue however is that I want to ignore any elements that have other ancestors with different itemtypes/schema attributes, as in this instance the Breadcrumbs and ListItem schema data is still being included.

I figured I would then just be able to do something like this:

productMicrodata.querySelectorAll(':not([itemscope]) [itemprop]');

However, this still returns matches for descendant elements whose ancestors have their own itemscope attributes (e.g. the breadcrumbs).

I'm sure I'm just missing something super obvious, but any help on how I can achieve only selecting elements that have only the one ancestor with itemtype="http://schema.org/Product" attribute would be much appreciated.

EDIT: To clarify which element(s) I'm trying to avoid matching, here's what the DOM looks like on the example page linked. I'm trying to ignore the elements that have any ancestors with a different itemtype attribute.

EDIT 2: changed incorrect use of parent to ancestor. Apologies, I am still new to this :|

EDIT 4/SOLUTION: I've found a non-CSS solution for what I'm trying to achieve using the JavaScript Element.closest() method, e.g.

let productMicrodata = document.querySelectorAll('[itemprop]');
let itemProp = {};
const productTypes = ["http://schema.org/Product", "https://schema.org/Product"];

for (let i = 0; i < productMicrodata.length; i++) {
    // closest() checks the element itself first, then its ancestors; it returns null if nothing has an itemtype
    const scope = productMicrodata[i].closest('[itemtype]');
    if (scope && productTypes.includes(scope.getAttribute('itemtype'))) {
        itemProp[productMicrodata[i].getAttribute('itemprop')] = productMicrodata[i].textContent;
    }
}

console.log(itemProp);
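
A variation on the same idea, as a rough sketch rather than a definitive implementation: instead of comparing itemtype strings, walk up from each itemprop element to the nearest ancestor with itemscope and keep the element only if that ancestor is the Product element itself. The names nearestScope, ownProperties and fakeElement are hypothetical, and the fake elements are only stand-ins for DOM nodes so the snippet can run outside a browser. It also assumes the page does not rely on itemref.

```javascript
// Hypothetical helper: nearest ancestor that starts a microdata item.
// Only `parentElement` and `hasAttribute` are used, so real DOM elements work too.
function nearestScope(el) {
  let node = el.parentElement;
  while (node && !node.hasAttribute('itemscope')) {
    node = node.parentElement;
  }
  return node; // null when no ancestor has itemscope
}

// Hypothetical helper: keep only the candidates whose nearest scope is `root`.
function ownProperties(root, candidates) {
  return Array.from(candidates).filter((el) => nearestScope(el) === root);
}

// Stand-in for a DOM element (assumption: just enough of the interface).
function fakeElement(attrs, parentElement = null) {
  return {
    parentElement,
    hasAttribute: (attr) => attr in attrs,
    getAttribute: (attr) => (attr in attrs ? attrs[attr] : null),
  };
}

// Product > name, and Product > BreadcrumbList > name
const product = fakeElement({ itemscope: '', itemtype: 'http://schema.org/Product' });
const productName = fakeElement({ itemprop: 'name' }, product);
const crumbs = fakeElement(
  { itemscope: '', itemtype: 'http://schema.org/BreadcrumbList' },
  product
);
const crumbName = fakeElement({ itemprop: 'name' }, crumbs);

const owned = ownProperties(product, [productName, crumbName]);
console.log(owned.length); // 1
console.log(owned[0] === productName); // true
```

In a browser the call would look like ownProperties(productMicrodata, productMicrodata.querySelectorAll('[itemprop]')), where productMicrodata is the single Product element from the first snippet in this post.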

[Screenshot: itemprop elements with itemtype parent attributes]

WabiSabi

2 Answers


:not([itemscope]) [itemprop] means:

An element with an itemprop attribute that has at least one ancestor without an itemscope attribute.

So:

<div>
    <div itemscope>
        <div itemprop> <!-- this one -->
        </div>
    </div>
</div>

… would match because while the parent element has the attribute, the grandparent does not.

You need to use the child combinator to eliminate elements with matching parent elements:

:not([itemscope]) > [itemprop]
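
To make the difference between the two selectors above concrete, here is a minimal sketch that spells out each one as a plain JavaScript predicate, next to the stricter "no ancestor at all" check from the question. The function names and the fakeElement stand-in are hypothetical and exist only so the snippet runs without a browser.

```javascript
// Collect every ancestor of `el`, nearest first.
// Only `parentElement` is used, so real DOM elements work as well.
function ancestorsOf(el) {
  const result = [];
  for (let node = el.parentElement; node; node = node.parentElement) {
    result.push(node);
  }
  return result;
}

// ':not([itemscope]) [itemprop]': at least ONE ancestor lacks itemscope.
function someAncestorLacksItemscope(el) {
  return ancestorsOf(el).some((a) => !a.hasAttribute('itemscope'));
}

// ':not([itemscope]) > [itemprop]': only the PARENT is checked.
function parentLacksItemscope(el) {
  const parent = el.parentElement;
  return parent !== null && !parent.hasAttribute('itemscope');
}

// What the question asks for: NO ancestor has itemscope.
function noAncestorHasItemscope(el) {
  return ancestorsOf(el).every((a) => !a.hasAttribute('itemscope'));
}

// Stand-in for a DOM element (assumption: just enough of the interface).
function fakeElement(attrs, parentElement = null) {
  return { parentElement, hasAttribute: (attr) => attrs.includes(attr) };
}

// <body> <div itemscope> <span> <b itemprop>
const body = fakeElement([]);
const scope = fakeElement(['itemscope'], body);
const span = fakeElement([], scope);
const prop = fakeElement(['itemprop'], span);

console.log(someAncestorLacksItemscope(prop)); // true
console.log(parentLacksItemscope(prop)); // true
console.log(noAncestorHasItemscope(prop)); // false
```

Both selector-style checks accept the element even though it sits inside an itemscope. Only the last predicate rejects it, and that check needs a walk over all ancestors.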
Quentin
  • Thanks so much for the quick response, although the selector you've provided doesn't really address my problems. Apologies as I don't think I've explained well, I've added a screenshot for more clarity. But basically I'm just trying to match any element with an `[itemprop]` attribute, as long as it doesn't have any parents (direct or non-direct) that have an `[itemscope]` attribute if that makes sense. – WabiSabi Mar 10 '20 at 10:38
  • @WabiSabi: There's no such thing as a "non-direct parent". An element has one parent. If you keep going up then those are ancestors. (The parent is also an ancestor). – Quentin Mar 10 '20 at 10:40
  • Ah yes thanks for the clarification, have updated the post to replace parent with ancestor as that is exactly what I meant. Apologies, still quite new to all this. – WabiSabi Mar 10 '20 at 10:49
  • Just found one of your posts [here](https://stackoverflow.com/a/54331991/10053617) from a while back that I think addresses what I'm trying to do. Is it still the case that this isn't possible with CSS selectors? – WabiSabi Mar 10 '20 at 10:56

[...] help on how I can achieve only selecting elements that have only the itemtype="http://schema.org/Product" attribute would be much appreciated.

Attribute selectors can take explicit values:

[myAttribute="myValue"]

So the syntax for this would be:

document.querySelectorAll('[itemtype="http://schema.org/Product"]');
Rounin
  • Thanks so much for your answer. I think you've missed the issue I'm having however. Please see the edited post and comment on @quentin 's response for as to what I'm trying to achieve. To copy and paste from that comment, " I'm just trying to match any element with an [itemprop] attribute, as long as it doesn't have any parents (direct or non-direct) that have an [itemscope] attribute" – WabiSabi Mar 10 '20 at 10:41