Since the update below, Puppeteer has deprecated pierce/ and now offers the >>> and >>>> combinators for piercing shadow roots (deep descendant and deep child, respectively), so the code becomes:
const hrefs = await page.$$eval(
  ">>> a[href]",
  els => els.map(e => e.getAttribute("href"))
);
Since the original post, Puppeteer has introduced an easier way to traverse the shadow DOM: pierce/. You can replace the page.evaluate call in the example below with the simpler:
const hrefs = await page.$$eval(
  "pierce/a[href]",
  els => els.map(e => e.getAttribute("href"))
);
For posterity, note that text/ also pierces open shadow roots.
As mentioned in the comments, you're dealing with a page that uses shadow roots. Traditional selectors that attempt to pierce shadow roots won't work through the console or Puppeteer without help. Short of using a library, the idea is to identify any shadow root elements by their .shadowRoot property, then dive into them recursively and repeat the process until you get the data you're after.
This code should grab all of the hrefs on the page (I didn't do a manual count) following this strategy:
const puppeteer = require("puppeteer"); // ^19.11.1

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const url = "https://intersight.com/help/";
  await page.goto(url, {
    waitUntil: "networkidle0"
  });
  // Wait for the top-level custom element before querying inside it
  await page.waitForSelector("an-hulk", {visible: true});
  const hrefs = await page.evaluate(() => {
    // Collect hrefs at this level, then recurse into every shadow root
    const walk = root => [
      ...[...root.querySelectorAll("a[href]")]
        .map(e => e.getAttribute("href")),
      ...[...root.querySelectorAll("*")]
        .filter(e => e.shadowRoot)
        .flatMap(e => walk(e.shadowRoot))
    ];
    return walk(document);
  });
  console.log(hrefs);
  console.log(hrefs.length); // => 45 at the time I ran this

  // Bonus example of diving manually into shadow roots...
  //const html = await page.evaluate(() =>
  //  document
  //    .querySelector("#app > an-hulk")
  //    .shadowRoot
  //    .querySelector("#content")
  //    .shadowRoot
  //    .querySelector("#main an-hulk-home")
  //    .shadowRoot
  //    .querySelector(".content")
  //    .innerHTML
  //);
  //console.log(html);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Note that the sidebar and other parts of the page use event listeners on spans and divs to implement links, so these don't count as hrefs as far as the above code is concerned. If you want to access these URLs, there are a variety of strategies you can try, including clicking them and extracting the URL after navigation. This is speculative since it's not clear that you want to do this.
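As a rough sketch of the click-and-read-the-URL strategy, something like the following could work. The helper name urlAfterClick is hypothetical, and it assumes the click triggers a navigation that page.waitForNavigation can detect; if the site does something else on click, the wait will time out. How to select the clickable spans and divs is left open, so the helper simply takes an element handle.

```javascript
// Hypothetical helper: clicks an element and reports the URL the page ends
// up at. `page` is a Puppeteer Page and `handle` is an ElementHandle for one
// of the clickable spans or divs.
const urlAfterClick = async (page, handle) => {
  // Start waiting before the click resolves so a fast navigation isn't missed.
  await Promise.all([page.waitForNavigation(), handle.click()]);
  return page.url();
};
```

You'd likely need to navigate back (or reload the starting page) and re-query the element before trying the next one, since handles don't survive a navigation.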
A few remarks about your code:
- The thread "Puppeteer wait until page is completely loaded" is an important resource. { waitUntil: 'domcontentloaded' } is a weaker condition than { waitUntil: 'networkidle0' }. Use page.waitForSelector(selector, {visible: true}) and page.waitForFunction(predicate) to ensure the elements have been rendered before you begin manipulating them. Even without the shadow root, it's not clear to me that the top-level "an-hulk" is going to be available by the time you run evaluate.
- Add console listeners to your page to help debug. Try your queries one step at a time and break them into multiple stages to see where they go wrong.
- fs.writeFile should be await fs.promises.writeFile since you're in an async function.
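The remarks above could be sketched roughly as follows. The helper names (attachDebugListeners, waitForShadowContent, saveHrefs) are hypothetical, and the selectors in the wait predicate are borrowed from the bonus example in the script above, so treat them as assumptions about the page rather than something guaranteed to hold.

```javascript
const fs = require("fs");

// Hypothetical helper: forwards browser-side diagnostics to the Node console.
// "console", "pageerror" and "requestfailed" are Puppeteer page events.
const attachDebugListeners = (page, log = console.log) => {
  page.on("console", msg => log(`[console.${msg.type()}] ${msg.text()}`));
  page.on("pageerror", err => log(`[pageerror] ${err.message}`));
  page.on("requestfailed", req =>
    log(`[requestfailed] ${req.url()} ${req.failure()?.errorText ?? ""}`)
  );
};

// Hypothetical helper: waits until the outer component has rendered content
// inside its shadow root. The selectors are assumptions taken from the bonus
// example; the predicate runs in the browser and must return a truthy value.
const waitForShadowContent = page =>
  page.waitForFunction(() =>
    document
      .querySelector("#app > an-hulk")
      ?.shadowRoot
      ?.querySelector("#content")
  );

// Hypothetical helper: awaiting fs.promises.writeFile means the file has been
// written (or the error has been thrown) before the caller continues.
const saveHrefs = async (file, hrefs) => {
  await fs.promises.writeFile(file, JSON.stringify(hrefs, null, 2));
};
```

Inside the async function, you'd call attachDebugListeners(page) before page.goto, await waitForShadowContent(page) after it, and finish with something like await saveHrefs("hrefs.json", hrefs), where the file name is just a placeholder.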
Additional resources and similar threads: