1

I am trying to download the HTML for https://intersight.com/help/, but Puppeteer is not returning the HTML with the hrefs we can see on the page (for example, https://intersight.com/help/getting_started is not present in the downloaded HTML). On inspecting the HTML in the browser, I noticed that all of the missing HTML is inside <an-hulk></an-hulk> tags. I don't know what these tags mean.

const puppeteer = require('puppeteer');
const fs = require('fs');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const data = await page.goto('https://intersight.com/help/', { waitUntil: 'domcontentloaded' });
  // Tried all of the below lines; none worked
  // await page.waitForSelector('.helplet-links')
  // document.querySelector("#app > an-hulk").shadowRoot.querySelector("#content").shadowRoot.querySelector("#main > div > div > div > an-hulk-home").shadowRoot.querySelector("div > div > div:nth-child(1) > div:nth-child(1) > div.helplet-links > ul > li:nth-child(1) > a > span")
  // await page.evaluateHandle(`document.querySelector("#app > an-hulk").shadowRoot.querySelector("#content").shadowRoot.querySelector("#main > div > div > div > an-hulk-home")`);
  await page.evaluateHandle(`document.querySelector("an-hulk").shadowRoot.querySelector("#aside").shadowRoot.querySelectorAll(".item")`)
  const result = await page.content()
  fs.writeFile('./intersight.html', result, (err) => {
    if (err) console.log(err)
    else console.log('done!!')
  })
  // console.log(result)
  await browser.close();
})();
Rajat
  • Start by using `page.goto('https://intersight.com/help/', {waitUntil: "networkidle0"});`, then try ideas in the linked thread. You can wait for that element or text specifically, for example, with `page.waitForSelector` or spin/poll on an arbitrary predicate with `waitForFunction` that might, for example, count the children of the list until it matches your expectation. Then call `page.content`. – ggorlen Jul 26 '21 at 15:55
  • @ggorlen I have tried with networkidle0 , networkidle2, domcontentloaded, none seems to work also I have used query selector as suggested here [link](https://github.com/puppeteer/puppeteer/issues/7435). This also failed. – Rajat Jul 27 '21 at 04:31
  • It looks like these are [shadow roots](https://stackoverflow.com/questions/34119639/what-is-shadow-root). If you want to get the entire HTML, you'll probably have to traverse into the shadow roots and connect each one with its parent to reconstruct the HTML. Do you really want the whole thing verbatim or are you just looking for certain pieces of data? `document.querySelector("an-hulk").shadowRoot.querySelector("#aside").shadowRoot.querySelectorAll(".item")` gets the menu data, for example, so accessing `.shadowRoot` recursively seems like the way for a full DOM traversal, or use a lib. – ggorlen Jul 27 '21 at 05:12
  • See also [How to get text from shadow root element?](https://stackoverflow.com/questions/63573594/how-to-get-text-from-shadow-root-element) and [Puppeteer: get full HTML content of a webpage, like innerHTML, but including any shadow roots?](https://stackoverflow.com/questions/65826064/puppeteer-get-full-html-content-of-a-webpage-like-innerhtml-but-including-any) – ggorlen Jul 27 '21 at 05:20
  • @ggorlen updated code. – Rajat Jul 27 '21 at 07:03
  • Thanks. Would you mind specifying what data you're hoping to get, exactly? – ggorlen Jul 27 '21 at 07:04
  • All the hrefs in the page. – Rajat Jul 27 '21 at 07:07

1 Answer

2

Since the update below, Puppeteer has deprecated pierce/ and now offers the >>> and >>>> combinators for piercing shadow roots (deep descendants and immediate children, respectively), so the code becomes

const hrefs = await page.$$eval(
  ">>> a[href]",
  els => els.map(e => e.getAttribute("href"))
);
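
In context, a minimal self-contained version might look like this (a sketch assuming a recent Puppeteer release with combinator support):

const puppeteer = require("puppeteer");

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.goto("https://intersight.com/help/", {waitUntil: "networkidle0"});
  await page.waitForSelector(">>> a[href]"); // wait for at least one link to render
  const hrefs = await page.$$eval(
    ">>> a[href]",
    els => els.map(e => e.getAttribute("href"))
  );
  console.log(hrefs);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());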

Since the original post, Puppeteer has introduced an easier way to traverse the shadow DOM: pierce/. You can replace the page.evaluate call in the example below with the simpler:

const hrefs = await page.$$eval(
  "pierce/a[href]",
  els => els.map(e => e.getAttribute("href"))
);

For posterity, note that text/ also pierces open shadow roots.
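
For example, here's a hypothetical text/ query ("Getting Started" is assumed to be visible link text on the page; adjust as needed):

// text/ matches the smallest element containing the text, piercing open shadow roots
const link = await page.waitForSelector("text/Getting Started");
const href = await link.evaluate(el => el.closest("a")?.getAttribute("href"));
console.log(href);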


As mentioned in the comments, you're dealing with a page that uses shadow roots. Ordinary CSS selectors can't pierce shadow roots, whether run in the browser console or through Puppeteer, without extra help. Short of using a library, the idea is to identify shadow host elements by their .shadowRoot property, then dive into each root recursively, repeating the process until you reach the data you're after.

This code should grab all of the hrefs on the page (I didn't do a manual count) following this strategy:

const puppeteer = require("puppeteer"); // ^19.11.1

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const url = "https://intersight.com/help/";
  await page.goto(url, {
    waitUntil: "networkidle0"
  });
  await page.waitForSelector("an-hulk", {visible: true});
  const hrefs = await page.evaluate(() => {
    const walk = root => [
      ...[...root.querySelectorAll("a[href]")]
        .map(e => e.getAttribute("href")),
      ...[...root.querySelectorAll("*")]
        .filter(e => e.shadowRoot)
        .flatMap(e => walk(e.shadowRoot))
    ];
    return walk(document);
  });
  console.log(hrefs);
  console.log(hrefs.length); // => 45 at the time I ran this

  // Bonus example of diving manually into shadow roots...
  //const html = await page.evaluate(() =>
  //  document
  //    .querySelector("#app > an-hulk")
  //    .shadowRoot
  //    .querySelector("#content")
  //    .shadowRoot
  //    .querySelector("#main an-hulk-home")
  //    .shadowRoot
  //    .querySelector(".content")
  //    .innerHTML
  //);
  //console.log(html);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Note that the sidebar and other parts of the page implement links with event listeners on spans and divs, so these don't count as hrefs as far as the above code is concerned. If you want those URLs, there are a variety of strategies you can try, including clicking the elements and extracting the URL after navigation. This is speculative, since it's not clear you actually need to do this.
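
A speculative sketch of that click-and-check approach (the ">>> .item" selector is borrowed from the shadow DOM path explored in the question and may need adjusting):

const item = await page.waitForSelector(">>> .item");
const before = page.url();
await item.click();
// wait until the SPA updates the URL, then read it
await page.waitForFunction(old => location.href !== old, {}, before);
console.log(page.url());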


A few remarks about your code:

  • "Puppeteer wait until page is completely loaded" is an important resource. { waitUntil: 'domcontentloaded' } is a weaker condition than { waitUntil: 'networkidle0' }. page.waitForSelector(selector, {visible: true}) and page.waitForFunction(predicate) help ensure elements have rendered before you begin manipulating them. Even without the shadow root, it's not clear to me that the top-level an-hulk element would be available by the time you run evaluate.
  • Add console listeners to your page to help debug. Try your queries one step at a time and break them into multiple stages to see where they go wrong (see the sketch after this list).
  • fs.writeFile should be await fs.promises.writeFile since you're in an async function.
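
Minimal sketches for the last two points, assuming they run inside the same async IIFE as the main example:

// surface in-page console output and errors while debugging
page.on("console", msg => console.log("[page]", msg.text()));
page.on("pageerror", err => console.error("[page error]", err));

// promise-based write that can be awaited
const fs = require("node:fs/promises");
await fs.writeFile("./intersight.html", await page.content());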

Additional resources and similar threads:

  • [How to get text from shadow root element?](https://stackoverflow.com/questions/63573594/how-to-get-text-from-shadow-root-element)
  • [Puppeteer: get full HTML content of a webpage, like innerHTML, but including any shadow roots?](https://stackoverflow.com/questions/65826064/puppeteer-get-full-html-content-of-a-webpage-like-innerhtml-but-including-any)

ggorlen
  • Thanks for the detailed answer. Just a query: is there a way I can get all the shadow elements without using querySelector? Basically, I want the fully rendered HTML of the page with all the elements. – Rajat Jul 27 '21 at 08:17
  • It's a single page app (i.e. mostly JS), so even if you were to get all of the HTML, it doesn't seem likely that you'd be able to do anything useful with it. Your comment said you wanted the hrefs so I'm not really sure what you're trying to do. Why do you want all of the HTML, exactly? There's probably a better approach than trying to rebuild the whole site. – ggorlen Jul 27 '21 at 08:24
  • Actually, the thing is I have code that filters a[href]s from HTML. I want to make my code generic, so that if a page contains shadow roots it will still work with the existing code. – Rajat Jul 27 '21 at 08:47
  • It does. But my current code is used for crawling, and it must be generic so that it works for all websites. The current code depends on a query selector that will fail on other pages. Can we find a way to make it generic? – Rajat Jul 28 '21 at 05:08
  • It's important to clearly explain all requirements up front. Otherwise it's just moving the goalpost, which makes it very difficult to provide an answer without a lot of back and forth and revisions. That said, this is as generic as I can offer other than the largely cosmetic `await page.waitForSelector("an-hulk", {visible: true});`, but the problem with web scraping is that it's almost impossible to write purely generic code. There are so many edge cases that you can't claim something works on "all the websites". I'm still not really sure what your actual use case is. – ggorlen Jul 28 '21 at 05:18
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/235370/discussion-between-rajat-and-ggorlen). – Rajat Jul 28 '21 at 06:53