Puppeteer: get full HTML content of a webpage, like innerHTML, but including any shadow roots?

Question

When browsing a page in Puppeteer, I can usually get the full HTML content as text like this:

var content = await page.evaluate( 
  () => document.querySelector('body').innerHTML );

However, I'm currently dealing with a situation where there are multiple nested shadow roots. So I assume I'll have to traverse the entire DOM, check each node for an available .shadowRoot, and traverse those DOMs separately.

Is there a shortcut or simpler way to do this? Like an innerHTML variant that includes any shadow root DOMs?

Just FYI: if `attachShadow({mode: 'closed'})` was used, `.shadowRoot` won't work either — Andrea Giammarchi, Jan 21 '21 at 11:14
@AndreaGiammarchi Thanks, don't know yet if that's the case in my particular situation. Actually this whole shadowRoot business is fairly new to me. But in case `mode:'closed'` is used, is there another way to get the HTML content? When I create a screenshot from Puppeteer, all content is there. So one way or another it must have the corresponding DOM objects in there. — RocketNuts, Jan 21 '21 at 11:23
devtools *has* that privilege, but I don't know if it's exported/available in Puppeteer — Andrea Giammarchi, Jan 21 '21 at 11:30

score 1 · Answer 1 · answered May 16 '23 at 20:32

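You could try recursively walking the DOM tree and replacing each shadow host's innerHTML with the serialized contents of its shadow root. Note that this overwrites the host's light-DOM children in the live page, and it only reaches open shadow roots. Rough example: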

const puppeteer = require("puppeteer"); // ^20.2.0

const html = `<!DOCTYPE html><html><body>
  <h1>hey</h1>
  <div></div>
  <h2>ok</h2>
<script>
const el = document.querySelector("div");
const root = el.attachShadow({mode: "open"});
root.innerHTML = \`
  <h1>foo</h1>
  <h1>bar</h1>
  <h1>baz</h1>
\`;
</script>
</body></html>`;

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setContent(html);
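  const outHtml = await page.evaluate(() => {
    // Recursively copy each open shadow root's HTML into its host element.
    // Note: this overwrites the host's light-DOM children in the live page.
    const walk = doc => {
      // querySelectorAll returns a static list and does not pierce shadow roots
      doc.querySelectorAll("*").forEach(e => {
        if (e.shadowRoot) {
          e.innerHTML = walk(e.shadowRoot);
        }
      });
      return doc.innerHTML;
    };
    return walk(document.body);
  });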
  console.log(outHtml);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
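
If mutating the page is a concern, a non-destructive variant of the same walk is possible. The following is only a sketch under a few assumptions of mine: the function name `innerHTMLWithShadow` is hypothetical, I chose to emit each open shadow root as a declarative shadow DOM `<template shadowrootmode="open">` so that light-DOM and shadow content both survive, and the escaping only approximates what `innerHTML` does. It ignores the contents of `<template>` elements and, like `.shadowRoot`, cannot see closed roots.

```javascript
// Sketch: serialize a node's children, inlining open shadow roots as
// <template shadowrootmode="open">, without changing the page.
// Helpers live inside the function so it can be stringified for page.evaluate().
function innerHTMLWithShadow(root) {
  const VOID_TAGS = new Set(["area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr"]);
  const RAW_TEXT_TAGS = new Set(["script", "style"]);
  const escapeText = s =>
    s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
  const escapeAttr = s => s.replace(/&/g, "&amp;").replace(/"/g, "&quot;");

  // Serialize all child nodes; text inside <script>/<style> is emitted unescaped.
  const serializeChildren = (parent, raw = false) =>
    Array.from(parent.childNodes, child =>
      raw && child.nodeType === 3 ? child.nodeValue : serialize(child)).join("");

  const serialize = node => {
    if (node.nodeType === 3) return escapeText(node.nodeValue);  // text node
    if (node.nodeType === 8) return `<!--${node.nodeValue}-->`;  // comment node
    if (node.nodeType !== 1) return serializeChildren(node);     // fragment or shadow root
    const tag = node.localName;
    const attrs = Array.from(node.attributes,
      a => ` ${a.name}="${escapeAttr(a.value)}"`).join("");
    if (VOID_TAGS.has(tag)) return `<${tag}${attrs}>`;
    // element.shadowRoot is null for closed roots, so only open ones show up here.
    const shadow = node.shadowRoot
      ? `<template shadowrootmode="open">${serializeChildren(node.shadowRoot)}</template>`
      : "";
    const children = serializeChildren(node, RAW_TEXT_TAGS.has(tag));
    return `<${tag}${attrs}>${shadow}${children}</${tag}>`;
  };

  return serializeChildren(root);
}

// Possible Puppeteer usage (the function is self-contained, so it can be
// passed to the page as a string expression):
//   const html = await page.evaluate(`(${innerHTMLWithShadow})(document.body)`);
```

Nested shadow roots are handled by the recursion: each host's shadow content is serialized before its light-DOM children, which is the order declarative shadow DOM expects.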

Related: Puppeteer not giving accurate HTML code for page with shadow roots.
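
Regarding the closed-root case raised in the comments: DevTools can see those roots, and Puppeteer can talk to the same protocol through a CDP session. As far as I know, `DOM.getDocument` with `pierce: true` returns shadow roots in each node's `shadowRoots` array, tagged with a `shadowRootType` of `open`, `closed` or `user-agent`. The sketch below turns such a node tree into HTML. The function name `cdpNodeToHtml` and the `<template shadowrootmode>` output format are my own assumptions, it skips iframes and pseudo-elements, and the output need not match `innerHTML` byte for byte.

```javascript
// Sketch: serialize a Chrome DevTools Protocol DOM.Node tree to HTML,
// inlining open and closed shadow roots as <template shadowrootmode="...">.
const CDP_VOID_TAGS = new Set(["area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr"]);
const cdpEscapeText = s =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
const cdpEscapeAttr = s => s.replace(/&/g, "&amp;").replace(/"/g, "&quot;");

function cdpNodeToHtml(node) {
  const kids = n => (n.children || []).map(c => cdpNodeToHtml(c)).join("");
  switch (node.nodeType) {
    case 3: // text node
      return cdpEscapeText(node.nodeValue);
    case 8: // comment node
      return `<!--${node.nodeValue}-->`;
    case 1: { // element node
      const tag = node.localName;
      // CDP delivers attributes as a flat [name1, value1, name2, value2, ...] array
      const flat = node.attributes || [];
      let attrs = "";
      for (let i = 0; i < flat.length; i += 2) {
        attrs += ` ${flat[i]}="${cdpEscapeAttr(flat[i + 1])}"`;
      }
      if (CDP_VOID_TAGS.has(tag)) return `<${tag}${attrs}>`;
      // Skip browser-internal (user-agent) roots such as those of <input> or <video>
      const shadows = (node.shadowRoots || [])
        .filter(r => r.shadowRootType !== "user-agent")
        .map(r => `<template shadowrootmode="${r.shadowRootType}">${kids(r)}</template>`)
        .join("");
      // Text inside <script>/<style> is emitted unescaped
      const raw = tag === "script" || tag === "style";
      const children = (node.children || [])
        .map(c => (raw && c.nodeType === 3 ? c.nodeValue : cdpNodeToHtml(c)))
        .join("");
      return `<${tag}${attrs}>${shadows}${children}</${tag}>`;
    }
    default: // document, doctype, fragment: children only
      return kids(node);
  }
}

// Possible Puppeteer usage (assumes an open `page`):
//   const client = await page.target().createCDPSession();
//   const { root } = await client.send("DOM.getDocument", { depth: -1, pierce: true });
//   console.log(cdpNodeToHtml(root));
```

Because the tree is fetched over the protocol and serialized in Node, nothing in the page itself is modified.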
