Strip DOM elements in Puppeteer, without using CSS selectors?

Question

I want to strip some elements and comments from the DOM within Puppeteer. These items do not have identifiable IDs, classes, or attributes which I can select using CSS. However, they may be identified by internal strings, and some elements may be wrapped in human-readable comments. My attempts so far:

Using CSS selectors does not seem possible, since these elements have no ID, class, or attribute to match on, and there is no CSS contains() selector. So I tried to do it with XPath...
Some elements may be selected (and potentially removed?) using XPath, but I'm a rookie with both Puppeteer and XPath. I have provided my aborted attempt below.
I might instead use a regular expression, but I don't know how to remove strings from the DOM after its HTML has been parsed.

Any ideas? Thanks.

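So, in the following example, I would like to delete the elements between the `<!-- DELETE ME BEGIN -->` and `<!-- DELETE ME END -->` comments, as well as the `<!-- DELETE ME -->` and `<!-- DELETE ME TOO -->` comments at the end: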

    <html>
      <head>

        <!-- DELETE ME BEGIN -->
        <script>
          // delete me
          console.log('delete me')
        </script>
        <!-- DELETE ME END -->

        <title>Page Title</title>
      </head>

      <body>
        
        <!-- DELETE ME BEGIN -->
        <style>
          body {
            /* delete me */
            color: red;
          }
        </style>
        <script>
          // delete me
          console.log('delete me')
        </script>
        <!-- DELETE ME END-->

        <style>
          body {
            /* keep me  */
            color: green;
          }
        </style>

        <script>
          // keep me
          console.log("keep me")
        </script>
        <p>Keep me</p>
        <!-- keep me -->

      </body>
    </html>

    <!-- DELETE ME -->
    <!-- DELETE ME TOO -->

Puppeteer/XPath code (just an attempt, does not yet do anything):

    const browser = await puppeteer.launch();
    
    const page = await browser.newPage();
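    // use the public ConsoleMessage accessors instead of the private _type/_text fields
    page.on("console", (msg) => console.log(`[${msg.type()}] ${msg.text()}`));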

    const html = await page.evaluate(() => {
      var evaluator = new XPathEvaluator();
      var result = evaluator.evaluate(
        "//script[contains(.,'delete me')]",
        document,
        null,
        XPathResult.ANY_TYPE
      );

      console.log(result);

      return document.documentElement.outerHTML;
    });

    await browser.close();

score 1 · Accepted Answer · answered Jan 05 '21 at 18:10

Your XPath looks correct. Puppeteer provides the `page.$x(expression)` method to run an XPath expression:

    const browser = await puppeteer.launch();

    const page = await browser.newPage();
    await page.goto('https://storm-bald-meteorology.glitch.me');

    const xs = await page.$x("//script[contains(.,'delete me')]");
    console.log(xs.length);
    for (const x of xs) {
      const txt = await page.evaluate(el => el.innerText, x);
      console.log(txt);
    }

    await browser.close();

You can copy and paste this code into the Puppeteer playground to try it. I have also put your HTML on Glitch.
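
If the search strings are not fixed and may contain quotes, keep in mind that XPath 1.0 string literals have no escape character. Below is a minimal sketch of one way to build the `contains()` expression safely. The names `xpathLiteral` and `containsXPath` are hypothetical helpers made up for illustration; they are not part of Puppeteer.

```javascript
// Hypothetical helper: turn an arbitrary string into an XPath 1.0 string literal.
function xpathLiteral(text) {
  if (!text.includes("'")) return `'${text}'`;
  if (!text.includes('"')) return `"${text}"`;
  // Both quote types are present: split on single quotes and rejoin the
  // pieces with concat(), inserting a double-quoted single quote between them.
  const separator = `, "'", `;
  const parts = text.split("'").map((part) => `'${part}'`);
  return `concat(${parts.join(separator)})`;
}

// Hypothetical helper: nodeTest is e.g. "script", "style" or "comment()".
function containsXPath(nodeTest, text) {
  return `//${nodeTest}[contains(., ${xpathLiteral(text)})]`;
}
```

The resulting string can be passed to `page.$x(...)` in the same way as the hand-written expression above.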

Sam R.

Thanks for taking a look @sam-r ! Your code is better than mine. Do you also happen to know how I might delete the DOM elements? – allanberry Jan 05 '21 at 19:30

@aljabear, Just call `el.remove()` instead of `el.innerText`. – Sam R. Jan 05 '21 at 19:48

Thank you, I really appreciate it! :) – allanberry Jan 05 '21 at 22:28
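
The suggestion in these comments, calling `el.remove()` on each matched handle, could look roughly like the sketch below. It assumes a Puppeteer version that still provides `page.$x`, as used in the answer; newer releases may need a different XPath entry point. `removeByXPath` is a hypothetical helper name, not a Puppeteer API.

```javascript
// Hypothetical helper: remove every node matched by an XPath expression.
// `page` is expected to be a Puppeteer Page that provides page.$x.
async function removeByXPath(page, expression) {
  const handles = await page.$x(expression);
  for (const handle of handles) {
    // el.remove() runs inside the page; it works for elements and comments.
    await page.evaluate((el) => el.remove(), handle);
    // Release the handle now that the node is detached.
    await handle.dispose();
  }
  return handles.length; // number of nodes that were matched
}
```

For the example page, `await removeByXPath(page, "//script[contains(.,'delete me')]")` would be one way to call it.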

allanberry · Answer 2 · 2021-01-07T16:29:40.393

Note for my future self: here is the full code I wrote incorporating @sam-r's solution, in this case stripping the elements that the Wayback Machine adds to a rendered archive entry:

    // remove elements by XPath
    const handles = [
      ...(await page.$x("//script[contains(.,'__wm')]")),
      ...(await page.$x("//script[contains(.,'archive.org')]")),
      ...(await page.$x("//style[contains(.,'margin-top:0 !important;\n  padding-top:0 !important;\n  /*min-width:800px !important;*/')]")),
      ...(await page.$x("//comment()[contains(.,'WAYBACK')]")),
      ...(await page.$x("//comment()[contains(.,'Wayback')]")),
      ...(await page.$x("//comment()[contains(.,'playback timings (ms)')]")),
    ];
    // forEach does not wait for async callbacks, so remove the nodes in an awaited loop
    for (const handle of handles) {
      await page.evaluate((el) => el.remove(), handle);
    }

    // remove elements by CSS selector
    await page.evaluate(() => {
      [
        document.querySelector('link[href*="/_static/css/banner-styles.css"]'),
        document.querySelector('link[href*="/_static/css/iconochive.css"]'),
        ...document.querySelectorAll("#wm-ipp-base"), // wayback header
        ...document.querySelectorAll('script[src*="wombat.js"]'),
        ...document.querySelectorAll('script[src*="archive.org"]'),
        ...document.querySelectorAll('script[src*="playback.bundle.js"]'),
        ...document.querySelectorAll("#donato"), // wayback donation header
      ]
        .filter(Boolean) // querySelector returns null when nothing matches
        .forEach((element) => element.remove());
    });
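
For the other half of the original question, removing everything between a pair of marker comments, a plain DOM walk may be simpler than XPath. The sketch below is only that, a sketch: it assumes the BEGIN and END comments are siblings under the same parent, as in the example HTML, and if an END marker is missing it removes everything up to the end of that parent. `stripMarkedRegions` is a hypothetical name.

```javascript
// Hypothetical helper: remove each BEGIN comment, the matching END comment,
// and every sibling node in between. Returns the number of nodes removed.
function stripMarkedRegions(root, beginText, endText) {
  const COMMENT_NODE = 8; // same value as Node.COMMENT_NODE in the browser
  let removed = 0;
  let inside = false;
  // Copy childNodes first, because removing nodes mutates the live list.
  for (const node of Array.from(root.childNodes)) {
    const isComment = node.nodeType === COMMENT_NODE;
    if (!inside && isComment && node.data.includes(beginText)) {
      inside = true;
    }
    if (inside) {
      const isEnd = isComment && node.data.includes(endText);
      node.remove();
      removed += 1;
      if (isEnd) inside = false;
    } else {
      // Not in a marked region: look for markers deeper in the tree.
      removed += stripMarkedRegions(node, beginText, endText);
    }
  }
  return removed;
}
```

Since the function needs the page's `document`, one way to run it from Puppeteer is to pass its source as a string expression, for example ``await page.evaluate(`(${stripMarkedRegions})(document, 'DELETE ME BEGIN', 'DELETE ME END')`)``.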
