I want to strip some elements and comments from the DOM within Puppeteer. These items do not have identifiable IDs, classes, or attributes which I can select using CSS. However, they may be identified by internal strings, and some elements may be wrapped in human-readable comments. My attempts so far:
- Using CSS selectors does not seem possible, since these items have no ID, class, or attribute to match on, and there is no CSS `contains()` selector. So I tried to do it with XPath...
- Some elements may be selected (and potentially removed?) using XPath, but I'm a rookie with both Puppeteer and XPath. I have provided my aborted attempt below.
- I might instead use a regular expression, but I don't know how to remove strings from the DOM after its HTML has been parsed.
Any ideas? Thanks.
So, in the following example, I would like to delete the elements between the `<!-- DELETE ME ... -->` comments, as well as the `<!-- DELETE ME ... -->` comments at the end:
<html>
<head>
<!-- DELETE ME BEGIN -->
<script>
// delete me
console.log('delete me')
</script>
<!-- DELETE ME END -->
<title>Page Title</title>
</head>
<body>
<!-- DELETE ME BEGIN -->
<style>
body {
/* delete me */
color: red;
}
</style>
<script>
// delete me
console.log('delete me')
</script>
<!-- DELETE ME END-->
<style>
body {
/* keep me */
color: green;
}
</style>
<script>
// keep me
console.log("keep me")
</script>
<p>Keep me</p>
<!-- keep me -->
</body>
</html>
<!-- DELETE ME -->
<!-- DELETE ME TOO -->
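For the regular-expression idea, one rough sketch is to work on the serialized markup rather than on the parsed DOM. The helper name `stripMarked` is made up for illustration, and the patterns assume the markers are written as in the example above: upper-case `DELETE ME`, with `BEGIN`/`END` pairs that are not nested.

```javascript
// Hypothetical helper (name is an assumption): strips marked regions from an
// HTML string. Marker wording is copied from the example markup above.
function stripMarked(html) {
  return html
    // 1. Drop everything from a BEGIN comment through the next END comment.
    //    \s* tolerates missing spaces, as in "<!-- DELETE ME END-->".
    .replace(/<!--\s*DELETE ME BEGIN\s*-->[\s\S]*?<!--\s*DELETE ME END\s*-->/g, "")
    // 2. Drop any remaining standalone "DELETE ME ..." comments.
    //    Matching is case-sensitive, so "<!-- keep me -->" and "// delete me" survive.
    .replace(/<!--\s*DELETE ME\b[\s\S]*?-->/g, "");
}
```

In Puppeteer this could be applied to the string returned by `page.content()`. It only cleans that string; the live page stays unchanged.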
Puppeteer/XPath code (just an attempt, does not yet do anything):
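const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Forward the page's console output. type() and text() are the public
  // ConsoleMessage accessors; _type and _text are internal fields.
  page.on("console", (msg) => console.log(`[${msg.type()}] ${msg.text()}`));
  // NOTE: nothing is loaded yet; a page.goto() or page.setContent() call belongs here.
  const html = await page.evaluate(() => {
    const evaluator = new XPathEvaluator();
    // A snapshot result can be counted and iterated even if nodes are removed later.
    const result = evaluator.evaluate(
      "//script[contains(., 'delete me')]",
      document,
      null,
      XPathResult.ORDERED_NODE_SNAPSHOT_TYPE
    );
    // Log a count, since the XPathResult object itself is not useful as console text.
    console.log(`matched ${result.snapshotLength} script element(s)`);
    return document.documentElement.outerHTML;
  });
  await browser.close();
})();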
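One possible direction that avoids XPath altogether is to walk the parsed DOM and track the marker comments directly. The following is only a sketch: `removeMarked` is a hypothetical helper name, the marker strings are copied from the example markup, and it assumes each BEGIN comment and its END comment are siblings under the same parent, as they are in the example.

```javascript
// Sketch only: `removeMarked` is a made-up name, not a Puppeteer or DOM API.
// It is written to be self-contained so it can be passed to page.evaluate().
function removeMarked(root = document) {
  const COMMENT_NODE = 8; // same value as Node.COMMENT_NODE
  let removed = 0;

  function walk(parent) {
    let deleting = false; // true while between a BEGIN and an END comment
    // Snapshot the children first, because removing nodes mutates childNodes.
    for (const child of Array.from(parent.childNodes)) {
      const text = child.nodeType === COMMENT_NODE ? child.data.trim() : null;
      if (text !== null && text.startsWith("DELETE ME")) {
        if (text === "DELETE ME BEGIN") deleting = true;
        else if (text === "DELETE ME END") deleting = false;
        child.remove(); // marker comments themselves are always removed
        removed++;
      } else if (deleting) {
        child.remove(); // any node between BEGIN and END
        removed++;
      } else {
        walk(child); // keep this node, but look for markers inside it
      }
    }
  }

  walk(root);
  return removed; // number of nodes removed
}

// Possible use inside the Puppeteer script (not executed here):
//   const count = await page.evaluate(removeMarked);
//   const html = await page.content();
```

Starting from `document` rather than `document.documentElement` matters here, because comments that follow `</html>` are attached to the document node itself. Note that removing a `<script>` element after the page has loaded does not undo anything that script has already executed.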