Your goal isn't entirely clear, and more context would help avoid an XY problem: you're asking about your attempted solution, but it may not be the best approach to your underlying problem, presumably scraping some data. Based on the discussion, though, I have a few remarks and can try to point you toward a more workable approach:
- `exposeFunction` can't work with DOM nodes. DOM nodes aren't serializable, so they deserialize to empty objects by the time you try to work with them in Node. 99.9% of the time, you don't need this function. If you do want to use it to pass data to Node, process the data in the browser first so it's serializable.
- Puppeteer already provides convenient wrappers on top of `MutationObserver`, such as `waitForFunction` and `waitForSelector`, so 99.9% of the time, use the Puppeteer library instead of installing your own low-level observer. Even when you really need to roll your own listener, you can often get by with a `requestAnimationFrame` or `setTimeout` polling loop, which is syntactically simpler, assuming performance isn't critical (you can start with a RAF, then upgrade to an observer once you have the basics working).
- Try to avoid element handles unless you have no other choice or really need trusted events. They're awkward to code with if you need to do anything other than click or type, and they can cause race conditions. `$$eval` and `$eval` are much more direct ways to extract data.
Now, YouTube is notoriously annoying to scrape and there are thousands of comments on the sample page you've shown. I suggest avoiding their DOM whenever possible, and in this case, it appears to be possible. You can monitor the network requests and capture the JSON payloads that are returned as you scroll down the page. I remove the DOM nodes as I scroll to avoid performance degradation as the comment list gets long.
I'm using a simpler test case so the script will complete in a reasonable amount of time. You might want to write the data to disk as you work, then process it later offline. I'm throwing in some pre-processing to show how you might traverse the result structure to get whatever data you want.
There are a couple of raw timeouts in here, which isn't good practice, but I don't have time to replace them with true predicates; I'll revisit this later.
```js
const fs = require("node:fs/promises");
const puppeteer = require("puppeteer"); // ^20.2.1

require("util").inspect.defaultOptions.depth = null;

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  const ua =
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36";
  await page.setUserAgent(ua);
  //const url = "https://www.youtube.com/watch?v=XCUZSS54drI"; // the real deal
  const url = "https://www.youtube.com/watch?v=N9RUqGYuGfw"; // smaller test URL
  await page.setRequestInterception(true);
  const results = [];
  let lastChunk = Date.now();

  // capture the comment JSON payloads as they stream in
  page.on("response", async res => {
    if (
      !res
        .request()
        .url()
        .startsWith("https://www.youtube.com/youtubei/v1/next?key=")
    ) {
      return;
    }
    lastChunk = Date.now();
    const data = await res.json();
    results.push(
      ...data.onResponseReceivedEndpoints
        .flatMap(e =>
          (
            e.reloadContinuationItemsCommand ||
            e.appendContinuationItemsAction
          ).continuationItems.flatMap(
            e => e.commentThreadRenderer?.comment.commentRenderer
          )
        )
        .filter(Boolean)
    );
    console.log("scraped so far:", results.length);
  });

  // block resources we don't need to speed things up
  const blockedResourceTypes = [
    "xhr",
    "image",
    "font",
    "media",
    "other",
  ];
  const blockedUrls = [
    "gstatic",
    "accounts",
    "googlevideo",
    "doubleclick",
    "syndication",
    "player",
    "web-animations",
  ];
  page.on("request", req => {
    if (
      blockedResourceTypes.includes(req.resourceType()) ||
      blockedUrls.some(e => req.url().includes(e))
    ) {
      req.abort();
    } else {
      req.continue();
    }
  });
  await page.goto(url, {
    waitUntil: "domcontentloaded",
    timeout: 60_000,
  });

  // not good, but page context is destroyed with redirects
  await new Promise(r => setTimeout(r, 20_000));

  const scroll = () =>
    page.evaluate(() => {
      const scrollingElement =
        document.scrollingElement || document.body;
      scrollingElement.scrollTop = scrollingElement.scrollHeight;
    });

  // keep scrolling until no new comment chunk has arrived for 30 seconds
  lastChunk = Date.now();
  while (Date.now() - lastChunk < 30_000) {
    await scroll();
    // trim rendered comment nodes so the growing DOM doesn't bog down the page
    await page.$$eval("ytd-comment-thread-renderer", els =>
      els.length > 80 && els.slice(0, -40).forEach(e => e.remove())
    );
  }
  await fs.writeFile(
    "comments.json",
    JSON.stringify(results, null, 2)
  );

  // example post-processing: pull out just the author and comment text
  const simpleResults = results.map(e => ({
    author: e.authorText.simpleText,
    text: e.contentText.runs.find(
      e => Object.keys(e).length === 1 && e.text?.trim()
    )?.text,
  }));
  await fs.writeFile(
    "simple-comments.json",
    JSON.stringify(simpleResults, null, 2)
  );
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
```
Note that due to Puppeteer issue #10033, you'll need to avoid Puppeteer versions between 18.2.1 and 20.0.0 to automate YouTube.