
Detect new nodes in the DOM with MutationObserver and return an ElementHandle through page.exposeFunction in Puppeteer

Does anyone know if it's possible to use a MutationObserver so that, when a new child is added to a part of the DOM, it detects the node and passes an ElementHandle for it to a Puppeteer exposeFunction callback for processing? I'm trying to extract the content of each node as it's added and then scrape it. Any help would be appreciated. This is what I've tried so far (selector is the selector for each of the cards):

await page.exposeFunction('getItem', function (element) {
  // const elementHandle = await page.evaluateHandle((el) => el, node);
  console.log('A card was added to the queue.'.blue);
  console.log(element);
  tarjetaQueue.push(element);
});

await page.evaluate(selector => {
  // Observe the card container to detect new cards added to the DOM
  const observer = new MutationObserver(async mutationsList => {
    console.log(selector);
    try {
      for (const mutation of mutationsList) {
        if (mutation.type === 'childList' && mutation.addedNodes.length) {
          for (const node of mutation.addedNodes) {
            const element = node.querySelector(selector);
            getItem(element);
          }
        }
      }
    } catch (error) {
      console.log('Error in the mutation observer. '.red + error);
      return null;
    }
  });

  try {
    const contenedorFeed = document.querySelector('#sections[section-identifier="comment-item-section"]:not([static-comments-header]) #contents.style-scope.ytd-item-section-renderer');
    observer.observe(contenedorFeed, { childList: true });
  } catch (error) {
    console.log('No cards were found. '.red + error);
  }
}, selector);

And what I need from each comment is an ElementHandle like the one obtained by doing elementHandle = await comment.$(selector). I don't know if this is possible through a MutationObserver, because my existing code works with those.

  • This seems like an [xy problem](https://meta.stackexchange.com/a/233676/399876). Puppeteer adds mutation observers or polls for you when you use `waitForSelector` or `waitForFunction`, so 99.9% of the time you don't need to do it yourself. What are you trying to achieve? It's better to present the site you're scraping and the goal you want, then ask for the best way to do it while showing your attempt as one possibility, rather than omitting context and focusing too heavily on your specific approach. – ggorlen May 18 '23 at 18:22
  • @ggorlen ty for your comment. I'm trying to scrape comments from a page, this page adds comments every time you scroll down. The idea is that when these new nodes are detected, their context is extracted. If I use waitForSelector, all the elements with said selector will be selected, but I only need the ones that will be created. – Saul Guevara May 18 '23 at 18:35
  • I can add an answer if you can share the site, but yeah, this sort of thing can be done much more simply than manually adding a mutation observer, as I suggested above. For example, [this answer](https://stackoverflow.com/a/73644761/6243352) shows how to scrape a similar scroll feed with `waitForFunction`. Many other approaches are possible, like removing the nodes you've already scraped so you won't re-scrape them. But everything depends on the behavior of the actual site--since every site is unique, it's almost impossible to guess the code that'd work for your case. – ggorlen May 18 '23 at 18:55
  • @ggorlen I'm trying in any YT page comments section. – Saul Guevara May 18 '23 at 19:38
  • Please pick a specific one and show your expected output. If I guess wrong, I've wasted my time and yours. – ggorlen May 18 '23 at 19:38
  • @ggorlen i am using this for my tests: https://youtu.be/XCUZSS54drI – Saul Guevara May 18 '23 at 19:42
  • Okay, thanks, and what output are you expecting exactly? Please [edit] the post to show the data structure you want. – ggorlen May 18 '23 at 19:44
  • And what I need from each comment is an elementHandle like the one obtained by doing: elementHandle = await comment.$(selector). I don't know if this is possible through MutationObserver, because my code works with those – Saul Guevara May 18 '23 at 19:52
  • Why do you need element handles? I usually avoid them. Isn't there something you want to do with the element handles? Again, please focus on the [final result](https://meta.stackexchange.com/a/233676/399876) rather than the Puppeteer code you think you may need to get there. – ggorlen May 18 '23 at 19:53
  • I use a loop of instructions that processes every comment and scrapes its information; I just need the element handle. Sorry if I don't understand, I'm a bit of a noob – Saul Guevara May 18 '23 at 19:57
  • What processing and data do you want? I don't normally use element handles unless I have to, and they don't seem necessary here. You seem to be confusing the result with the method used to get the result. If you tell me the result you want, I can get it in the best way possible. – ggorlen May 18 '23 at 20:43
  • @ggorlen Look, I'll give you an example: this code will generate an array of ElementHandles of cards (I understand it that way): cards = await page.$$(selector); Then I have a loop like this that scrapes each one of them: for (const card of cards) { scrapingMethod(); }; What I'm looking for is to generate an object like the one that would be generated by card = await page.$(selector), with the help of MutationObserver to detect when a card is added and generate or extract this object from the DOM, if that's possible. – Saul Guevara May 18 '23 at 21:46
  • I see, you already have the scraping method complete. That's fair enough. I'll get to this when I have time. Thanks for your patience. – ggorlen May 18 '23 at 22:43
  • @ggorlen Thank you very much for your help, I'll wait, no problem. – Saul Guevara May 18 '23 at 23:30

1 Answer


Your goal is not entirely clear and more context would help avoid an xy problem. You're asking about your attempted solution, but it may not be the optimal approach to solve your underlying problem, presumably scraping some data. But based on the discussion, I have a few remarks and can try to point you to a more workable approach:

  • exposeFunction can't work with DOM nodes. DOM nodes aren't serializable, so they deserialize to empty objects by the time you try to work with them in Node. 99.9% of the time, you don't need this function. If you do want to use it to pass data to Node, process the data in the browser so it's serializable.
  • Puppeteer already provides convenient wrappers on top of MutationObserver such as waitForFunction and waitForSelector, so 99.9% of the time use the Puppeteer library instead of installing your own low-level observer. Even when you really need to roll your own listener, you can often get by with a requestAnimationFrame or setTimeout as a polling loop, which is syntactically simpler, assuming performance isn't critical (you can start with a RAF, then upgrade to an observer once you have the basics working).
  • Try to avoid element handles unless you have no other choice or really need trusted events. They're awkward to code with if you need to do anything other than click or type, and can cause race conditions. $$eval and $eval are much more direct ways to extract data (see the sketch after this list).
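
To make the last two points concrete, here's a minimal sketch of the waitForSelector/$$eval pattern. The comment selectors (ytd-comment-thread-renderer, #author-text, #content-text) are assumptions about YouTube's current markup and may need adjusting, and on a watch page you generally have to scroll before any comments render at all:

const puppeteer = require("puppeteer"); // ^20.2.1

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  // you may also need the user agent override from the full script below
  await page.goto("https://www.youtube.com/watch?v=N9RUqGYuGfw", {
    waitUntil: "domcontentloaded",
  });

  // comments only start loading once the section scrolls into view
  await page.evaluate(() => window.scrollBy(0, 2000));

  // waitForSelector observes/polls for you -- no hand-rolled MutationObserver needed
  await page.waitForSelector("ytd-comment-thread-renderer");

  // $$eval runs in the browser and returns plain, serializable data,
  // so no DOM nodes ever need to cross to Node through exposeFunction
  const comments = await page.$$eval("ytd-comment-thread-renderer", els =>
    els.map(el => ({
      author: el.querySelector("#author-text")?.textContent.trim(),
      text: el.querySelector("#content-text")?.textContent.trim(),
    }))
  );
  console.log(comments);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

The key point is that only serializable data crosses the browser/Node boundary, whether it travels through $eval/$$eval or exposeFunction.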

Now, YouTube is notoriously annoying to scrape and there are thousands of comments on the sample page you've shown. I suggest avoiding their DOM whenever possible, and in this case, it appears to be possible. You can monitor the network requests and capture the JSON payloads that are returned as you scroll down the page. I remove the DOM nodes as I scroll to avoid performance degradation as the comment list gets long.

I'm using a simpler test case so the script will complete in a reasonable amount of time. You might want to write the data to disk as you work, then process it later offline. I'm throwing in some pre-processing to show how you might traverse the result structure to get whatever data you want.

There are a couple of timeouts, not good practice, but I don't have time to replace them with true predicates. I'll revisit this later (a rough sketch of one possible replacement follows the script below).

const fs = require("node:fs/promises");
const puppeteer = require("puppeteer"); // ^20.2.1
require("util").inspect.defaultOptions.depth = null;

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  const ua =
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36";
  await page.setUserAgent(ua);
  //const url = "https://www.youtube.com/watch?v=XCUZSS54drI"; // the real deal
  const url = "https://www.youtube.com/watch?v=N9RUqGYuGfw"; // smaller test URL
  await page.setRequestInterception(true);

  const results = [];
  let lastChunk = Date.now();
  page.on("response", async res => {
    if (
      !res
        .request()
        .url()
        .startsWith(
          "https://www.youtube.com/youtubei/v1/next?key="
        )
    ) {
      return;
    }

    lastChunk = Date.now();
    const data = await res.json();
    results.push(
      ...data.onResponseReceivedEndpoints
        .flatMap(e =>
          (
            e.reloadContinuationItemsCommand ||
            e.appendContinuationItemsAction
          ).continuationItems.flatMap(
            e => e.commentThreadRenderer?.comment.commentRenderer
          )
        )
        .filter(Boolean)
    );
    console.log("scraped so far:", results.length);
  });

  const blockedResourceTypes = [
    "xhr",
    "image",
    "font",
    "media",
    "other",
  ];
  const blockedUrls = [
    "gstatic",
    "accounts",
    "googlevideo",
    "doubleclick",
    "syndication",
    "player",
    "web-animations",
  ];
  page.on("request", req => {
    if (
      blockedResourceTypes.includes(req.resourceType()) ||
      blockedUrls.some(e => req.url().includes(e))
    ) {
      req.abort();
    } else {
      req.continue();
    }
  });

  await page.goto(url, {
    waitUntil: "domcontentloaded",
    timeout: 60_000,
  });

  // not good, but page context is destroyed with redirects
  await new Promise(r => setTimeout(r, 20_000));

  const scroll = () =>
    page.evaluate(() => {
      const scrollingElement =
        document.scrollingElement || document.body;
      scrollingElement.scrollTop = scrollingElement.scrollHeight;
    });

  lastChunk = Date.now();
  while (Date.now() - lastChunk < 30_000) {
    await scroll();
    await page.$$eval("ytd-comment-thread-renderer", els =>
      els.length > 80 && els.slice(0, -40).forEach(e => e.remove())
    );
  }

  await fs.writeFile(
    "comments.json",
    JSON.stringify(results, null, 2)
  );
  const simpleResults = results.map(e => ({
    author: e.authorText.simpleText,
    text: e.contentText.runs.find(
      e => Object.keys(e).length === 1 && e.text?.trim()
    )?.text,
  }));
  await fs.writeFile(
    "simple-comments.json",
    JSON.stringify(simpleResults, null, 2)
  );
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
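
On the timeout caveat above: one possible direction, as an untested sketch rather than a drop-in fix, is to replace the fixed 20-second sleep with a wait on something observable. Here "ytd-comments" is assumed to be the placeholder element for the comment section on a watch page, and since the redirect the sleep guards against can still destroy the execution context mid-wait, the sketch retries:

// possible replacement for: await new Promise(r => setTimeout(r, 20_000));
const waitForComments = async () => {
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      // "ytd-comments" is an assumption about the comment section's tag name
      await page.waitForSelector("ytd-comments", {timeout: 30_000});
      return;
    } catch (err) {
      // a redirect likely destroyed the execution context -- retry on the new one
    }
  }
  throw new Error("comment section never appeared");
};
await waitForComments();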

Note that due to issue #10033, you'll need to avoid Puppeteer versions between 18.2.1 and 20.0.0 to automate YouTube.

ggorlen
  • the issue #10033 you mentioned has been resolved as of puppeteer version `^20.0.0` and up, I've updated my answer here a while ago. [link](https://stackoverflow.com/questions/76054650/puppeteer-tab-crashes-only-on-youtube-com-aw-snap-status-access-violation/76057371#76057371) – idchi May 19 '23 at 09:21
  • @idchi Thanks, updated. It seems like they should close the issue if it's resolved. – ggorlen May 19 '23 at 16:00