
I am using the code below to scroll all the way to the bottom of a YouTube page, and it works. My question is: after the site has been scrolled down to the bottom, how can I console.log that the bottom has been reached?

Note: the solution should work with youtube.com. I have already tried getting the document height and comparing it with the scroll height, but that didn't work!

const puppeteer = require('puppeteer');

let thumbArr = []
const scrapeInfiniteScrollItems = async(page) => {
  while (true) {
    const previousHeight = await page.evaluate(
      "document.querySelector('ytd-app').scrollHeight"
    );
    await page.evaluate(() => {
      const youtubeScrollHeight =
        document.querySelector("ytd-app").scrollHeight;
      window.scrollTo(0, youtubeScrollHeight);
    });
    await page.waitForFunction(
      `document.querySelector('ytd-app').scrollHeight > ${previousHeight}`, {
        timeout: 0
      }
    );

    const thumbnailLength = (await page.$$('ytd-grid-video-renderer')).length
    // This logs the thumbnail count on every loop, but once the bottom has been reached it stops logging (obviously). The question is how I am supposed to compare the last thumbnail count found with the total number of thumbnails once the loop has stopped running. Take a look below to better understand my question.
    thumbArr.push(thumbnailLength)

    if (thumbnailLength == thumbArr.at(-1)) {
      console.log('bottom has been reached')
    }

    await page.waitForTimeout(1000)
  }
};

(async() => {
  const browser = await puppeteer.launch({
    headless: false
  });
  const page = await browser.newPage();
  await page.goto('https://www.youtube.com', {
    waitUntil: 'networkidle2',
  });

  await scrapeInfiniteScrollItems(page)
})();

UPDATE:

let clientHeightArr = []
let clientHeightArrTracker = []
const scrapeInfiniteScrollItems = async(browser, page) => {
  var infiniteScrollTrackerInterval = setInterval(async() => {
    clientHeightArrTracker.push(clientHeightArr.length)
    if (clientHeightArrTracker.some((e, i, arr) => arr.indexOf(e) !== i) == true) {
      clearInterval(infiniteScrollTrackerInterval)
      console.log('Bottom is reached')
      //causes error "ProtocolError: Protocol error (Runtime.callFunctionOn): Target closed."
      await browser.close()
    }
  }, 2000)
  while (true) {
    const previousHeight = await page.evaluate(
      "document.querySelector('ytd-app').scrollHeight"
    );

    await page.evaluate(() => {
      const youtubeScrollHeight =
        document.querySelector("ytd-app").scrollHeight;
      window.scrollTo(0, youtubeScrollHeight);
    });

    await page.waitForFunction(
      `document.querySelector('ytd-app').scrollHeight > ${previousHeight}`, {
        timeout: 0
      },
    );

    const clientHeight = await page.$$eval("ytd-app", el => el.map(x => x.clientHeight));
    clientHeightArr.push(clientHeight[0])
    await page.waitForTimeout(1000)
  }
};

(async() => {
  const browser = await puppeteer.launch({
    headless: false
  });
  const page = await browser.newPage();
  await page.goto('https://www.youtube.com/c/mkbhd/videos', {
    waitUntil: 'networkidle2',
  });

  await scrapeInfiniteScrollItems(browser, page)
})();
  • Where are you doing the check you speak of? It should work, maybe with a delta just in case there's a small difference. Print the two values and debug why it wasn't detecting the end. You could also count the number of video element thumbnails (or whatever) on the page between iterations and if it stops changing, you're done. `await new Promise((resolve) => setTimeout(resolve, 1000));` should be `await page.waitForTimeout(1000)`, although almost always there's a `page.waitForFunction` that's more precise (probably card/thumbnail counting again). – ggorlen Sep 08 '22 at 04:05
  • BTW, depending on what data you're trying to get, you may not need to scroll at all, so the whole thing is often an [xy problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/233676#233676) that can be resolved if you provide context for why you need to scroll in the first place. Often, the data is in a network request or static HTML and you can grab it without much effort. – ggorlen Sep 08 '22 at 04:08
  • @ggorlen this is the data I'm trying to grab: ```const title = await page.$$eval(".ytd-grid-video-renderer #video-title", el => el.map(x => x.getAttribute("title")));``` – seriously Sep 08 '22 at 04:18
  • What page is this on? – ggorlen Sep 08 '22 at 04:47
  • @ggorlen this for example ```https://www.youtube.com/c/mkbhd/videos``` – seriously Sep 08 '22 at 04:50
  • OK, seems pretty straightforward going back to my first suggestion: count the length of the elements after each selection and if it doesn't change after a couple of scroll triggers, you're done. The data is coming in through the /browse endpoint, so you could monitor those responses if you need more of the data than the title. Or use the YT API. – ggorlen Sep 08 '22 at 05:01
  • @ggorlen I wanted to use the YT API but they require me to either sign up for Google Workspace or submit my app for verification – seriously Sep 08 '22 at 05:05
  • @ggorlen Can you take a look at my updated code? I have commented my question there. – seriously Sep 08 '22 at 05:25
  • Good start, but try saving the last thumbnail length outside the loop (or an array of thumbnails you've collected so far), then compare them (a rough sketch of this appears after the comment thread). – ggorlen Sep 08 '22 at 05:29
  • @ggorlen you mean like this (updated code)? – seriously Sep 08 '22 at 05:35
  • Sure, something like that, although if the height doesn't change then you'll time out, and the array should store all of the results you've collected rather than the lengths; old lengths don't matter, only the last one. The goal is to ultimately return an array of all the thumbnail title results, right? So you might as well build that as you go (or do it all at the end if you use a single `previousLength` variable). Some infinite scrollers like Twitter eventually kick out top cards/tweets/thumbnails and so you have to scrape as you scroll. – ggorlen Sep 08 '22 at 05:37
  • @ggorlen see, that's the problem: how can I check if the height doesn't change, when the only way to know the last height is once the loop is done? – seriously Sep 08 '22 at 05:40
  • We don't care about the height any more, just the number of cards. Loop goes like this: trigger a scroll which fires off a request. Wait for the response. Check how many cards there are. If there are no new cards since last time, break. Either store the new number of cards or push the new cards onto the result array and repeat. – ggorlen Sep 08 '22 at 05:43
  • OK, good attempt. I'll add an answer tomorrow if you haven't gotten something you're totally happy with (feel free to [self answer](https://stackoverflow.com/help/self-answer) if you do get something solid). – ggorlen Sep 08 '22 at 05:45
  • @ggorlen I took your advice and the updated code above kinda works; I didn't even get the length of the thumbnails. Please review the code. – seriously Sep 08 '22 at 05:45
  • @ggorlen the problem with the updated code is that when I close the browser I get the error ```ProtocolError: Protocol error (Runtime.callFunctionOn): Target closed.``` – seriously Sep 08 '22 at 05:48
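
For reference, here is a minimal sketch of the suggestion from the comments above (save the previous thumbnail count outside the loop and stop once it stops growing). It reuses the question's `ytd-grid-video-renderer` selector and assumes `page` is an open Puppeteer page; it is a rough illustration, not tested against current YouTube markup.

let previousCount = 0;
while (true) {
  // scroll to the current bottom of the page to trigger the next batch
  await page.evaluate(() => {
    window.scrollTo(0, document.querySelector("ytd-app").scrollHeight);
  });

  // crude fixed wait for the next batch to render; the answer below uses a more
  // precise page.waitForFunction on the element count instead
  await page.waitForTimeout(2000);

  const count = (await page.$$("ytd-grid-video-renderer")).length;
  if (count === previousCount) {
    console.log("bottom has been reached");
    break;
  }
  previousCount = count;
}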

1 Answer

There are a ton of possible approaches here. After messing with it for a bit I arbitrarily landed on the following strategy and decided it was sufficient for starters:

const puppeteer = require("puppeteer"); // ^19.0.0

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
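  // use a regular desktop Chrome user agent instead of the headless default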
  const ua =
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36";
  await page.setUserAgent(ua);
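  // abort image requests to cut bandwidth; everything else proceeds normally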
  await page.setRequestInterception(true);
  page.on("request", req => {
    if (req.resourceType() === "image") {
      req.abort();
    }
    else {
      req.continue();
    }
  });
  const url = "https://www.youtube.com/c/mkbhd/videos";
  await page.goto(url, {waitUntil: "domcontentloaded"});
  await page.waitForSelector("#video-title");
  let thumbs = [];

  for (;;) {
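    // wait for the title count to change; if it doesn't within 10 seconds, assume the bottom was reached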
    try {
      await page.waitForFunction(
        `${thumbs.length} !==
          document.querySelectorAll("#video-title").length`,
        {timeout: 10000}
      );
    }
    catch (err) {
      break;
    }

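    // scroll the last title into view to trigger the next batch, then collect every title loaded so far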
    thumbs = await page.$$eval("#video-title", els => {
      els.at(-1).scrollIntoView();
      return els.map(e => e.textContent);
    });
  }

  thumbs = thumbs.filter(Boolean);
  console.log("total length", thumbs.length);
  console.log("first 10 thumbs:", thumbs.slice(0, 10));
  console.log("last 10 thumbs:", thumbs.slice(-10));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Output:

total length 1460
first 10 thumbs: [
  'iPhone 14/Pro Impressions: Welcome to Dynamic Island!',
  'Google Pixel Buds Pro Review: Just Get These!',
  'The Hyundai IONIQ 5: I Get It Now!',
  'Android 13 Hands-On: Top 5 Features!',
  'Dope Tech: The Most Extreme Gaming Monitor!',
  'Samsung Z Fold 4/ Flip 4 Impressions + Watch 5 Pro!',
  'Best Back to School Tech 2022!',
  'OnePlus 10T Impressions: Somebody That You Used to Know',
  'Asus Zenfone 9: The New Compact King!',
  "The iPad's Odd New Feature"
]
last 10 thumbs: [
  'HQ Tutorial: Rocket Dock Application',
  'Camstudio Clarity',
  'Tutorial: Camstudio HQ',
  'HQ Tutorial: Ccleaner',
  '15 Year old Golf Swing Analysis',
  'Fraps HD Test in 1080p (18 WOS)',
  'HP Pavilion dv7t Media Center Remote Overview',
  'High fps LG Voyager footage',
  '14 Year knock-down shot (11 Handicap)',
  '13-Year-Old Golf Swing Analysis'
]

This code keeps an array of all thumbs collected so far. In the loop, I first check to see if there are new thumbs--if no new thumbs show up in 10 seconds or so, break and assume we've collected all of the thumbs. Otherwise, scroll the last thumb into view and collect all of the thumbs, extending the result list.

The "wait 10 seconds for new thumbs, then quit" is technically a race condition. If there's some lag, it could trigger a false positive, but it seems good enough for starters. Extending the timeout too generously keeps the script from exiting when the scrape is over, so it's not great for scraping lots of channels with few videos. Maybe better is to wait for the /browse API response after triggering each scroll. This response object has all sorts of additional data in it.

See also Puppeteer - scroll down until you can't anymore

ggorlen
  • https://stackoverflow.com/questions/16113125/two-semicolons-inside-a-for-loop-parentheses – Andy Jan 08 '23 at 19:57