I am trying to use Puppeteer to scroll all the way to the bottom of a site, but my code is not working. I set up a while loop that records the current page height, scrolls down, waits until the new height is greater than the previous height, and then pauses on a promise-based delay. For some reason it does not work. Where did I go wrong, and how can I fix it? Thanks in advance.

const puppeteer = require('puppeteer');

const scrapeInfiniteScrollItems = async(page) => {
  while (true) {
    const previousHeight = await page.evaluate('document.body.scrollHeight')
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
    await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`)
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
}


(async() => {
  const browser = await puppeteer.launch({
    headless: false
  });
  const page = await browser.newPage();
  await page.goto('https://www.youtube.com', {
    waitUntil: 'networkidle2',
  });

  await scrapeInfiniteScrollItems(page)
})();
ggorlen
seriously
  • I think this will be helpful: https://www.npmjs.com/package/puppeteer-autoscroll-down – Kashif Ghafoor Sep 07 '22 at 14:48
  • @KashifGhafoor I tried it, but it didn't help. – seriously Sep 07 '22 at 14:48
  • Your code works fine on this site: https://scrollmagic.io/examples/advanced/infinite_scrolling.html. Maybe the problem is with YouTube. – Kashif Ghafoor Sep 07 '22 at 15:06
  • @KashifGhafoor Yeah, definitely something is up with YouTube. I tried it on another site too. – seriously Sep 07 '22 at 15:13
  • What data are you trying to get ultimately? You can often get it by other means than scrolling, like monitoring a network request. If you need to scroll, there's a canonical thread with many additional techniques you'll probably want to try: [Puppeteer - scroll down until you can't anymore](https://stackoverflow.com/questions/51529332/puppeteer-scroll-down-until-you-cant-anymore?rq=1) – ggorlen Sep 07 '22 at 15:20
  • ` const scrapeInfiniteScrollItems = async (page: puppeteer.Page) => { while (true) { await page.evaluate(() => { window.scrollBy(0, window.outerHeight); }); await new Promise((resolve) => setTimeout(resolve, 5000)); } }; ` – Kashif Ghafoor Sep 07 '22 at 15:20
  • The code above works on YouTube; I've tried it. – Kashif Ghafoor Sep 07 '22 at 15:21
  • @ggorlen I have been trying to find a way to contact you for a week. you haven't mentioned anything on GitHub or your portfolio link. I am a computer science student and want to have a discussion with you. I don't know what medium of communication you like here is my email kashifghafoor140@gmail.com. if you can ping me. waiting for your response. – Kashif Ghafoor Sep 07 '22 at 15:24

1 Answer

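On YouTube, the height of body is 0, which is why your function is not working: the condition document.body.scrollHeight > previousHeight compares against a height that never grows, so waitForFunction never resolves. If you inspect YouTube in devtools, you can see that the whole content lives inside the ytd-app element.

So we should use document.querySelector('ytd-app').scrollHeight instead of document.body.scrollHeight to scroll down to the bottom.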
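
One way to check this on a given page is to compare the scroll heights of a few candidate elements. This is a minimal sketch, assuming a hypothetical helper named measureScrollHeights (not part of Puppeteer) that is handed to page.evaluate so it runs in the page context:

```javascript
// Hypothetical helper (not a Puppeteer API): returns the scrollHeight of the
// first element matching each selector, or null when nothing matches.
// It reads the global `document`, so it is meant to run in the page context.
function measureScrollHeights(selectors) {
  const report = {};
  for (const selector of selectors) {
    const el = document.querySelector(selector);
    report[selector] = el ? el.scrollHeight : null;
  }
  return report;
}

// Assumed usage with an already opened Puppeteer `page`:
//   const heights = await page.evaluate(measureScrollHeights, ['body', 'ytd-app']);
//   console.log(heights);
```

Whichever selector reports a height that grows as content loads is the one to watch in the scroll loop.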

Working code:

const scrapeInfiniteScrollItems = async (page: puppeteer.Page) => {
  while (true) {
    // Record the height of YouTube's content container before scrolling.
    const previousHeight = await page.evaluate(
      "document.querySelector('ytd-app').scrollHeight"
    );
    // Scroll the window to the current bottom of that container.
    await page.evaluate(() => {
      const youtubeScrollHeight =
        document.querySelector("ytd-app").scrollHeight;
      window.scrollTo(0, youtubeScrollHeight);
    });
    try {
      // Wait up to 5 seconds (timeout is in milliseconds) for more content.
      await page.waitForFunction(
        `document.querySelector('ytd-app')?.scrollHeight > ${previousHeight}`,
        { timeout: 5000 }
      );
    } catch {
      // The height did not grow in time: assume the bottom was reached.
      console.log("done");
      break;
    }
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
};
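
The stop condition used above (no height growth within a timeout) can also be separated from the page-specific selector. The following is only a sketch under stated assumptions: scrollUntilStable, scrollYoutube, and their option names are hypothetical and not part of Puppeteer; page.evaluate is the only Puppeteer call used.

```javascript
// Hypothetical helper: scrolls repeatedly until the measured height stops
// growing within `timeoutMs`, or until `maxRounds` successful rounds are done.
async function scrollUntilStable({
  getHeight,        // async function returning the current scroll height
  scrollToBottom,   // async function that scrolls to the current bottom
  pollMs = 250,     // how often to re-measure while waiting for growth
  timeoutMs = 5000, // how long to wait for growth before giving up
  maxRounds = 50,   // safety cap so truly endless feeds still terminate
}) {
  let rounds = 0;
  let previous = await getHeight();
  while (rounds < maxRounds) {
    await scrollToBottom();
    const deadline = Date.now() + timeoutMs;
    let current = await getHeight();
    // Poll until the height grows or the deadline passes.
    while (current <= previous && Date.now() < deadline) {
      await new Promise((resolve) => setTimeout(resolve, pollMs));
      current = await getHeight();
    }
    if (current <= previous) break; // no growth in time: treat as bottom
    previous = current;
    rounds++;
  }
  return { rounds, height: previous };
}

// Assumed wiring for YouTube, given an already opened Puppeteer `page`.
// Falls back to document.body when no ytd-app element exists.
async function scrollYoutube(page) {
  return scrollUntilStable({
    getHeight: () =>
      page.evaluate(
        () => (document.querySelector('ytd-app') || document.body).scrollHeight
      ),
    scrollToBottom: () =>
      page.evaluate(() => {
        const el = document.querySelector('ytd-app') || document.body;
        window.scrollTo(0, el.scrollHeight);
      }),
  });
}
```

Because the measuring and scrolling steps are passed in as functions, the same loop can be reused on sites where body does report a meaningful height. A timeout only shows that nothing loaded in time, so a slow network can look the same as the real bottom.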
Kashif Ghafoor
  • Yeah, this works great. How can I know when the bottom has been reached? – seriously Sep 07 '22 at 23:29
  • You can set the timeout to 5 or 10 seconds in page.waitForFunction. If the new height does not become greater than the previous height within that time, the function will throw an error, so put waitForFunction() in a try block. If you end up in the catch block, no more content was loaded: either you have reached the bottom, or the content did not load within the timeout you set. – Kashif Ghafoor Sep 07 '22 at 23:45
  • You mean something like this? ```try {await page.waitForFunction(`document.querySelector('ytd-app').scrollHeight > ${previousHeight}`, { timeout: 5 } ); } catch (error) { console.log('Done') }``` – seriously Sep 07 '22 at 23:54
  • If so, this doesn't work. – seriously Sep 08 '22 at 00:02
  • Your code is OK, but the timeout is in milliseconds, so for 5 seconds you should pass 5000. I have edited the code in the answer, and it works. – Kashif Ghafoor Sep 08 '22 at 09:32
  • take a look at this https://jsfiddle.net/8k0pgx6j/#&togetherjs=du9IlFnM6K – seriously Sep 08 '22 at 09:34
  • 1
    it seems Alright. – Kashif Ghafoor Sep 08 '22 at 09:44