0

This script scrolls a page that uses infinite scrolling and captures all the links.

It scrolls to the bottom repeatedly, which loads new content each time.

  1. How can I return the results?
  2. Moreover, how can I return the results in chunks, instead of appending partial results to the same array?

The script:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    userDataDir: "C:\\Users\\johndoe\\AppData\\Local\\Google\\Chrome\\User Data\\Default"
  });
  const page = await browser.newPage();
  await page.setViewport({
    width: 1920,
    height: 1080,
    deviceScaleFactor: 1,
  });
  await page.goto('https://www.facebook.com/groups/000000000000/members',{waitUntil: 'networkidle0'});
  page.on('console', msg => console.log('PAGE LOG:', msg.text()));  //subscribe to the page's console event so that logs emitted inside evaluate() are printed here

  let rawMembers = await page.evaluate(() => { 

    const intervall = 3000;
    let stop = false;
    document.addEventListener('keypress', e => stop = true);  //press a key to exit

    let results = [];

    let pageHeigth = 0;
    let timerId = setTimeout(function tick() {

      if ((stop === false) && (document.body.scrollHeight > pageHeigth)){

        pageHeigth = document.body.scrollHeight  //save the current page height
        document.scrollingElement.scrollTop = pageHeigth;  //scroll to the end of the page; this causes new content to be loaded (virtual scroll)

        console.log('PAGE HEIGTH: ', pageHeigth);

        //do the work (capture what I need, all the links in my case)
        const anchors = Array.from(document.querySelectorAll('a'));
        const serializableLinks = anchors.map(x => x.getAttribute("href"));   //convert to something serializable (string)
        results.concat(serializableLinks);

        timerId = setTimeout(tick, intervall);  //schedule a new timeout to repeat the function
      } 
      else
      {
        clearTimeout(timerId)
        console.log('Exit');
        return results;
      }

    }, intervall);
  });

  //await browser.close();
})();
ggorlen
pinale
  • Does this answer your question? [Puppeteer - scroll down until you can't anymore](https://stackoverflow.com/questions/51529332/puppeteer-scroll-down-until-you-cant-anymore) – ggorlen Feb 03 '21 at 23:57

2 Answers

1

You can use:

await page.evaluate(() => {
  document.scrollingElement.scrollTop = document.body.scrollHeight
})

This will scroll to the bottom of the page. If you want to scroll inside a specific DOM element instead, you can do the same on that element:

await page.evaluate(() => {
  let domElement = document.querySelector('YOUR_SELECTOR')  // placeholder: replace with the selector of your scrollable element
  domElement.scrollTop = domElement.scrollHeight
})
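For a page that keeps growing, the single scroll above can be repeated inside the page until the height stops increasing. The following is only a minimal sketch, not a definitive implementation: `scrollAndCollectLinks` is a hypothetical helper name, and it assumes that a stable `document.body.scrollHeight` after one interval means no more content is coming. The timer loop is wrapped in a Promise because `page.evaluate` waits for a returned Promise and hands its resolved value back to Node, whereas a `return` inside a `setTimeout` callback is discarded.

```javascript
// Runs inside the page. Scrolls to the bottom every `intervalMs` milliseconds
// and resolves with the unique hrefs once the page height stops growing.
// `scrollAndCollectLinks` is a hypothetical name chosen for this sketch.
function scrollAndCollectLinks(intervalMs) {
  return new Promise(resolve => {
    const links = new Set(); // a Set avoids collecting the same href twice
    let lastHeight = 0;

    const tick = () => {
      // Capture the links that are currently in the DOM.
      for (const anchor of document.querySelectorAll('a')) {
        const href = anchor.getAttribute('href');
        if (href) links.add(href);
      }

      const height = document.body.scrollHeight;
      if (height > lastHeight) {
        // The page grew: scroll to the new bottom and check again later.
        lastHeight = height;
        document.scrollingElement.scrollTop = height;
        setTimeout(tick, intervalMs);
      } else {
        // No growth since the last scroll: assume the end was reached.
        resolve(Array.from(links));
      }
    };

    tick();
  });
}

// Usage from Node, assuming `page` is an open Puppeteer page:
//   const links = await page.evaluate(scrollAndCollectLinks, 3000);
```

Note that a slow network can make the height appear stable before the next batch arrives, so the interval may need tuning for a given site.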
Sezerc
  • This scrolls to the end correctly, but once you get to the bottom of the page, new space becomes available at the bottom again and again (infinite scroll). How can I handle this use case? – pinale Jan 14 '21 at 08:56
  • There are many ways you can achieve this. You can do it with a recursive function: scroll to the end of the page, get all the data you want, and call the function again. You are just asking me how to program. It's up to you. – Sezerc Jan 14 '21 at 09:32
0

If you need the scrolling to look more human-like, you can try await page.keyboard.press("PageDown");
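Driving the key press from Node in a loop also makes it straightforward to hand back results in chunks rather than one growing array. The sketch below is an illustration under stated assumptions, not a tested recipe for any particular site: `pageDownChunks` and its options `delayMs`, `maxIdleRounds` and `maxRounds` are hypothetical names, the page is assumed to have keyboard focus, and "finished" is assumed to mean that the viewport sits at the bottom and the height has not changed for a few rounds.

```javascript
// Async generator: presses PageDown repeatedly and yields only the links
// that were not seen in earlier rounds, one chunk per round.
// `pageDownChunks` and its option names are hypothetical, for illustration.
async function* pageDownChunks(page, { delayMs = 1000, maxIdleRounds = 3, maxRounds = 1000 } = {}) {
  const seen = new Set();
  let lastHeight = 0;
  let idle = 0; // consecutive rounds spent at the bottom with no growth

  for (let round = 0; round < maxRounds && idle < maxIdleRounds; round++) {
    await page.keyboard.press('PageDown');
    // Give the page some time to load new content.
    await new Promise(resolve => setTimeout(resolve, delayMs));

    // Read the page state and the current hrefs (all serializable values).
    const { height, atBottom, links } = await page.evaluate(() => ({
      height: document.body.scrollHeight,
      atBottom: window.innerHeight + window.scrollY >= document.body.scrollHeight - 1,
      links: Array.from(document.querySelectorAll('a'), a => a.getAttribute('href')),
    }));

    idle = atBottom && height === lastHeight ? idle + 1 : 0;
    lastHeight = height;

    // Keep only the hrefs that no previous chunk contained.
    const fresh = [];
    for (const href of links) {
      if (href && !seen.has(href)) {
        seen.add(href);
        fresh.push(href);
      }
    }
    if (fresh.length > 0) yield fresh;
  }
}

// Usage from Node, assuming `page` is an open Puppeteer page:
//   for await (const chunk of pageDownChunks(page)) {
//     console.log('new links:', chunk);
//   }
```

Because each chunk is consumed as soon as it is produced, the caller can write it to a file or database immediately and never needs to hold the full list in memory.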

Mark O