I am trying to extract product data from a website that loads its product list as the user scrolls down. I am using Apify for this. My first thought was to see if somebody had already solved this, and I found two useful links: How to make the Apify Crawler to scroll full page when web page have infinite scrolling? and How to scrape dynamic-loading listing and individual pages using Apify?. However, when I tried to apply the functions they mention, my Apify crawler failed to load the content.

I am using a web-scraper based on the code in the basic web-scraper repository.

The website I am trying to get data out of is in this link. For the moment I am just learning, so I only want to get the data out of this one page; I do not need to navigate to other pages.

The PageFunction I am using is the following:

async function pageFunction(context) {
    // Establishing utility constants to use throughout the code
    const { request, log, skipLinks } = context;
    const $ = context.jQuery;
    const pageTitle = $('title').first().text();
    context.log.info('Wait for website to render');
    await context.waitFor(2000);

    // Function to scroll the page until it reaches the bottom
    const infiniteScroll = async (maxTime) => {
        const startedAt = Date.now();
        let itemCount = $('.upcName').length;
        
        for (;;) {
            log.info(`INFINITE SCROLL --- ${itemCount} initial items loaded ---`);
            // timeout to prevent infinite loop
            if (Date.now() - startedAt > maxTime) {
                return;
            }
            
            scrollBy(0, 99999);
            await context.waitFor(1000); 
            
            const currentItemCount = $('.upcName').length;
            log.info(`INFINITE SCROLL --- ${currentItemCount} items loaded after scroll ---`);

            if (itemCount === currentItemCount) {
                return;
            }
            itemCount = currentItemCount;

        }

    };

    context.log.info('Initiating scrolling function');
    await infiniteScroll(60000);
    context.log.info(`Scraping URL: ${context.request.url}`);

    const results = [];
    $(".itemGrid").each(function() {
        results.push({
            name: $(this).find('.upcName').text(),
            product_url: $(this).find('.nombreProductoDisplay').attr('href'),
            image_url: $(this).find('.lazyload').attr('data-original'),
            description: $(this).find('.block-with-text').text(),
            price: $(this).find('.upcPrice').text()
        });

    });

    return results;
}

I replaced the while(true){...} loop with a for(;;){...} because I was getting an Unexpected constant condition (no-constant-condition) ESLint error.

Also, I have tried varying the scroll distance and the wait periods.

In spite of all this, I cannot get the crawler to return more than 32 results.

Could someone please explain what I am doing wrong?

UPDATE: I continued to work on this and could not make it work from the Apify platform, so my original question still stands. However, I did manage to make the scroll function work by running the script from my PC.
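For reference, the item-count loop from the page function above can be factored into a standalone helper that takes the page operations as callbacks, which makes it easy to run and test outside the browser. This is only a sketch of the same logic; `getItemCount`, `scrollDown`, and `wait` are hypothetical stand-ins for the real page calls (`$('.upcName').length`, `scrollBy(...)`, and `context.waitFor(...)`):

```javascript
// Scroll until the item count stops growing, or until maxTime elapses.
// The three callbacks are hypothetical placeholders for real page operations.
async function scrollUntilStable({ getItemCount, scrollDown, wait, maxTime = 60000 }) {
    const startedAt = Date.now();
    let itemCount = await getItemCount();

    for (;;) {
        // Safety timeout to prevent an infinite loop
        if (Date.now() - startedAt > maxTime) return itemCount;

        await scrollDown();
        await wait(1000);

        const current = await getItemCount();
        if (current === itemCount) return itemCount; // no new items -> stop
        itemCount = current;
    }
}
```

Because the page operations are injected, the loop can be exercised locally with fakes before wiring it to a real page.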

1 Answer

In this particular case, you can check for the loading spinner's visibility after scrolling, instead of trying to count the number of items.

By changing your code a bit, you can make it like this:

async function pageFunction(context) {
    // Establishing utility constants to use throughout the code
    const { request, log, skipLinks } = context;
    const $ = context.jQuery;
    const pageTitle = $('title').first().text();
    context.log.info('Wait for website to render');
    // wait for initial listing
    await context.waitFor('.itemGrid'); 

    context.log.info(`Scraping URL: ${context.request.url}`);

    let tries = 5; // keep track of the load spinner being invisible on the page
    const results = new Map(); // this ensures you only get unique items
   
    while (true) { // eslint-disable-line
        log.info(`INFINITE SCROLL --- ${results.size} initial items loaded ---`);
        // when the style is set to "display: none", it's hidden aka not loading any new items
        const hasLoadingSpinner = $('.itemLoader[style*="none"]').length === 0; 

        if (!hasLoadingSpinner && tries-- < 0) {
            break;
        }
        
        // scroll to page end, you can adjust the offset if it's not triggering the infinite scroll mechanism, like `document.body.scrollHeight * 0.8`
        scrollTo({ top: document.body.scrollHeight });

        $(".itemGrid").each(function() {
            const $this = $(this);

            results.set($this.find('#upcProducto').attr('value'), {
                name: $this.find('.upcName').text(),
                product_url: $this.find('.nombreProductoDisplay').attr('href'),
                image_url: $this.find('.lazyload').data('original'),
                description: $this.find('.block-with-text').text(),
                price: $this.find('.upcPrice').text()
            });
        });
      
        // because of the `tries` variable, this will effectively wait at least 5 seconds to consider it not loading anymore
        await context.waitFor(1000);       
        // scroll to top, sometimes scrolling past the end of the page does not trigger the "load more" mechanism of the page
        scrollTo({ top: 0 }); 
    }

    return [...results.values()];
}

This method also works for virtual pagination, like React Virtual lists or Twitter results, which remove DOM nodes when they are not in the viewport.
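The reason it survives DOM-node removal is the accumulate-into-a-Map pattern: items are keyed by a stable id, so scraping the same item twice just overwrites the same entry. A stripped-down illustration (the plain objects stand in for scraped rows):

```javascript
// Deduplicate items across scroll iterations by keying on a stable id.
const results = new Map();

function collect(batch) {
    for (const item of batch) {
        results.set(item.id, item); // same id -> same Map slot, no duplicates
    }
}

// Two overlapping "scroll batches", as seen on a virtualized list:
collect([{ id: 'a', price: 10 }, { id: 'b', price: 20 }]);
collect([{ id: 'b', price: 20 }, { id: 'c', price: 30 }]);

console.log(results.size); // 3 unique items
```

In the answer's code, the `#upcProducto` value plays the role of `item.id`.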

Using timeouts is very brittle: depending on how fast or slow your scraper is working, your results will vary. You need a clear indication that the page is no longer delivering new items.
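One way to act on a real signal instead of a fixed sleep is a generic polling helper that resolves as soon as a predicate becomes true and rejects if it never does. This is a sketch, not an Apify API; the predicate you pass would be something like the spinner check from the answer:

```javascript
// Poll `predicate` every `interval` ms; resolve when it returns true,
// reject if `timeout` ms pass without it ever becoming true.
function waitForCondition(predicate, { timeout = 10000, interval = 250 } = {}) {
    const startedAt = Date.now();
    return new Promise((resolve, reject) => {
        const check = async () => {
            if (await predicate()) return resolve();
            if (Date.now() - startedAt > timeout) {
                return reject(new Error('Condition not met in time'));
            }
            setTimeout(check, interval);
        };
        check();
    });
}
```

Usage inside the page function might look like `await waitForCondition(() => $('.itemLoader[style*="none"]').length > 0)`, i.e. "wait until the spinner is hidden".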

You can also keep track of `document.body.scrollHeight`, as it changes when new items are added.
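That idea can be sketched as a loop that stops once the page height has been stable for a few consecutive checks. `getHeight` and `scrollAndWait` are hypothetical callbacks standing in for `document.body.scrollHeight` and the scroll-then-wait step:

```javascript
// Keep scrolling while the page keeps growing; stop after the height
// has been unchanged for `patience` consecutive checks.
async function scrollWhileGrowing({ getHeight, scrollAndWait, patience = 3 }) {
    let lastHeight = await getHeight();
    let stableChecks = 0;

    while (stableChecks < patience) {
        await scrollAndWait();
        const height = await getHeight();
        if (height === lastHeight) {
            stableChecks += 1; // no growth -> count toward giving up
        } else {
            stableChecks = 0;  // page grew -> reset and keep scrolling
            lastHeight = height;
        }
    }
    return lastHeight;
}
```

The `patience` counter plays the same role as the `tries` variable in the answer's code: a few quiet checks in a row are treated as "no more items".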

pocesar
  • Hi, thank you very much for the answer. I am still having trouble with the scroller. Although the code works and gets the first 20 results of the webpage, it repeats the loop several times without actually loading more results (93 should appear). It seems like either the scroll is not working, or the results are not loading correctly. Is there a way of debugging this? – Manuel Jiménez Aug 10 '21 at 19:26