How do I use Apify to generate a full list of URLs for scraping from an index page where items are added in sequential batches as the user scrolls toward the bottom? In other words, it's dynamic loading / infinite scroll, not triggered by a button click.
Specifically, on this page - https://www.provokemedia.com/agency-playbook - I cannot make the scraper identify anything beyond the initially-displayed 13 entries.
This element appears at the bottom of each segment, with display: none changing to display: block at every segment addition. No "style" attribute is visible in the raw source; it only shows up in the DevTools Inspector.
<div class="text-center" id="loader" style="display: none;">
    <h5>Loading more ...</h5>
</div>
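In the DevTools Console, the toggle can be watched directly; this snippet is just an illustration of what I observe, not part of the scraper...

const loader = document.querySelector('#loader');
// "none" while idle, "block" while a new batch is loading
console.log(getComputedStyle(loader).display);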
Here is my basic setup for the Web Scraper actor...
Start URLs:
https://www.provokemedia.com/agency-playbook
{
"label": "START"
}
Link selector:
div.agencies div.column a
Pseudo URLs:
https://www.provokemedia.com/agency-playbook/agency-profile/[.*]
{
"label": "DETAIL"
}
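As a sanity check, counting matches for that link selector in the DevTools Console shows only the initial entries before any scrolling...

// Run in the DevTools Console; returns 13 before any scrolling.
document.querySelectorAll('div.agencies div.column a').length;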
Page function:
async function pageFunction(context) {
    const { request, log, skipLinks } = context;
    // request: holds info about current page
    // log: logs messages to console
    // skipLinks: don't enqueue matching Pseudo Links on current page
    // >> cf. https://docs.apify.com/tutorials/apify-scrapers/getting-started#new-page-function-boilerplate

    // *********************************************************** //
    // START page                                                  //
    // *********************************************************** //
    if (request.userData.label === 'START') {
        log.info('Store opened!');
        // Do some stuff later.
    }

    // *********************************************************** //
    // DETAIL page                                                 //
    // *********************************************************** //
    if (request.userData.label === 'DETAIL') {
        log.info(`Scraping ${request.url}`);
        await skipLinks();
        // Do some scraping.
        return {
            // Scraped data.
        };
    }
}
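For context, I intend the DETAIL branch to eventually return something like this sketch - the selectors and field names are placeholders I haven't tested yet, and it assumes the "Inject jQuery" option is on...

// Placeholder sketch only - selectors and field names are illustrative, not final.
const $ = context.jQuery;
return {
    url: request.url,
    name: $('h1').first().text().trim(),                   // hypothetical selector
    description: $('.profile-description').text().trim(),  // hypothetical selector
};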
Presumably, inside the START block, I need to reveal the whole list so that more than just the initial 13 links get enqueued.
I have read through Apify's docs, including the section on "Waiting for dynamic content". await waitFor('#loader'); seemed like a good bet.
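If I read the docs right, waitFor() accepts a CSS selector, a time in milliseconds, or a predicate function, roughly...

// My understanding of the waitFor() variants from the docs:
await waitFor('#loader'); // wait for a selector to appear
await waitFor(2000);      // wait a fixed number of milliseconds
// wait for a predicate, e.g. until more than the initial 13 links exist
await waitFor(() => document.querySelectorAll('div.agencies div.column a').length > 13);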
I added the following to the START portion...
let timeoutMillis; // undefined
const loadingThing = '#loader';

while (true) {
    log.info('Waiting for the "Loading more" thing.');
    try {
        // Default timeout first time.
        await waitFor(loadingThing, { timeoutMillis });
        // 2 sec timeout after the first.
        timeoutMillis = 2000;
    } catch (err) {
        // Ignore the timeout error.
        log.info('Could not find the "Loading more thing", '
            + 'we\'ve reached the end.');
        break;
    }
    log.info('Going to load more.');
    // Scroll to bottom, to expose more
    // $(loadingThing).click();
    window.scrollTo(0, document.body.scrollHeight);
}
But it didn't work...
2021-01-08T23:24:11.186Z INFO Store opened!
2021-01-08T23:24:11.189Z INFO Waiting for the "Loading more" thing.
2021-01-08T23:24:11.190Z INFO Could not find the "Loading more thing", we've reached the end.
2021-01-08T23:24:13.393Z INFO Scraping https://www.provokemedia.com/agency-playbook/agency-profile/gci-health
Unlike other web pages, this page does not scroll to the bottom when I manually enter window.scrollTo(0, document.body.scrollHeight); into the DevTools Console.
However, when executed manually in the Console, this version with a small delay - setTimeout(function(){window.scrollBy(0,document.body.scrollHeight)}, 1); - as found in this question - does jump to the bottom each time. But if I replace the last line of the while loop above with that setTimeout line, the loop still logs that it could not find the element.
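Concretely, the end of the loop then reads...

    log.info('Going to load more.');
    // Scroll to bottom with the same small delay that works in the Console.
    setTimeout(function(){ window.scrollBy(0, document.body.scrollHeight); }, 1);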
Am I misusing these methods? I'm not sure which way to turn.