How do you scrape a web page with infinite scrolling when the response is text/html instead of JSON?
My first try was using Rule and LinkExtractor, which gets me around 80% of the job URLs:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class JobsetSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['jobs.et']
    start_urls = ['https://jobs.et/jobs/']

    rules = (
        Rule(LinkExtractor(allow=r'https://jobs\.et/job/\d+/'), callback='parse_link'),
        Rule(LinkExtractor(), follow=True),
    )

    def parse_link(self, response):
        yield {
            'url': response.url
        }
My second attempt was to follow the example from Scraping Infinite Scrolling Pages, but the paginated response there is JSON, while here it is text/html.
When "load more" button clicked, i can see from Network on Chrome Developer tool the request url
https://jobs.et/jobs/?searchId=1509738711.5142&action=search&page=2
while the "page" number increase.
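What I have in mind is something like the sketch below: request that paginated endpoint directly, incrementing page, and parse the returned HTML fragment with selectors instead of json.loads(). The reuse of the searchId value, the XPath for the job links, the spider name, and the stop condition (a page with no job links) are all assumptions on my part:

import scrapy


class JobsPageSpider(scrapy.Spider):
    # Hypothetical spider, separate from the CrawlSpider above.
    name = 'jobs_pages'
    allowed_domains = ['jobs.et']
    # searchId copied from the DevTools request; assuming it stays valid
    # across requests.
    search_id = '1509738711.5142'

    def start_requests(self):
        yield self.page_request(1)

    def page_request(self, page):
        url = 'https://jobs.et/jobs/?searchId={}&action=search&page={}'.format(
            self.search_id, page)
        return scrapy.Request(url, callback=self.parse, meta={'page': page})

    def parse(self, response):
        # The response is an HTML fragment, so use selectors rather than
        # json.loads(). The XPath below is a guess at the job-link markup.
        job_links = response.xpath('//a[contains(@href, "/job/")]/@href').extract()
        for href in job_links:
            yield {'url': response.urljoin(href)}

        # Keep requesting the next page until one comes back with no job links.
        if job_links:
            yield self.page_request(response.meta['page'] + 1)

The idea is just to mirror what the "load more" button sends, but I am not sure this is the right way to do it.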
My questions are:

- How do I extract the above URL from the response with Scrapy when the "load more" button is clicked?
- Is there a better way to approach this problem?