How do you scrape a web page with infinite scrolling when the response is text/html instead of JSON?
My first try was using Rule and LinkExtractor, which gets me around 80% of the job URLs:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class JobsetSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['jobs.et']
    start_urls = ['https://jobs.et/jobs/']

    rules = (
        Rule(LinkExtractor(allow=r'https://jobs\.et/job/\d+/'), callback='parse_link'),
        Rule(LinkExtractor(), follow=True),
    )

    def parse_link(self, response):
        yield {
            'url': response.url
        }
My second attempt was to follow the example from Scraping Infinite Scrolling Pages, but the paginated response there is JSON, while here it is text/html.
When "load more" button clicked, i can see from Network on Chrome Developer tool the request url
https://jobs.et/jobs/?searchId=1509738711.5142&action=search&page=2
while the "page" number increase.
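What I have in mind is something like the sketch below: request that paginated endpoint directly, incrementing page, and parse the returned HTML fragment with selectors instead of json.loads(). The reuse of the searchId value, the XPath for the job links, the spider name, and the stop condition (a page with no job links) are all assumptions on my part:

import scrapy


class JobsPageSpider(scrapy.Spider):
    # Hypothetical spider, separate from the CrawlSpider above.
    name = 'jobs_pages'
    allowed_domains = ['jobs.et']
    # searchId copied from the DevTools request; assuming it stays valid
    # across requests.
    search_id = '1509738711.5142'

    def start_requests(self):
        yield self.page_request(1)

    def page_request(self, page):
        url = 'https://jobs.et/jobs/?searchId={}&action=search&page={}'.format(
            self.search_id, page)
        return scrapy.Request(url, callback=self.parse, meta={'page': page})

    def parse(self, response):
        # The response is an HTML fragment, so use selectors rather than
        # json.loads(). The XPath below is a guess at the job-link markup.
        job_links = response.xpath('//a[contains(@href, "/job/")]/@href').extract()
        for href in job_links:
            yield {'url': response.urljoin(href)}

        # Keep requesting the next page until one comes back with no job links.
        if job_links:
            yield self.page_request(response.meta['page'] + 1)

The idea is just to mirror what the "load more" button sends, but I am not sure this is the right way to do it.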
My questions are:

- How do I extract the above URL from the response with Scrapy when the "load more" button is clicked?
- Is there a better way to approach this problem?