I'm trying to scrape data that is loaded after an AJAX request.
For example, the first 30 videos on this YouTube page are present in the HTML; the user must then click a "load more" button, which triggers an AJAX request and fetches more results: https://www.youtube.com/user/testedcom/videos
I can get the AJAX link, but what is the best way to pull the remaining data / 'paginate' with Scrapy features?
start shell:
scrapy shell https://www.youtube.com/user/testedcom/videos
get url for ajax continuation:
continuation_url = response.xpath('//*[@class="yt-uix-button yt-uix-button-size-default yt-uix-button-default load-more-button yt-uix-load-more browse-items-load-more-button"]/@data-uix-load-more-href').extract()[0]
url = response.urljoin(continuation_url)  # resolve the href against the page URL instead of concatenating strings
get new data from ajax call:
fetch(url)
...but from here I'm not sure what to do with the data. It's not in the same format as the original response from running scrapy shell. It doesn't quite seem to load as JSON. I assume Scrapy has something specifically for this, but I can't find it in the docs.
Edit: I can get the HTML content by doing:
import json
response_json = json.loads(response.body_as_unicode())
html = response_json['content_html']
But then I would have to use regular expressions to pull the desired data out of this unicode string, instead of the built-in XPath selectors, which are much more convenient.