
I'm trying to scrape data that is loaded after an ajax request.

For example, the first 30 videos on this YouTube page are present in the HTML, and then the user must click a "load more" button, which triggers an ajax request that fetches more results. https://www.youtube.com/user/testedcom/videos

I can get the ajax link, but what is the best way to pull remaining data / 'paginate' with Scrapy features?

start shell:

scrapy shell https://www.youtube.com/user/testedcom/videos

get url for ajax continuation:

continuation_url = response.xpath('//*[@class="yt-uix-button yt-uix-button-size-default yt-uix-button-default load-more-button yt-uix-load-more browse-items-load-more-button"]/@data-uix-load-more-href').extract()[0]
url = "https://www.youtube.com/user/testedcom/videos" + continuation_url
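Plain string concatenation works when the href happens to be root-relative, but urljoin from the stdlib resolves relative, root-relative, and absolute hrefs alike. A small sketch (the continuation value below is a made-up example; the real one is an opaque token):

```python
from urllib.parse import urljoin

base = "https://www.youtube.com/user/testedcom/videos"
# Hypothetical value of @data-uix-load-more-href -- the real one is opaque.
continuation_url = "/browse_ajax?action_continuation=1&continuation=EXAMPLE"

# urljoin resolves the href against the page URL correctly in every case.
url = urljoin(base, continuation_url)
print(url)  # https://www.youtube.com/browse_ajax?action_continuation=1&continuation=EXAMPLE
```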

get new data from ajax call:

fetch(url)

...but from here I'm not sure what to do with the data. It's not in the same format as the original response from running scrapy shell. It doesn't quite seem to load as JSON. I assume scrapy has something specifically for this but can't find it in the docs.

edit: I can get the html content by doing:

import json
response_json = json.loads(response.body_as_unicode())
html = response_json['content_html']

but then I would have to use regular expressions to pull the desired data out of this unicode string instead of the built-in xpath selectors, which are much more convenient.
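To illustrate the shape of the ajax response: it is a JSON envelope, and the markup under content_html is ordinary HTML that any parser can take over from there. A stdlib-only sketch with a made-up payload (the answers show the Scrapy Selector equivalent, which keeps xpath available):

```python
import json
from html.parser import HTMLParser

# Made-up stand-in for the ajax response body; the real one comes from YouTube.
body = json.dumps({
    "content_html": '<li class="feed-item-container"><h3><a href="/watch?v=abc">Demo video</a></h3></li>',
    "load_more_widget_html": '<button class="load-more-button" data-uix-load-more-href="/next"></button>',
})

payload = json.loads(body)      # the ajax response is JSON, not HTML
html = payload["content_html"]  # the actual markup lives under this key

class HrefCollector(HTMLParser):
    """Collect hrefs -- stands in for Selector(text=html).xpath('//a/@href')."""
    def __init__(self):
        super().__init__()
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href")

parser = HrefCollector()
parser.feed(html)
print(parser.hrefs)  # ['/watch?v=abc']
```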

Would prefer not to use Selenium or another add-on like in this solution. Speed and simplicity are the priority.

ProGirlXOXO

2 Answers


Here is the documentation of Scrapy Selector: http://doc.scrapy.org/en/1.1/topics/selectors.html

I ran into the same problem and solved it with Selector. You can construct a Selector from a response or from a string, and then use xpath on it.

Also, you can use try...except to tell whether the response is HTML or JSON:

import json

import scrapy
from scrapy.selector import Selector


def parse(self, response):
    # The first page is plain HTML; ajax continuations are JSON with the
    # markup wrapped under 'content_html'.
    try:
        jsonresponse = json.loads(response.body_as_unicode())
        html = jsonresponse['content_html'].strip()
        sel = Selector(text=html)
    except ValueError:
        jsonresponse = None
        sel = Selector(response=response)

    entries = sel.xpath(
        '//li[contains(@class,"feed-item-container")]')
    for entry in entries:
        title = entry.xpath('.//h3/a/text()').extract_first()
        if not title:
            continue
        item = YoutubeItem()
        item['title'] = title
        yield item

    # The "load more" button lives under 'load_more_widget_html' in the
    # JSON responses, and directly in the page on the first request.
    if jsonresponse is not None:
        sel = Selector(text=jsonresponse['load_more_widget_html'])
    else:
        sel = Selector(response=response)
    next_href = sel.xpath(
        '//button[contains(@class,"load-more-button")]'
        '/@data-uix-load-more-href').extract_first()
    if next_href:
        yield scrapy.Request("https://www.youtube.com" + next_href,
                             callback=self.parse)
    else:
        self.log('Crawl completed.')
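The HTML-vs-JSON dispatch at the top of parse can be checked in isolation. A stdlib-only sketch of that logic (both response bodies below are made up):

```python
import json

def unwrap(body):
    """Return the HTML to parse: content_html for ajax JSON, the body as-is otherwise."""
    try:
        payload = json.loads(body)
    except ValueError:          # not JSON -> first page, plain HTML
        return body
    return payload["content_html"]

print(unwrap('{"content_html": "<li>x</li>"}'))  # <li>x</li>
print(unwrap("<html><body></body></html>"))      # <html><body></body></html>
```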
Irmo

After obtaining the html content, you can initialize a Selector object in order to use the xpath selectors:

from scrapy import Request
from scrapy.selector import Selector
import json

# Inside your spider's callback:
response_json = json.loads(response.body_as_unicode())
html = response_json['content_html']
sel = Selector(text=html)
for url in sel.xpath('//@href').extract():
    yield Request(url, callback=self.somecallbackfunction)
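The hrefs in content_html are relative /watch links; if you want the video id itself rather than the URL, urllib.parse can pull it from the query string. A sketch (the href below is hypothetical):

```python
from urllib.parse import urlparse, parse_qs

href = "/watch?v=dQw4w9WgXcQ"  # hypothetical href as found in content_html
video_id = parse_qs(urlparse(href).query)["v"][0]
print(video_id)  # dQw4w9WgXcQ
```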
Arijit C