Scraping JSON data from XHR response

Question

I am trying to scrape some information from this page: https://salesforce.wd1.myworkdayjobs.com/en-US/External_Career_Site/job/United-Kingdom---Wales---Remote/Enterprise-Account-Executive-Public-Sector_JR65970

When the page loads and I look at the XHR requests, the response tab for that URL delivers the info I'm looking for in JSON format. But if I call json.loads(response.body.decode('utf-8')) on that page, I don't get the data I'm looking for, because the page is populated by JavaScript. Is it possible to just pull that JSON data from the page somehow? Screenshot of what I'm looking at below.

AaronS · Accepted Answer · 2020-07-05T12:51:23.470

I saw this post on r/scrapy and thought I'd answer here.

It's always best to try to replicate the underlying requests when it comes to JSON data. The page fetches the JSON from the website's server with its own HTTP request, so if we make the same request ourselves we can get the response we want.
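To illustrate the idea outside Scrapy, here is a minimal standard-library sketch. The helper names `build_json_request` and `fetch_json` are made up for this example, and it assumes the server returns JSON when sent the `Accept` value that appears in the spider's headers further down.

```python
import json
import urllib.request


def build_json_request(url):
    """Build (but do not send) a GET request that asks the server for JSON."""
    return urllib.request.Request(
        url,
        # Assumption: this Accept value, copied from the dev tools headers,
        # is what makes the server answer with JSON instead of HTML.
        headers={"Accept": "application/json,application/xml"},
    )


def fetch_json(url, timeout=10):
    """Send the request and deserialise the JSON body into Python objects."""
    request = build_json_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Whether the extra headers and cookies are also required depends on the site, so treat this as a starting point and add them if the plain request comes back as HTML.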

Using the dev tools under XHR, you can get the referring URL, headers and cookies. See the images below.

Request url: https://i.stack.imgur.com/5BR8z.jpg

Request headers and cookies: https://i.stack.imgur.com/x1ufM.jpg

Within Scrapy, the Request object lets you specify the URL, in this case the request URL seen in the dev tools. It also lets us specify the headers and cookies, which we can get from the last image.
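The dev tools usually show cookies as a single `Cookie: name=value; name=value` header, while Scrapy's `cookies=` argument accepts a dictionary. A small sketch of the conversion, with `cookie_header_to_dict` being a hypothetical helper name and the sample string built from three of the cookie names used in the spider below:

```python
def cookie_header_to_dict(header):
    """Turn 'a=1; b=2' (a Cookie request header value) into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in header.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # skip empty pieces, e.g. after a trailing semicolon
        # Split on the first '=' only, so values that contain '=' stay intact.
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies


raw = "PLAY_LANG=en-US; timezoneOffset=-60; cdnDown=0"
print(cookie_header_to_dict(raw))
# {'PLAY_LANG': 'en-US', 'timezoneOffset': '-60', 'cdnDown': '0'}
```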

So something like this would work:

    import scrapy


    class TestSpider(scrapy.Spider):
        name = 'test'
        allowed_domains = ['salesforce.wd1.myworkdayjobs.com']
        start_urls = [
            'https://salesforce.wd1.myworkdayjobs.com/en-US/External_Career_Site'
            '/job/United-Kingdom---Wales---Remote/'
        ]

        # Copied from the dev tools. These values come from one browser
        # session and will likely need replacing with your own.
        cookies = {
            'PLAY_LANG': 'en-US',
            'PLAY_SESSION': '5ff86346f3ba312f6d57f23974e3cff020b5c33e-'
                            'salesforce_pSessionId=o3mgtklolr1pdpgmau0tc8nhnv^&instance='
                            'wd1prvps0003a',
            'wday_vps_cookie': '3425085962.53810.0000',
            'TS014c1515': '01560d0839d62a96c0b'
                          '952e23282e8e8fa0dafd17f75af4622d072734673c'
                          '51d4a1f4d3bc7f43bee3c1746a1f56a728f570e80f37e',
            'timezoneOffset': '-60',
            'cdnDown': '0',
        }

        # Request headers as shown in the dev tools for the XHR request.
        headers = {
            'Connection': 'keep-alive',
            'Accept': 'application/json,application/xml',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 '
                          'Safari/537.36',
            'X-Workday-Client': '2020.27.015',
            'Content-Type': 'application/x-www-form-urlencoded',
            'Sec-Fetch-Site': 'same-origin',
            'Sec-Fetch-Mode': 'cors',
            'Sec-Fetch-Dest': 'empty',
            'Referer': 'https://salesforce.wd1.myworkdayjobs.com/en-US'
                       '/External_Career_Site/job/United-Kingdom---Wales---Remote/'
                       'Enterprise-Account-Executive-Public-Sector_JR65970',
            'Accept-Language': 'en-US,en;q=0.9',
        }

        def parse(self, response):
            # Append the job slug to the start URL to get the XHR request URL.
            url = response.url + 'Enterprise-Account-Executive-Public-Sector_JR65970'
            yield scrapy.Request(url=url, headers=self.headers,
                                 cookies=self.cookies, callback=self.start)

        # Callback for the JSON response. On recent Scrapy releases that
        # define their own Spider.start(), rename this method to avoid a clash.
        def start(self, response):
            info = response.json()  # deserialise the JSON body
            print(info)

We specify dictionaries of headers and cookies at the start. We then use the parse function to build the correct URL.

Notice that I used response.url, which gives us the starting URL specified above, and appended the last part of the path to get the request URL shown in the dev tools. This isn't strictly necessary, but it means a little less repeated code.
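As an alternative to string concatenation, the same URL can be built with `urljoin` from the standard library (Scrapy responses also offer `response.urljoin()`). This is only a sketch of the URL handling; the variable names are illustrative.

```python
from urllib.parse import urljoin

base = ("https://salesforce.wd1.myworkdayjobs.com/en-US/External_Career_Site"
        "/job/United-Kingdom---Wales---Remote/")
slug = "Enterprise-Account-Executive-Public-Sector_JR65970"

# Because the base ends with "/", joining appends the slug,
# which matches the plain concatenation used in the spider.
job_url = urljoin(base, slug)
print(job_url == base + slug)

# Without the trailing slash, urljoin replaces the last path segment instead,
# so the trailing "/" on the start URL matters.
print(urljoin(base.rstrip("/"), slug))
```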

We then make a Scrapy Request with the correct headers and cookies and set the callback to another function. There we deserialise the JSON response into a Python object and print it out.

Note that response.json() is a new feature of Scrapy which deserialises JSON into a Python object; see the Scrapy documentation for details.
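For Scrapy versions that predate response.json() (the comment below puts the cutoff at v2.2), a fallback can be sketched like this. `load_json` and `OldStyleResponse` are hypothetical names used only for illustration; the stand-in class mimics a response that has `.text` but no `.json()`.

```python
import json


def load_json(response):
    """Return the response body as Python objects.

    Uses response.json() when the response object provides it, otherwise
    falls back to json.loads(response.text).
    """
    if hasattr(response, "json"):
        return response.json()
    return json.loads(response.text)


class OldStyleResponse:
    """Stand-in for a response object without a json() method."""

    def __init__(self, text):
        self.text = text


print(load_json(OldStyleResponse('{"title": "Account Executive"}')))
# {'title': 'Account Executive'}
```

Inside the spider, the callback would then call `load_json(response)` in place of `response.json()`.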

A great Stack Overflow discussion on replicating AJAX requests in Scrapy is found here.

Note: if you don't have Scrapy v2.2 or later, you won't be able to use the json() method. As an alternative, import json and call json.loads(response.text) in the start function. — AaronS, Jul 05 '20 at 14:59
We discussed this on r/scrapy and were happy with the response. Could you accept this as the official answer (by clicking the tick)? It would help me out! — AaronS, Jul 08 '20 at 07:26

score 0 · Answer 2 · answered Jul 04 '20 at 15:00

To read a JSON response in Scrapy, you can use the following code:

import json
j_obj = json.loads(response.text)  # response.text replaces the deprecated body_as_unicode()

Roman
