I saw this post on r/scrapy and thought I'd answer here.
When it comes to JSON data, it's always best to try to replicate the underlying requests. The JSON is fetched from the website's server on request, so if we make the right HTTP request we get the response we want.
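As an aside, you can experiment with attaching copied headers to a request before writing any spider code. A minimal sketch with the standard library (the URL and header values here are illustrative only, not the exact ones from the dev tools):

```python
import urllib.request

# Headers as copied from the dev tools Network/XHR tab (values illustrative)
headers = {
    'Accept': 'application/json,application/xml',
    'User-Agent': 'Mozilla/5.0',
}

# Build the request object without sending it;
# urllib.request.urlopen(req) would actually fire it.
req = urllib.request.Request(
    'https://salesforce.wd1.myworkdayjobs.com/en-US/External_Career_Site/',
    headers=headers,
)
print(req.get_header('Accept'))
```

If the server returns the JSON you saw in the browser, you know you've copied the right headers and cookies.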
Using the dev tools under the XHR tab, you can get the request URL, headers and cookies. See the images below.
Request url: https://i.stack.imgur.com/5BR8z.jpg
Request headers and cookies: https://i.stack.imgur.com/x1ufM.jpg
Within Scrapy, the Request object lets you specify the URL, in this case the request URL seen in the dev tools. It also lets us specify the headers and cookies, which we can take from the second image.
So something like this should work:
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['salesforce.wd1.myworkdayjobs.com']
    start_urls = ['https://salesforce.wd1.myworkdayjobs.com/en-US/External_Career_Site/job/United-Kingdom---Wales---Remote/']

    # Cookies copied from the dev tools request
    cookies = {
        'PLAY_LANG': 'en-US',
        'PLAY_SESSION': '5ff86346f3ba312f6d57f23974e3cff020b5c33e-salesforce_pSessionId=o3mgtklolr1pdpgmau0tc8nhnv&instance=wd1prvps0003a',
        'wday_vps_cookie': '3425085962.53810.0000',
        'TS014c1515': '01560d0839d62a96c0b952e23282e8e8fa0dafd17f75af4622d072734673c51d4a1f4d3bc7f43bee3c1746a1f56a728f570e80f37e',
        'timezoneOffset': '-60',
        'cdnDown': '0',
    }

    # Headers copied from the dev tools request
    headers = {
        'Connection': 'keep-alive',
        'Accept': 'application/json,application/xml',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
        'X-Workday-Client': '2020.27.015',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'https://salesforce.wd1.myworkdayjobs.com/en-US/External_Career_Site/job/United-Kingdom---Wales---Remote/Enterprise-Account-Executive-Public-Sector_JR65970',
        'Accept-Language': 'en-US,en;q=0.9',
    }

    def parse(self, response):
        # Build the JSON endpoint URL from the start URL
        url = response.url + 'Enterprise-Account-Executive-Public-Sector_JR65970'
        yield scrapy.Request(url=url, headers=self.headers,
                             cookies=self.cookies, callback=self.parse_json)

    def parse_json(self, response):
        # Deserialise the JSON response into a Python object
        info = response.json()
        print(info)
We specify a dictionary of headers and cookies at the class level. We then use the parse method to build the correct URL.
Notice I used response.url, which gives us the start URL specified above, and appended the last part of the URL seen in the dev tools. Not strictly necessary, but it means a little less repeated code.
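An equivalent, slightly more robust way to join the two parts is urljoin, which is what Scrapy's own response.urljoin() wraps. A minimal sketch with the standard library, reusing the job URL from above:

```python
from urllib.parse import urljoin

base = ('https://salesforce.wd1.myworkdayjobs.com/en-US/'
        'External_Career_Site/job/United-Kingdom---Wales---Remote/')

# Because the base ends with '/', urljoin simply appends the relative part
url = urljoin(base, 'Enterprise-Account-Executive-Public-Sector_JR65970')
print(url)
```

Inside the spider you would write response.urljoin('Enterprise-Account-Executive-Public-Sector_JR65970'), which handles a missing trailing slash gracefully where plain string concatenation would not.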
We then make a scrapy.Request with the correct headers and cookies and ask for the response to be called back to another method. There we deserialise the JSON response into a Python object and print it out.
Note that response.json() is a relatively new feature of Scrapy which deserialises JSON into a Python object, see here for details.
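On older Scrapy versions without response.json(), the standard library gives the same result via json.loads on the response text. A minimal sketch, with the JSON body hard-coded here as an example stand-in for response.text:

```python
import json

# Example body, standing in for response.text on an older Scrapy version
body = '{"title": "Enterprise Account Executive", "remote": true}'

# Same Python object that response.json() would return
info = json.loads(body)
print(info['title'])
```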
A great Stack Overflow discussion on replicating AJAX requests in Scrapy can be found here.