
I am trying to scrape data from a website. I was in the process of using the Scrapy framework to build a spider to iterate over various pages and collect the data from the relevant XPaths, but I realised that all the data I want is actually being delivered in responses in the form of JSON (see image below), and these responses are being used to render the page. My understanding is that the JSON is part of an XHR request. I am wondering if there is a way to intercept or copy these JSON/XHR requests rather than build a spider that has to navigate the fully assembled page?

I am not expecting a full solution to be posted, but I would be quite content with a pointer to the correct framework or other relevant learning resource so I can study further. I have only been programming for four months so far.

[Screenshot: Google Dev Tools network panel showing the JSON/XHR responses]
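To make the question concrete, here is a rough sketch of what I mean by working with the JSON directly: once you have the body an XHR response returns, parsing it gives you plain Python objects with no XPaths involved. The payload and field names below are invented for illustration:

```python
import json

# Invented example of the kind of JSON body an XHR response carries.
payload = '{"items": [{"name": "widget", "price": 9.99}], "page": 1}'

data = json.loads(payload)

# Once parsed, the values are ordinary Python objects - no XPath needed.
for item in data["items"]:
    print(item["name"], item["price"])
```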

Avlento
  • Those XHR requests are made by JavaScript on the page, so you have to either know the XHR URLs in advance and scrape those, or use a web browser, load the page, and then scrape the content. For the latter, you could use something like Selenium https://selenium-python.readthedocs.io/getting-started.html – luis.parravicini Nov 14 '19 at 11:41
  • Thanks Luis. I am aware of Selenium but have started down the path of using Scrapy for the web scraping now. I was just trying to understand what those XHR/JSON responses were and whether there is an easier way to do it. I gathered there must be JavaScript on the page performing an AJAX request on a timer, as new JSON responses keep being delivered. I was just wondering if the XHR could be intercepted rather than having to scrape the XPaths. You have answered that question though, so thank you. Do you know of any good resources to study XHRs in depth? – Avlento Nov 14 '19 at 13:54
  • Sometimes it's a good idea to check which requests were sent to the server. Usually you will find the pattern quite fast. – Thomas Strub Nov 14 '19 at 15:55
  • I have been hard at it today trying to get Scrapy to search for XPaths by class, but Scrapy can't identify some of the classes that are essential to me and returns an empty array. I suspect it's to do with the content being delivered via AJAX. I finally came across this post that seems to be related: https://stackoverflow.com/questions/8550114/can-scrapy-be-used-to-scrape-dynamic-content-from-websites-that-are-using-ajax . I will update this thread on my progress. – Avlento Nov 14 '19 at 16:42
  • https://docs.scrapy.org/en/latest/topics/dynamic-content.html – Gallaecio Nov 18 '19 at 11:01

0 Answers