
I am trying to scrape this webpage (for educational purposes).

When I extract the XPath and try it in the browser's element inspector, it works. For example, to get the address, I use the XPath below:

//div[@class="address-coords"]/div[@class="address"]/p/span[@itemprop="address"]

Meanwhile, in the Scrapy shell, it does not work:

$ scrapy shell 'https://cloud.baladovore.com/map/sNRgAcGKiY' -s USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'

In [5]: response.xpath('//div[@class="address-coords"]/div[@class="address"]/p/span[@itemprop="address"]').getall()

Out[5]: []

I get an empty list, although the response status is 200:

In [6]: response
Out[6]: <200 https://cloud.baladovore.com/map/008jPJuORI>

I have already tried all the suggestions I found on the Internet, like changing the user agent, setting ROBOTSTXT_OBEY to False, and increasing the download delay. I would really appreciate it if someone could help me solve this problem, since I have been working on it for days.

  • Possible duplicate of [Can scrapy be used to scrape dynamic content from websites that are using AJAX?](https://stackoverflow.com/questions/8550114/can-scrapy-be-used-to-scrape-dynamic-content-from-websites-that-are-using-ajax) – Chillie Jul 12 '19 at 09:55

1 Answer


If you use the scrapy shell to look at the response's content (with response.body) you'll see that the server responds with a small page full of scripts that are then executed.
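For example, still in the scrapy shell, a quick check along these lines (illustrative, not verbatim output) shows that the markup you expect is simply not in the raw HTML:

'itemprop="address"' in response.text    # False: the address span is never in the raw HTML
response.xpath('//script').getall()[:3]  # the body is mostly <script> tags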

So you either need a way to run JavaScript with Scrapy, or to query the server directly to get the results. Using the browser's dev tools (Network tab) is a common way to inspect those queries (as described in the linked answer).

Another solution is to use Selenium to simulate a full browser.
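A minimal sketch of that approach (assuming ChromeDriver is installed; the 10-second wait and the shortened XPath are illustrative):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://cloud.baladovore.com/map/sNRgAcGKiY')
# Wait until the page's scripts have rendered the address span
address = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//span[@itemprop="address"]'))
)
print(address.text)
driver.quit()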

Edit 1: You need to go further than just https://cloud.baladovore.com/parse/classes/Address.

If you inspect the request, you'll see that it not only requests that page, but also supplies additional information:

Request URL: https://cloud.baladovore.com/parse/classes/Address
Request Method: POST
Request Payload: {"where":{"objectId":"sNRgAcGKiY"},"limit":1,"_method":"GET","_ApplicationId":"cB4rsS2KbFIG5IQyjJv0XaDC8M28e0YDu58SaolX","_JavaScriptKey":"eDoqTmoIS6Ofpf0OAgNdYKGm9TBs2fVv9MR8lS5u","_ClientVersion":"js1.6.14","_InstallationId":"02f7b7dd-31c7-b235-df1d-93c323dbcd60"}

Let's try simulating that with requests:

import requests

# Payload copied from the browser's Network tab
access_data = {
    "where": {"objectId": "sNRgAcGKiY"},
    "limit": 1,
    "_method": "GET",
    "_ApplicationId": "cB4rsS2KbFIG5IQyjJv0XaDC8M28e0YDu58SaolX",
    "_JavaScriptKey": "eDoqTmoIS6Ofpf0OAgNdYKGm9TBs2fVv9MR8lS5u",
    "_ClientVersion": "js1.6.14",
    "_InstallationId": "02f7b7dd-31c7-b235-df1d-93c323dbcd60",
}
url = 'https://cloud.baladovore.com/parse/classes/Address'
test_req = requests.post(url, json=access_data)
print(test_req.status_code)
print(test_req.json())

This prints the decoded JSON response, which you can then work with.
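From there, a sketch of pulling fields out of the decoded data (Parse Server normally wraps rows in a "results" list, but the address field name below is a guess, so check the actual payload):

data = test_req.json()
# "results" is the usual Parse Server wrapper; "address" is a guessed field name
for result in data.get('results', []):
    print(result.get('objectId'), result.get('address'))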

I do not know the properties of _JavaScriptKey (for instance, whether it expires), so you will need to look into that.

If you insist on using Scrapy, you will need to read the documentation on how to set request bodies.
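For reference, a rough sketch of the same POST done in Scrapy with JsonRequest (available from Scrapy 1.7; the payload is the one captured above):

import json

import scrapy
from scrapy.http import JsonRequest


class AddressSpider(scrapy.Spider):
    name = 'address'

    def start_requests(self):
        # Same payload as in the requests example above
        payload = {
            "where": {"objectId": "sNRgAcGKiY"},
            "limit": 1,
            "_method": "GET",
            "_ApplicationId": "cB4rsS2KbFIG5IQyjJv0XaDC8M28e0YDu58SaolX",
            "_JavaScriptKey": "eDoqTmoIS6Ofpf0OAgNdYKGm9TBs2fVv9MR8lS5u",
            "_ClientVersion": "js1.6.14",
            "_InstallationId": "02f7b7dd-31c7-b235-df1d-93c323dbcd60",
        }
        yield JsonRequest(
            'https://cloud.baladovore.com/parse/classes/Address',
            data=payload,
        )

    def parse(self, response):
        # The response body is JSON, not HTML, so decode it instead of using XPath
        yield json.loads(response.text)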

Chillie
  • I have already tried that way, but it didn't work. Here is the link I got: https://cloud.baladovore.com/parse/classes/Address Yet, unfortunately, it gives error 403 in the scrapy shell, and when I try it in the browser, it gives: {"error":"unauthorized"} – Th3FreeSpirit Jul 12 '19 at 10:15