
I'm trying to request a URL with data encoded in base64 on it, like so:

http://www.somepage.com/es_es/bla_bla#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ==

What I do is build a JSON object, encode it as base64, and append it to a URL like this:

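import base64
import json

new_data = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1}, "config": {"page": 2}}
json_data = json.dumps(new_data)
# On Python 3, b64encode() takes bytes and returns bytes, so encode first and decode the result
new_url = "http://www.somepage.com/es_es/bla_bla#" + base64.b64encode(json_data.encode("utf-8")).decode("ascii")
yield scrapy.Request(url=new_url, callback=self.parse)  # inside a spider method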

The problem is that Scrapy only crawls this part of the URL, http://www.somepage.com/es_es/bla_bla, without the encoded data appended to it. However, if I paste new_url into the browser, it shows me the result I want, with the encoded data applied!

I don't know what's happening. Can anyone give me a hand?

wj127
  • The query fragment (part of the URL after `#`), only applies to browsers. Servers ignore that part of a URL. – Martijn Pieters Sep 03 '17 at 15:44
  • Javascript code loaded by the browser is free to use the query fragment to alter behaviour. E.g. a script may use that part to load AJAX data or alter the page in some way. That's all client-side, and has nothing to do with what the server sends the browser. – Martijn Pieters Sep 03 '17 at 15:45
  • so, there's no way to achieve the same on the server side?! @MartijnPieters – wj127 Sep 03 '17 at 15:52
  • You'd have to simulate a full browser, if there is JS code to be executed. Selenium is the go-to option for that. – Martijn Pieters Sep 03 '17 at 15:53
  • I've been reading a while about Selenium, but I'm a bit confused about how to use it to solve my problem...do you know how to do it?! @MartijnPieters – wj127 Sep 03 '17 at 17:02
  • Sorry, I've not used Selenium in years now. – Martijn Pieters Sep 03 '17 at 17:48
  • ok, no problem, thank you so much anyway for the info you gave me. At least I know a bit more about my problem! @MartijnPieters – wj127 Sep 03 '17 at 21:11

1 Answer


After a lot of searching, I read that this kind of URL, one with a # followed by data at the end (i.e. my URL http://www.somepage.com/es_es/bla_bla#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ==), is called a fragment URL. The fragment basically indicates a location within a resource, like an anchor (you can read about it here).
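To make that concrete, here is a minimal sketch using only Python's standard library (not Scrapy itself) that splits my URL into the part that can be requested from the server and the fragment that stays on the client. The variable names are just for illustration.

```python
import base64
import json
from urllib.parse import urldefrag

url = (
    "http://www.somepage.com/es_es/bla_bla#"
    "eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="
)

# urldefrag() separates the URL from everything after the '#'
request_url, fragment = urldefrag(url)

# The fragment is just the base64-encoded JSON; only client-side code ever sees it
payload = json.loads(base64.b64decode(fragment))

print(request_url)  # the only part an HTTP request can carry
print(payload)
```

This would explain why the crawl only ever hits the bare page: everything after the # never reaches the server.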

Then, thanks to this post, I learned that the content behind the fragment has to be loaded by the page itself: the website makes its own outgoing requests to get that data. So I looked for those outgoing requests using Firefox Developer Edition (any other tool that shows these requests, such as Tamper Data, works too) and built the URL that returns the HTML content I was looking for.

# The JSON data, encoded as base64, goes after 'searchRequest=' instead of after the '#', and voilà!
"http://www.somewebsite.es/?controller=ajaxresults&action=getresults&searchRequest=eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6N30sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="
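As a rough sketch of how that URL could be built in code rather than by hand: the endpoint and the parameter names below are copied from the URL above, while build_search_url is a hypothetical helper name of mine. Note that urlencode() percent-encodes the trailing "==" padding as %3D%3D, which a server decodes back as usual; I have not checked how this particular site treats it.

```python
import base64
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint taken from the outgoing request seen in the developer tools
AJAX_ENDPOINT = "http://www.somewebsite.es/"


def build_search_url(payload):
    """Hypothetical helper: return the AJAX URL carrying payload as base64-encoded JSON."""
    compact_json = json.dumps(payload, separators=(",", ":"))
    token = base64.b64encode(compact_json.encode("utf-8")).decode("ascii")
    query = urlencode({
        "controller": "ajaxresults",
        "action": "getresults",
        "searchRequest": token,
    })
    return AJAX_ENDPOINT + "?" + query


payload = {
    "data": {"countryId": "ES", "regionId": "920", "duration": 7},
    "config": {"page": "0"},
}
search_url = build_search_url(payload)

# Round-trip check: the query string (unlike a fragment) carries the data in the request
token = parse_qs(urlsplit(search_url).query)["searchRequest"][0]
decoded = json.loads(base64.b64decode(token))

# Inside a spider method this would become:
#     yield scrapy.Request(url=search_url, callback=self.parse)
```

Changing config["page"] in the payload should then be enough to request other result pages, assuming the site pages its results that way.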

I could also have achieved this with the Selenium library, as you can see in this other post, but that isn't the best practice here.

wj127