
I'm scraping a website which loads product data from individual JSON files. I found the URLs to the JSONs by inspecting the network traffic.

The problem is this: when I follow the JSON URLs, most of them return a JSON result. But the JSON URLs of products whose names contain special characters, e.g. é, return a null response. The data is clearly displayed in the browser, but I can't seem to get the JSON response directly.

Any tips?

(I'm trying to find a similar website that acts in the same way so I can post it here for example)

EDIT:

Here is an example

Product A url: https://www.boozebud.com/p/hopnationbrewingco/thedamned

WORKS: A's JSON url: https://www.boozebud.com/a/producturl/p/hopnationbrewingco/thedamned

Product B url: https://www.boozebud.com/p/àbloc/superprestigenaturalblondebeer

RETURNS NULL: B's JSON url: https://www.boozebud.com/a/producturl/p/àbloc/superprestigenaturalblondebeer

(Related to my previous unanswered question, "scrapy: dealing with special characters in url", which might need to be revised in light of this question.)

happyspace
  • A [mcve] is really needed to be able to address this. – Charles Duffy Nov 29 '17 at 22:53
  • Is the "a" with the accent over it supposed to be in the url? – drsnark Nov 29 '17 at 23:54
  • Yes. The original url comes from searching for the product from the home page and the JSON url has been picked up from the network traffic. – happyspace Nov 30 '17 at 00:02
  • My only idea is to use a local proxy server to capture the real request the browser sends to the server. Maybe it uses a different character, or the same character in a different encoding. Or maybe the server checks some headers like `XHR` or `REFERER`. – furas Nov 30 '17 at 01:04

1 Answer

It seems to me that the problem is the headers. The server appears to be very sensitive to at least the Content-Type header, which it may use internally to decode the incoming URL. Try sending the request like this (this is what the site's own JavaScript does):

# Inside a spider callback; needs `from scrapy import Request`
yield Request('https://www.boozebud.com/a/producturl/p/%C3%A0bloc/superprestigenaturalblondebeer',
              headers={"Content-Type": "application/json; charset=UTF-8"})
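
The URL in the request above has the `à` already percent-encoded as UTF-8 (`%C3%A0`). If you are building the JSON URLs from product URLs yourself, a minimal sketch of that encoding step using only the standard library could look like the following. The helper name `encode_url_path` and the constant `JSON_HEADERS` are made up for illustration, and I have not verified whether Scrapy already performs this encoding for you.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url_path(url):
    """Percent-encode non-ASCII characters in the URL path as UTF-8.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    parts = urlsplit(url)
    # safe="/%" leaves path separators and already-encoded sequences untouched,
    # so calling this twice does not double-encode.
    encoded_path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )


# Header taken from the request above; the constant name is an assumption.
JSON_HEADERS = {"Content-Type": "application/json; charset=UTF-8"}

# "\u00e0" is the precomposed character "à" from product B's URL.
raw_url = "https://www.boozebud.com/a/producturl/p/\u00e0bloc/superprestigenaturalblondebeer"
encoded_url = encode_url_path(raw_url)
print(encoded_url)
```

In a spider callback you would then `yield Request(encoded_url, headers=JSON_HEADERS)`. If the server still returns null, comparing this request against the one the browser actually sends (as suggested in the comments) is the next thing to check.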
Wilfredo