
I am trying to figure out what I'm doing wrong here, but I keep getting lost...

In Python 2.7, I'm running the following code:

>>> import requests
>>> req = requests.request('GET', 'https://www.zomato.com/praha/caf%C3%A9-a-restaurant-z%C3%A1ti%C5%A1%C3%AD-kunratice-praha-4/daily-menu')
>>> req.content
'<html><body><h1>500 Server Error</h1>\nAn internal server error occured.\n</body></html>\n'

If I open this URL in a browser, it responds properly. I dug around and found a similar question about the urllib library (500 error with urllib.request.urlopen), but I am not able to adapt it, and in any case I would like to use requests here.

I might be hitting some missing proxy setting here, as suggested for example in another question (Perl File::Fetch Failed HTTP response: 500 Internal Server Error), but can someone explain to me what the proper workaround for this is?

Kube Kubow
  • Have you tried requesting any other page? Maybe you need to add a User-Agent header from Firefox or something like that, because the page doesn't respond to queries by the python request library. – Maurice Nov 05 '16 at 20:07
  • From looking at what happens in the network log when you load this page in a browser, it's at least in part a React app that dynamically renders its content in the browser. You are not likely to have much luck scraping it directly with `requests`. – Bill Gribble Nov 05 '16 at 20:38
  • @Maurice: yes, I had. I have problems just with some of them; the rest are working... – Kube Kubow Nov 05 '16 at 20:50
  • @BillGribble: What would you recommend as a universal approach to scraping web pages in general? – Kube Kubow Nov 05 '16 at 21:07
  • 1
    There's no universal answer. `requests` is great for fetching stuff with HTTP(S) but if you need to see what's in a browser you need a browser. I have had good luck with [Selenium](http://docs.seleniumhq.org/) when you simply have to scrape a page that's mostly rendered by Javascript. It gives you an API that lets you drive and query a running browser. If you can find the underlying API endpoints that the page is pulling its data from you can use `requests` and you'll be better off. – Bill Gribble Nov 05 '16 at 21:34

3 Answers


One thing that differs from the browser request is the User-Agent; you can set it with requests like this:

import requests

url = 'https://www.zomato.com/praha/caf%C3%A9-a-restaurant-z%C3%A1ti%C5%A1%C3%AD-kunratice-praha-4/daily-menu'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.90 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.status_code)  # should be 200

Edit

Some web applications will also check the Origin and/or the Referer headers (for example for AJAX requests); you can set these in a similar fashion to User-Agent.

headers = {
    'Origin': 'http://example.com',
    'Referer': 'http://example.com/some_page'
}

Remember, you are setting these headers to basically bypass checks so please be a good netizen and don't abuse people's resources.
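Putting the two snippets together, a minimal sketch (the Origin/Referer values below are placeholders; copy whatever your browser actually sends for the page in question):

```python
import requests

url = 'https://www.zomato.com/praha/caf%C3%A9-a-restaurant-z%C3%A1ti%C5%A1%C3%AD-kunratice-praha-4/daily-menu'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/54.0.2840.90 Safari/537.36',
    'Origin': 'https://www.zomato.com',        # placeholder values; set these to
    'Referer': 'https://www.zomato.com/praha'  # what the browser actually sends
}

# Build the request without sending it, just to confirm the headers are attached;
# requests.get(url, headers=headers) would perform the actual fetch.
prepared = requests.Request('GET', url, headers=headers).prepare()
print(prepared.headers['User-Agent'])
```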

Ionut Ticus

The User-Agent, and also other header elements, could be causing your problem.

When I came across this error, I captured a regular browser request with Wireshark, and it turned out there were things other than just the User-Agent in the headers that the server expected to be there.

After emulating the headers sent by the browser in Python requests, the server stopped throwing errors.
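For example, a `Session` can carry a browser-like header set on every request; the values below are illustrative (a typical Firefox-style capture), not authoritative — substitute the ones from your own Wireshark trace:

```python
import requests

# Illustrative browser-like headers; replace with the values from your own capture.
browser_headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
}

session = requests.Session()
session.headers.update(browser_headers)  # every session.get() now sends these
# response = session.get('https://www.zomato.com/praha/caf%C3%A9-a-restaurant-z%C3%A1ti%C5%A1%C3%AD-kunratice-praha-4/daily-menu')
```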


But Wait! There's More!

The above answers did help me on the path to resolution, but I still had to find more things to add to my headers before certain sites would let me in using Python requests. Learning how to use Wireshark (suggested above) was a good new skill for me, but I found an easier way.

Open your browser's developer view (right-click, then click Inspect in Chrome), go to the Network tab, select one of the Names on the left, and expand Request Headers under Headers: you'll get a complete list of what your system is sending to the server. I added the elements I thought were most likely needed, one at a time, testing until my errors went away, then reduced that set to the smallest one that still worked. In my case, with my headers already carrying User-Agent to deal with other code issues, I only needed to add the Accept-Language key to deal with a few other sites. See the picture below as a guide to the text above.
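The reduced set described above looks roughly like this (the Accept-Language value is an example; copy whatever your own browser reports in the Network tab):

```python
# Smallest header set that worked in my case: User-Agent plus Accept-Language
# (values here are illustrative; use the ones your browser actually sends).
minimal_headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/54.0.2840.90 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}
# requests.get(url, headers=minimal_headers) then succeeds where a bare call 500s
```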

I hope this process helps others find ways to eliminate undesirable return codes from Python requests where possible.

Screen Shot of my Developer/Inspect Window in Chrome

Thom Ives