
I'm learning how to use the Python requests library (Python 3), and I'm trying to make a simple requests.get call to fetch the HTML from several websites. It works for most of them, but there is one I'm having trouble with.

When I request http://es.rs-online.com/, everything works fine:

In [1]: import requests
   ...: html = requests.get("http://es.rs-online.com/")
In [2]: html
Out[2]: <Response [200]>

However, when I try it with http://es.farnell.com/, Python never gets a response and keeps waiting forever. If I set a timeout, no matter how long, requests.get() is always interrupted by the timeout and by nothing else. I have also tried adding headers, but that didn't solve the issue. I also don't think the error has anything to do with the proxy I'm using, since I can open this website in my browser. Currently, my code looks like this:

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
html = requests.get("http://es.farnell.com/", headers=headers, timeout=5, allow_redirects=True)

After 5 seconds, I get the expected timeout error:

ReadTimeout: HTTPConnectionPool(host='es.farnell.com', port=80): Read timed out. (read timeout=5)
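The error above is a read timeout rather than a connect timeout, which suggests a connection was opened but no response data arrived in time. As a minimal sketch, the two cases can be told apart by passing a (connect, read) tuple as the timeout and inspecting the exception type. The helper names describe_failure and probe, and the default timeout values, are hypothetical and chosen only for illustration.

```python
import requests


def describe_failure(exc):
    """Map a requests exception to a short, human-readable diagnosis."""
    # ConnectTimeout is checked first because it also subclasses ConnectionError.
    if isinstance(exc, requests.exceptions.ConnectTimeout):
        return "connect timeout: no connection could be established in time"
    if isinstance(exc, requests.exceptions.ReadTimeout):
        return "read timeout: connected, but the server sent no data in time"
    if isinstance(exc, requests.exceptions.ConnectionError):
        return "connection error: DNS failure, refused connection, or similar"
    return "other error: " + type(exc).__name__


def probe(url, connect_timeout=3.05, read_timeout=5):
    """Try a GET and return either the status code or a diagnosis string."""
    try:
        # A (connect, read) tuple sets the two timeouts separately.
        response = requests.get(url, timeout=(connect_timeout, read_timeout))
        return response.status_code
    except requests.exceptions.RequestException as exc:
        return describe_failure(exc)
```

Calling probe("http://es.farnell.com/") would then report which stage failed. If a proxy is in use, "connected" may only mean that the proxy was reached, so the result should be read with that in mind.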

Does anyone know what could be the issue?

Nazim Kerimbekov
ASj

1 Answer


The problem is in your headers. Remember that some sites are more lenient than others about the content of the headers you send. To fix the issue, replace your current headers with:

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36",
    "Upgrade-Insecure-Requests": "1",
    "DNT": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
}

I would also recommend sending the GET request to https://es.farnell.com/ rather than http://es.farnell.com/, removing timeout=5, and removing allow_redirects=True (it is True by default).


All in all, your code should look like this:

import requests

# Browser-like headers; some sites do not respond without them.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36",
    "Upgrade-Insecure-Requests": "1",
    "DNT": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
}
html = requests.get("https://es.farnell.com", headers=headers)
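To check which headers would actually go out, the request can be prepared without sending it. The following is only a sketch: the helper name prepare_get and the constant BROWSER_HEADERS are hypothetical, and the header values are the ones from this answer. It uses Session.prepare_request so that the library's default headers are merged in the same way as for a real call.

```python
import requests

# Header values copied from the answer above.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36"
    ),
    "Upgrade-Insecure-Requests": "1",
    "DNT": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
}


def prepare_get(url, headers=None):
    """Build a GET request without sending it, so its headers can be inspected."""
    session = requests.Session()
    request = requests.Request("GET", url, headers=headers or BROWSER_HEADERS)
    # prepare_request merges the session's default headers with the ones given.
    return session.prepare_request(request)


prepared = prepare_get("https://es.farnell.com/")
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Nothing is sent over the network here; the loop only prints the merged header set, which can be compared with what a browser sends for the same page.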

Hope this helps.

Nazim Kerimbekov
  • I am facing a similar issue with the website "https://www.hamburgsud-line.com/liner/en/liner_services/index.html". I need to get cookie data so that I can use it when making an API call to get tracking data. Any help will be much appreciated. TIA! – Juhi Sharma Jan 06 '21 at 12:23
  • @JuhiSharma Have you tried using [requests.Session()](https://requests.readthedocs.io/en/master/user/advanced/#session-objects)? – Nazim Kerimbekov Jan 07 '21 at 12:09
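
A minimal sketch of the session approach suggested in the comment above: a requests Session keeps cookies between calls, so they can be read after the first page load and are sent automatically on later requests made through the same session. The helper names make_session and fetch_cookies and the 10-second timeout are hypothetical, and whether a given site sets the cookies an API needs is an assumption that has to be checked per site.

```python
import requests


def make_session(headers=None):
    """Create a Session that sends the given headers on every request."""
    session = requests.Session()
    if headers:
        session.headers.update(headers)
    return session


def fetch_cookies(session, url, timeout=10):
    """Load a page through the session and return its cookies as a dict."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    # Cookies set by the response (and any redirects) are stored on the session.
    return session.cookies.get_dict()
```

A later call such as session.get(api_url) on the same session object would then include those cookies without any extra code.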