7

I am trying to scrape the content of a few websites. For some of them I get a response with status code 200, but for others I get a 404. However, when I open those same websites (the ones returning 404) in a browser, they load fine. What am I missing here?

For example:

import requests

url_1 = "https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1"
url_2 = "https://stackoverflow.com/questions/36516183/what-should-i-use-instead-of-urlopen-in-urllib3"

page_t = requests.get(url_1)
print(page_t.status_code)     # Getting a "Not Found" page and 404 status

page = requests.get(url_2)
print(page.status_code)       # Getting a valid HTML page and 200 status
Paul Vannan

3 Answers

9

The website you mentioned checks for a "User-Agent" header in the request. You can fake the "User-Agent" by passing a dict of custom headers in your requests.get(..) call. This makes the request look like it is coming from an actual browser, and you'll receive the response.

For example:

>>> import requests
>>> url = "https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1"
>>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

# Make request with "User-Agent" Header
>>> response = requests.get(url, headers=headers)
>>> response.status_code
200   # success response

>>> response.text  # will return the website content
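The reason the bare call fails is that requests sends its own default User-Agent string, which sites like this one reject. As a side note, you can see what that default looks like without making any network call (a small sketch using the `default_user_agent` helper in `requests.utils`):

```python
import requests

# The User-Agent header requests sends when you don't override it
print(requests.utils.default_user_agent())
# Something like: python-requests/2.31.0
```

Sites that block scrapers typically match on that "python-requests" prefix, which is why swapping in a browser-like string is enough to get a 200.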
Moinuddin Quadri
5

Some websites do not allow scraping, so you need to provide a header with a user-agent specifying the browser type and operating system. This tells the server that the request comes from a browser and not from some code trying to scrape.

Use this in your code:

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}

response = requests.get(url, headers=headers)

See if this helps.

eyllanesc
  • Your answer worked! Is it illegal to scrape if the sites are trying to block the request? – Paul Vannan Jan 06 '18 at 07:12
  • No, of course it's not. They're just trying to lower the server load and maybe prevent people from stealing data that can be used commercially. As long as you're not stressing the server and don't use the data commercially, they won't even realize! – csabinho Jan 06 '18 at 07:22
  • Simple reason: you wouldn't want to give out huge sets of data you spent a lot of time accumulating just because someone runs a script, with no customer retention or benefit in return. @PaulVannan – Nishant Nischal Chintalapati Jan 08 '18 at 15:38
1

As @csabinho said, the site may be checking whether it's a real (human) request. So you need to add headers to show the website that the request comes from a browser and not from a Python script.

url_t = "https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1"
hdr = {'User-Agent': 'Mozilla/5.0'}
page_t = requests.get(url_t, headers=hdr)
print(page_t.status_code)
# got 200 code for this
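If you're scraping several pages, a variation on the answers above is to set the header once on a requests.Session so every request made through it reuses the same headers (a sketch; the User-Agent string here is just an example):

```python
import requests

# A Session applies its headers to every request made through it
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/55.0.2883.87 Safari/537.36",
})

# Every call below now carries the browser-like User-Agent, e.g.:
# page = session.get("https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1")
```

A Session also reuses the underlying TCP connection, which speeds things up when you fetch many pages from the same host.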
Keyur Potdar