7

I am trying to scrape the content of a few websites. For some of them I get a response with status code 200, but for others I get a 404. However, when I open those same websites (the ones returning 404) in a browser, they load fine. What am I missing here?

For example:

import requests

url_1 = "https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1"
url_2 = "https://stackoverflow.com/questions/36516183/what-should-i-use-instead-of-urlopen-in-urllib3"

page_t = requests.get(url_1)
print(page_t.status_code)     # Getting a "Not Found" page and 404 status

page = requests.get(url_2)
print(page.status_code)       # Getting a valid HTML page and 200 status
Paul Vannan

3 Answers

9

The website you mentioned checks for a "User-Agent" header in the request. You can fake the "User-Agent" by passing a dict of custom headers in your requests.get(..) call. This makes the request look like it is coming from an actual browser, and you'll receive the response.

For example:

>>> import requests
>>> url = "https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1"
>>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

# Make request with "User-Agent" Header
>>> response = requests.get(url, headers=headers)
>>> response.status_code
200   # success response

>>> response.text  # will return the website content
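The reason the bare call fails is that requests sends its own default User-Agent string, which sites like this one reject. As a side note, you can see what that default looks like without making any network call (a small sketch using the `default_user_agent` helper in `requests.utils`):

```python
import requests

# The User-Agent header requests sends when you don't override it
print(requests.utils.default_user_agent())
# Something like: python-requests/2.31.0
```

Sites that block scrapers typically match on that "python-requests" prefix, which is why swapping in a browser-like string is enough to get a 200.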
Moinuddin Quadri
5

Some websites do not allow scraping, so you need to provide a header with a user-agent specifying the browser type and operating system. This tells the server that the request comes from a browser and not from some code trying to scrape.

Use this in your code:

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}

response = requests.get(url, headers=headers)

See if this helps.

eyllanesc
  • Your answer worked! Is it illegal to scrape if the sites are trying to block the request? – Paul Vannan Jan 06 '18 at 07:12
  • No, of course it's not. They're just trying to lower the server load and maybe prevent people from stealing data that can be used commercially. As long as you're not stressing the server and don't use the data commercially, they won't even realize! – csabinho Jan 06 '18 at 07:22
  • Simple reason: you wouldn't want to give out huge sets of data you spent a lot of time accumulating just because someone runs a script, with no customer retention or benefit in return. @PaulVannan – Nishant Nischal Chintalapati Jan 08 '18 at 15:38
1

As @csabinho said, the site may be checking whether it's a real (human) request. So you need to add headers to show the website that the request comes from a browser and not from a Python script.

url_t = "https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1"
hdr = {'User-Agent': 'Mozilla/5.0'}
page_t = requests.get(url_t, headers=hdr)
print(page_t.status_code)
# got 200 code for this
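If you're scraping several pages, a variation on the answers above is to set the header once on a requests.Session so every request made through it reuses the same headers (a sketch; the User-Agent string here is just an example):

```python
import requests

# A Session applies its headers to every request made through it
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/55.0.2883.87 Safari/537.36",
})

# Every call below now carries the browser-like User-Agent, e.g.:
# page = session.get("https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1")
```

A Session also reuses the underlying TCP connection, which speeds things up when you fetch many pages from the same host.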
Keyur Potdar