
The basics of the code are below. I know for a fact that the way I'm retrieving these pages works for other URLs, as I just wrote a script scraping a different page in the same way. However, with this specific URL it keeps throwing "urllib.error.HTTPError: HTTP Error 404: Not Found" in my face. I replaced the URL with a different one (https://www.premierleague.com/clubs), and it works completely fine. I'm very new to Python, so perhaps there's a really basic step or piece of knowledge I haven't found, but the resources I've found online relating to this didn't seem relevant. Any advice would be great, thanks.

Below are the bare bones of the script:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv

myurl = "https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1"

# This call raises urllib.error.HTTPError: HTTP Error 404: Not Found
uClient = uReq(myurl)
Danny
  • It just means that the server is telling you the page doesn't exist. It *doesn't want to serve your script the page*, perhaps because they want to stop people from scraping it. There isn't anything we can do to help though. – Martijn Pieters Sep 18 '18 at 16:38
  • Guessing the problem is that crawling spiders are being blocked; you can change the user agent to circumvent it. See https://stackoverflow.com/questions/48489443/urllib-error-httperror-http-error-404-not-found-python for more information (the solution prescribed in that post seems to work for your URL too). If you want to use urllib, [this post](https://stackoverflow.com/questions/24226781/changing-user-agent-in-python-3-for-urrlib-request-urlopen) tells you how to alter the user agent. – jpw Sep 18 '18 at 16:39
  • Ah I see, interesting. Didn't know that was a thing. @jpw that's pretty much solved it for me, so if you fancy making this an answer I can choose it as the solution. – Danny Sep 18 '18 at 16:45
  • @Danny Sure, reposting my comment as the answer. Happy to have helped. – jpw Sep 18 '18 at 16:49

2 Answers


The problem is most likely that the site you are trying to access is actively blocking crawlers; you can change the user agent to circumvent this. See https://stackoverflow.com/questions/48489443/urllib-error-httperror-http-error-404-not-found-python for more information (the solution prescribed in that post seems to work for your URL too).

If you want to use urllib, this post (https://stackoverflow.com/questions/24226781/changing-user-agent-in-python-3-for-urrlib-request-urlopen) tells you how to alter the user agent; a minimal sketch of that approach is below.
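For example, a minimal sketch of setting the user agent with urllib.request.Request (the User-Agent string here is just an illustrative browser-like value, not anything specific to this site):

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

myurl = "https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1"

# Identify as a regular browser instead of the default Python-urllib agent.
# The User-Agent value below is only an example browser string.
req = Request(myurl, headers={"User-Agent": "Mozilla/5.0"})
uClient = urlopen(req)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
print(page_soup.title)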

jpw

You are getting a 404 because the server is responding as if the page doesn't exist.

You can try a different module such as requests.

Here is the same request using requests:

import requests

resp = requests.get("https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1")

print(resp.text)  # prints the page's HTML source

I hope it works!
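If requests alone still returns a 404, it may be for the same user-agent reason described in the other answer; here is a minimal sketch combining the two ideas (the User-Agent string is just an example browser-like value):

import requests

myurl = "https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1"

# Send a browser-like User-Agent in case the site blocks default client strings.
headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get(myurl, headers=headers)

resp.raise_for_status()  # raises an HTTPError if the status is 4xx/5xx
print(resp.text)         # prints the page's HTML source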

ljmc