I was trying to download materials off a website that didn't allow bots. I managed to pass a header to Request this way:
import urllib.request
from bs4 import BeautifulSoup as bs

url = 'https://www.superdatascience.com/machine-learning/'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
res = urllib.request.urlopen(req)
soup = bs(res, 'lxml')
links = soup.findAll('a')
res.close()
hrefs = [link.attrs['href'] for link in links]
# now keep only the links to zip files
zips = list(filter(lambda x: 'zip' in x, hrefs))
I hope Kiril forgives me for that; honestly, I didn't mean anything unethical. I just wanted to do it programmatically.
Now that I have all the links to the zip files, I need to download them. urllib.request.urlretrieve takes no headers argument, so the site rejects it as a script. So I'm doing it through URLopener:
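(For what it's worth, one thing I haven't fully tried: urlretrieve does use the globally installed opener, so a User-Agent can be attached that way. This is only a sketch, not verified against the actual site, with the download call left commented out:)

```python
import urllib.request

# urlretrieve itself accepts no headers, but it goes through the
# globally installed opener, so set the User-Agent there instead.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

# urllib.request.urlretrieve(zip_url, file_name)  # would now send the header
```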
opener = urllib.request.URLopener()
opener.version = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
for zip_url in zips:
    file_name = zip_url.split('/')[-1]
    opener.retrieve(zip_url, file_name)
The above raised:
HTTPError: HTTP Error 301: Moved Permanently
Suspecting the loop was somehow the problem, I also tried a single call, setting the header with addheaders instead:
opener = urllib.request.URLopener()
opener.addheaders = [('User-agent','Mozilla/5.0')]
opener.retrieve(zips[1], 'file.zip')
But it returned the same error, and nothing was downloaded.
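(For reference, the alternative direction I was considering is plain urlopen, which follows 301 redirects automatically, combined with the same header. This is only a sketch; the fetch helper and its names are mine, and I haven't tested it against the site:)

```python
import shutil
import urllib.request

def fetch(url, file_name):
    # A Request object lets us attach the User-Agent,
    # and urlopen transparently follows 301/302 redirects.
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as res, open(file_name, 'wb') as out:
        shutil.copyfileobj(res, out)

# for zip_url in zips:
#     fetch(zip_url, zip_url.split('/')[-1])
```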
I have two questions: 1. Is there something wrong with my code, and if so, what? 2. Is there another way to make this work?
Thanks a lot in advance!