I am trying to download materials from a website that does not allow bots. I managed to pass a header to `Request` this way:

import urllib.request
from bs4 import BeautifulSoup as bs

url = 'https://www.superdatascience.com/machine-learning/'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
res = urllib.request.urlopen(req)
soup = bs(res, 'lxml')
# href=True skips anchors that have no href attribute
links = soup.find_all('a', href=True)
res.close()
hrefs = [link['href'] for link in links]

# Keep only the links that mention zip
zips = [href for href in hrefs if 'zip' in href]
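
As a side note, the `'zip' in href` test also matches any link that merely contains the letters "zip", and it leaves relative links unresolved. A minimal sketch of a stricter filter, assuming the archives are linked with a path ending in `.zip` (the helper name `zip_links` is made up for illustration):

```python
from urllib.parse import urljoin, urlparse


def zip_links(hrefs, base_url):
    """Return absolute URLs from hrefs whose path ends with .zip."""
    result = []
    for href in hrefs:
        # Resolve relative links against the page they were found on
        absolute = urljoin(base_url, href)
        # Check only the path, so a query string does not hide the extension
        if urlparse(absolute).path.lower().endswith('.zip'):
            result.append(absolute)
    return result
```

With the variables above it would be called as `zips = zip_links(hrefs, url)`.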

I hope Kiril forgives me for that. Honestly, I don't mean anything unethical, I just want to do the download programmatically.

Now that I have all the links to the zip files, I need to retrieve their content. Calling `urllib.request.urlretrieve` directly fails, apparently because the site refuses downloads from a script. So I'm doing it through `URLopener`:

opener = urllib.request.URLopener()
opener.version = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'    
for zip in zips:
    file_name = zip.split('/')[-1]
    opener.retrieve(zip, file_name)

The above returned:

HTTPError: HTTP Error 301: Moved Permanently

Thinking I had made a silly mistake, I tried again without the loop and set the header through the `addheaders` attribute:

opener = urllib.request.URLopener()
opener.addheaders = [('User-agent','Mozilla/5.0')]
opener.retrieve(zips[1], 'file.zip')

But it returned the same response, and no file was downloaded.

I have two questions:

1. Is there something wrong with my code, and if so, what did I do wrong?
2. Is there another way to make this work?

Thanks a lot in advance!

Vlad
  • The 301 isn't an error, it's a valid response from the server telling you precisely what the problem is - the resource has been moved. https://stackoverflow.com/q/22150023/7432 – Bryan Oakley Jan 07 '18 at 20:30
  • Bryan Oakley I've edited the question. Would you know how to handle 301 response within `opener.retrieve`? – Vlad Jan 07 '18 at 20:55
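
On handling the 301: the legacy `URLopener` class is deprecated and raises `HTTPError` on a redirect instead of following it. An opener created with `urllib.request.build_opener()` includes a redirect handler by default, so one possible approach is sketched below. This is only a sketch: the helper names `download` and `file_name_from_url` and the fallback name `download.zip` are made up for illustration, and whether this particular site accepts the request is an assumption that has not been verified.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def file_name_from_url(url, default='download.zip'):
    """Take the last path component of the URL as the local file name."""
    name = os.path.basename(urlparse(url).path)
    return name or default


def download(url, dest=None, user_agent='Mozilla/5.0'):
    """Fetch url, following redirects, and save the body to dest."""
    # build_opener() installs HTTPRedirectHandler by default,
    # so a 301 response is followed rather than raised as an error
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent', user_agent)]
    with opener.open(url) as response:
        # geturl() reports the final URL after any redirects
        dest = dest or file_name_from_url(response.geturl())
        with open(dest, 'wb') as out:
            # Stream to disk instead of reading the whole archive into memory
            shutil.copyfileobj(response, out)
    return dest
```

With the list from the question, the loop would become `for zip_url in zips: download(zip_url)`. The loop variable is renamed because `zip` shadows the Python built-in of the same name.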
