I'm trying to download images off a website but I keep getting this error:

HTTP Error 403: Forbidden

This is the function I created to do this:

    import requests
    import urllib.request
    from bs4 import BeautifulSoup

    def download_images(url,knife):
      '''
      download_images is a function which will extract pictures of the knives in csgo
      url is the page from which the images will be extracted
      images of 'knife' will be downloaded
      '''

      page = requests.get(url)

      #Use beautifulsoup to extract the image urls
      soup = BeautifulSoup(page.content, 'html.parser')

      #Pull all image tags from the page that have an alt attribute
      for img in soup.find_all('img', alt=True):
        #Find the url and labels of the knives
        if knife in img['alt']:
          #Download the images with the correct labels
          urllib.request.urlretrieve(img['src'],'{}.png'.format(img['alt']))

1 Answer


You should change the user agent. There are many user agents that you can use; a list of user agents is available here. To make urllib use a different user agent, you can add code like the sketch below. Alternatively, you could use wget with the -U option followed by a user-agent string (for example, 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4').

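A minimal sketch of the urllib change, assuming the server only checks the User-Agent header (the agent string below is just an example; any current browser string should work):

    import urllib.request

    #Install a global opener whose User-Agent looks like a browser, so that
    #urlretrieve() no longer sends the default 'Python-urllib/x.y' agent
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent',
                          'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) '
                          'Gecko/20070802 SeaMonkey/1.1.4')]
    urllib.request.install_opener(opener)

    #After this, your existing call can stay as it is:
    #urllib.request.urlretrieve(img['src'], '{}.png'.format(img['alt']))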

Implementing WGET

    import os
    import requests
    from bs4 import BeautifulSoup

    def download_images(url,knife):
      '''
      download_images is a function which will extract pictures of the knives in csgo
      url is the page from which the images will be extracted
      images of 'knife' will be downloaded
      '''

      page = requests.get(url)

      #Use beautifulsoup to extract the image urls
      soup = BeautifulSoup(page.content, 'html.parser')

      #Pull all image tags from the page that have an alt attribute
      for img in soup.find_all('img', alt=True):
        #Find the url and labels of the knives
        if knife in img['alt']:
          #Download the image with wget, spoofing a browser User-Agent and
          #saving the file under the image's alt text
          cmd = ("wget --convert-links -U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) "
                 "Gecko/20070802 SeaMonkey/1.1.4' -O '{}.png' '{}'")
          os.system(cmd.format(img['alt'], img['src']))
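
If you would rather not shell out to wget at all, a rough equivalent using only requests (assuming the server merely rejects the default Python user agent) is to send a browser-like User-Agent on the image request itself and write the bytes to disk:

    import requests
    from bs4 import BeautifulSoup

    #Example browser-like agent; any current browser string should work
    HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) '
                             'Gecko/20070802 SeaMonkey/1.1.4'}

    def download_images(url, knife):
      page = requests.get(url, headers=HEADERS)
      soup = BeautifulSoup(page.content, 'html.parser')
      for img in soup.find_all('img', alt=True):
        if knife in img['alt']:
          #Fetch the image with the same headers, then save the bytes
          resp = requests.get(img['src'], headers=HEADERS)
          with open('{}.png'.format(img['alt']), 'wb') as f:
            f.write(resp.content)

Note that setting the header only on the initial requests.get(url, ...) call does not help urllib.request.urlretrieve, which still sends urllib's default agent when it fetches the image, so the 403 can persist after that change alone.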
  • I tried changing the user agent by doing the following: page = requests.get(url,headers={ 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36' }) but that did not seem to work – Hashim Abu Sharkh Aug 09 '19 at 19:03
  • Should I add this line: urllib.request.urlretrieve(img['src'],'{}.png'.format(img['alt'])) after implementing WGET? WGET did not seem to work either. – Hashim Abu Sharkh Aug 09 '19 at 19:08
  • @HashimAbuSharkh Do you have WGET installed on your computer? – ds_secret Aug 09 '19 at 19:21
  • @HashimAbuSharkh I tried to WGET https://csgostash.com/img/weapons/s/navaja_knife.png, and it did work without giving me a 403. – ds_secret Aug 09 '19 at 19:24
  • Yes, I installed it earlier using pip3 install wget, but how did you manage to make it work? Can you post your code, please? – Hashim Abu Sharkh Aug 09 '19 at 22:30
  • @HashimAbuSharkh I did it on the command line. Try it on the command line first. Also, wget on PyPI is not the original GNU WGET. Type `wget` into your command line to see if it is installed. If WGET is installed, type `wget --convert-links -U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4' https://csgostash.com/img/weapons/s/navaja_knife.png` into the command line and see if it downloads. If WGET is not installed, you should install it from https://www.gnu.org/software/wget/ and then run the above command. – ds_secret Aug 12 '19 at 16:32