
I am trying to extract all the images from the URL below, but I don't understand the HTTP Error 403: Forbidden. Can it be taken care of during error handling, or can the URL simply not be scraped due to restrictions?

from bs4 import BeautifulSoup
from urllib.request import urlopen
import urllib.request


def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "html.parser")

def get_images(url):
    soup = make_soup(url)
    # this makes a list of bs4 element Tags
    images = soup.find_all('img')
    print(str(len(images)) + " images found.")
    print("downloading to current directory")

    # compile our list of image links
    image_links = [each.get('src') for each in images]
    for each in image_links:
        filename = each.split('/')[-1]
        urllib.request.urlretrieve(each, filename)
    return image_links

get_images("https://opensignal.com/reports/2019/04/uk/mobile-network-experience")
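
For reference, the 403 can at least be caught with standard error handling, though that only makes the script fail gracefully; it does not make the page fetchable. A minimal sketch, assuming the make_soup above (the make_soup_safe name is illustrative):

import urllib.error

def make_soup_safe(url):
    # catch the 403 so the script reports it instead of crashing;
    # this does not make the page fetchable, it only fails gracefully
    try:
        html = urlopen(url).read()
    except urllib.error.HTTPError as err:
        print("Could not fetch {}: {}".format(url, err))
        return None
    return BeautifulSoup(html, "html.parser")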
– EricA

2 Answers


Some sites require you to specify a User-Agent header:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import urllib.request


def make_soup(url):
    # send a browser-like User-Agent so the site does not return 403
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = Request(url, headers=hdr)
    page = urlopen(req)
    return BeautifulSoup(page, "html.parser")
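
Note that urllib.request.urlretrieve does not let you set headers, so the download loop in get_images will still hit the 403 even once make_soup works. A minimal sketch of a download helper that reuses the same header (download_image is an illustrative name, not part of the original answer):

from urllib.request import Request, urlopen

def download_image(img_url, filename):
    # urlretrieve cannot send custom headers, so build the request manually
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = Request(img_url, headers=hdr)
    with urlopen(req) as response, open(filename, 'wb') as out:
        out.write(response.read())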
– Benoit F
  • Thanks @Benoit F, I get it now. However, an error still exists after modifying the make_soup function – EricA May 10 '19 at 11:06
  • You have now scraped all the "src" attributes of images on that page, but you are failing to download them. Take a look here: https://stackoverflow.com/questions/18408307/how-to-extract-and-download-all-images-from-a-website-using-beautifulsoup – Benoit F May 10 '19 at 13:01

You can use this function for image scraping. Relying on the img tag alone is not enough nowadays; we can implement something like the function below, which fulfills the requirement. It does not rely on any particular tag, so wherever an image link is present, it will grab it.

import re

def extract_ImageUrl(soup_chunk):
    urls_found = []
    for tag in soup_chunk.find_all():
        attributes = tag.attrs
        # quick check: does any attribute value mention http?
        if 'http' in str(attributes):
            for links in attributes.values():
                # match absolute URLs ending in .jpg or .png
                if re.match(r'http.*\.(jpg|png)', str(links)):
                    if len(str(links).split()) <= 1:
                        urls_found.append(links)
                    else:
                        # the attribute held several whitespace-separated URLs
                        link = [i.strip() for i in str(links).split()
                                if re.match(r'http.*\.(jpg|png)', str(i))]
                        urls_found = urls_found + link
    print("Found {} image links".format(len(urls_found)))
    return urls_found
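
For example, a usage sketch that feeds it a soup built with the header-based request from the first answer (the URL and variable names here are illustrative):

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

url = "https://opensignal.com/reports/2019/04/uk/mobile-network-experience"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urlopen(req), "html.parser")
for img_url in extract_ImageUrl(soup):
    print(img_url)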

It's an initial attempt and will need refinement to make it better.

– Dhamodharan
  • Thanks @Dhamodharan. As mentioned in the edit, it's not an image but other tags – EricA May 10 '19 at 15:10
  • Maybe you are hitting the site continuously, which got your IP flagged, so it refuses to send a response. Try to use proper headers and proxy rotation. – Dhamodharan May 11 '19 at 06:20