I have the following script that prints the src path and size of every image on a specified URL:

from requests_html import HTMLSession
from urllib.request import urlopen
from bs4 import BeautifulSoup
from PIL import Image
import requests

url="https://example.com/"

session = HTMLSession()
r = session.get(url)

b  = requests.get(url)
soup = BeautifulSoup(b.text, "lxml")

images = soup.find_all('img')

for img in images:
    if img.has_attr('src') :
        imgsize = Image.open(requests.get(img['src'], stream=True).raw)
        print(img['src'], imgsize.size)

It works fine for some URLs, but for others I get the following error:

PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x10782e900>

Is there a way to overcome this error?

chappers

1 Answer

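Without the specific URL I can't tell exactly why this happens. You can wrap the image handling in a try/except so the script reports the error and continues with the next img instead of crashing. The code below also covers one likely cause: lazy-loaded images whose src is an inline data: placeholder, with the real address in data-src.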

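from io import BytesIO
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from PIL import Image
import requests

url = "https://example.com/"

# One session is reused for the page and for every image request.
session = requests.Session()
r = session.get(url)
soup = BeautifulSoup(r.text, "lxml")

images = soup.find_all('img')

for img in images:
    if img.has_attr('src'):
        try:
            img_link = img['src']
            # Lazy-loaded images often keep a data: placeholder in src
            # and the real address in data-src.
            if img_link.startswith('data:image'):
                img_link = img['data-src']
            # Resolve relative paths such as "/static/logo.png" against the page URL.
            img_link = urljoin(url, img_link)
            # Use the decoded body; the raw stream may still be gzip-compressed.
            response = session.get(img_link)
            imgsize = Image.open(BytesIO(response.content))
            print(img_link, imgsize.size)

        except Exception as e:
            # Report the failure and move on to the next image.
            print(img_link, e)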
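
To see why a particular image fails rather than only skipping it, it can help to look at what the server actually sent. The following is a minimal sketch, not part of the original script: `sniff_payload` is a hypothetical helper name, and it only checks a handful of well-known file signatures. If it reports "html" or "svg", the likely explanation is that the URL returned an error page or a vector image, neither of which Pillow can open.

```python
def sniff_payload(data: bytes) -> str:
    """Guess what a downloaded payload is from its first bytes.

    Hypothetical helper for debugging; it recognises only a few
    common signatures and returns "unknown" for everything else.
    """
    if not data:
        return "empty"
    head = data[:512]
    # Binary image formats are identified by their magic numbers.
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if head.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if head.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "webp"
    # Text payloads: ignore leading whitespace and letter case.
    text = head.lstrip().lower()
    if text.startswith((b"<!doctype html", b"<html")):
        return "html"
    if text.startswith(b"<svg") or (text.startswith(b"<?xml") and b"<svg" in text):
        return "svg"
    return "unknown"
```

Inside the except branch you could then print, for example, `sniff_payload(requests.get(img_link).content)` next to the error message to get a hint about each failing link.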
chitown88