This is my first time trying to make something in Python, so I decided to write an image scraper. It finds and downloads all the images, but every downloaded file is corrupted. I found some information about wrong Unicode handling in BeautifulSoup, but I did not understand what was wrong. The images are in JPG, GIF and PNG format.

I don't use urllib because the site blocks it (403 Forbidden).

from bs4 import BeautifulSoup
import requests
import time

url = 'some url'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'lxml')
images = []
for img in soup.findAll('img', {'class': '_images'}):
    images.append(img.get('data-url'));

for i in range(len(images)):
    s = images[i]
    cutname = s.split("/")[-1]
    filename = cutname[:cutname.find("?")]
    f = open(filename,'wb') 
    f.write((requests.get(s)).content)
    f.close()
    time.sleep(0.5)
Momo
  • Cannot say without seeing the value you are getting as the image URL – Nabin Jan 20 '19 at 07:47
  • Set 'url' to https://www.webtoons.com/en/comedy/bluechair/ep-366-husk/viewer?title_no=199&episode_no=538 – Momo Jan 20 '19 at 09:49

1 Answer

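It seems you need to pass some headers with each request. Without a browser-like User-Agent and a Referer, the server most likely answers with an error response instead of the image, and that response is what gets saved to disk, which would explain the "corrupted" files. The bottom part of the code, which writes the image file out, is by @Deepspace.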

from bs4 import BeautifulSoup
import requests

url = "https://www.webtoons.com/en/comedy/bluechair/ep-366-husk/viewer?title_no=199&episode_no=538"
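# Browser-like headers; the Referer is set to the page that embeds the images.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
    'Referer': url,
}

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
# Each <img> inside #_imageList keeps the image address in its data-url attribute.
imgs = [link['data-url'] for link in soup.select('#_imageList img')]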

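for counter, img in enumerate(imgs, start=1):
    # Send the same headers with each image request.
    response = requests.get(img, stream=True, headers=headers)

    if not response.ok:
        # Report the failure and skip the file instead of saving an error response.
        print(response)
        continue

    # Note: '.jpg' is hard-coded, so GIF and PNG images get a misleading extension.
    filename = 'image' + str(counter) + '.jpg'
    with open(filename, 'wb') as handle:
        # Write the body in 1 KiB blocks rather than holding it all in memory.
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)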
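
The loop above saves every file as .jpg, while the question mentions JPG, GIF and PNG images and tried to cut the name out of the URL with find("?"), which returns -1 (and so drops the last character) when there is no query string. As a rough sketch only, the two helpers below show one way to handle both points with the standard library. The names filename_from_url and sniff_image_type and the example.com URL are made up for illustration and are not part of the original answer.

```python
import posixpath
from urllib.parse import urlsplit

# Leading bytes ("magic numbers") of the three formats mentioned in the question.
_SIGNATURES = (
    (b"\xff\xd8\xff", "jpg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
)


def filename_from_url(url, fallback="image"):
    """Hypothetical helper: last path segment of url, without the query string."""
    path = urlsplit(url).path  # the query string and fragment are not part of .path
    return posixpath.basename(path) or fallback


def sniff_image_type(data):
    """Hypothetical helper: 'jpg', 'png' or 'gif' based on the first bytes, else None."""
    for signature, kind in _SIGNATURES:
        if data.startswith(signature):
            return kind
    return None


# Illustrative values only; example.com is a placeholder URL.
print(filename_from_url("https://example.com/comics/page-01.png?type=q90"))
print(sniff_image_type(b"<!DOCTYPE html>"))
```

If sniff_image_type returns None for the first block of a download, the server probably sent something other than an image (for example an HTML error page), which is a quick way to check why a saved file looks corrupted.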
QHarr
  • Thanks a lot! Could you please explain the headers={...} line, or tell me where I can read about it? – Momo Jan 20 '19 at 20:40
  • Hi, see the custom headers section of the documentation [here](http://docs.python-requests.org/en/master/user/quickstart/) – QHarr Jan 20 '19 at 20:42