1

I have a school project, an e-commerce website, for which I need a large number of images. I based my download code on a video from the YouTube channel John Watson Rooney. However, I ran into a problem: for about half of the images the URL is changed to '', so I can't continue downloading.

import requests
from bs4 import BeautifulSoup
import os
import base64

def imagedown(url, folder):
    try:
        os.mkdir(os.path.join(os.getcwd(), folder))
    except:
        pass
    os.chdir(os.path.join(os.getcwd(), folder))
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    images = soup.find_all('img', class_='styles__productImage--3ZNPD')
    for image in images:
        name = image['alt']
        link = image['src']
        with open(name.replace('/', '').replace('?', '').replace('=', '').replace('|', '') + '.jpg', 'wb') as f:
            im = requests.get(link)
            f.write(im.content)
            print('Writing: ', name)

imagedown('https://www.redbubble.com/shop/?gender=gender-men&iaCode=u-tees&page=2&query=dog&sortOrder=relevant&style=u-tee-regular-crew', 'Images')


I don't know where the error lies. Please help me, thanks.

Jung Hana
  • 23
  • 3
  • You don't have to download this image. You already have it, it is just `base64` encoded. Check [this thread](https://stackoverflow.com/questions/33870538/how-to-parse-data-uri-in-python) for solutions how to decode it. – RJ Adriaansen May 23 '21 at 07:07

2 Answers

0

Those images are encoded as base64 strings, so you don't need to download them. You can simply decode and save them as follows:

import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
import os
import re

def imagedown(url, folder):
    # Create the target folder if it does not exist yet
    os.makedirs(os.path.join(os.getcwd(), folder), exist_ok=True)
    os.chdir(os.path.join(os.getcwd(), folder))
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    images = soup.find_all('img', class_='styles__productImage--3ZNPD')
    for image in images:
        name = image['alt']
        link = image['src']
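        ext = None
        data = None

        if link.startswith('data:'):
            # A data: URI embeds the image itself; urlopen decodes it directly
            with urlopen(link) as response:
                data = response.read()
            # Take the extension from the MIME type, e.g. data:image/gif;base64,...
            match = re.match(r'data:image/(\w+)', link)
            ext = '.' + match.group(1) if match else '.bin'
        else:
            ext = os.path.splitext(link)[1]
            data = requests.get(link).content
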
        with open(name.replace('/', '').replace('?', '').replace('=', '').replace('|', '') + ext, 'wb') as f:
            f.write(data)
            print('Writing: ', name)

imagedown('https://www.redbubble.com/shop/?gender=gender-men&iaCode=u-tees&page=2&query=dog&sortOrder=relevant&style=u-tee-regular-crew', 'Images')
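As an alternative to `urlopen`, a data URI can also be decoded by hand with the standard `base64` module, which is roughly what the comment on the question points to. The sketch below is only an illustration: the helper name `decode_data_uri` is hypothetical, and it assumes the URI follows the usual `data:<mime>;base64,<payload>` layout.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_uri(uri):
    """Split a data: URI into (mime_type, raw_bytes).

    Hypothetical helper for illustration; handles base64 and
    percent-encoded payloads.
    """
    if not uri.startswith('data:'):
        raise ValueError('not a data URI')
    # Everything before the first comma is the header, the rest is the payload
    header, _, payload = uri[len('data:'):].partition(',')
    parts = header.split(';')
    mime_type = parts[0] or 'text/plain'
    if 'base64' in parts[1:]:
        return mime_type, base64.b64decode(payload)
    return mime_type, unquote_to_bytes(payload)


# A tiny GIF data URI, similar to the placeholders lazy-loading pages often use
uri = 'data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7'
mime, data = decode_data_uri(uri)
print(mime, len(data), data[:6])
```

Note that decoding such a placeholder only yields the placeholder itself, not the product photo, so this helps with inspecting what the page returned rather than with obtaining the real image.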
Isma
  • 14,604
  • 5
  • 37
  • 51
  • Is there any way to download the image as a jpg? I can't view the image as a gif. It's weird: when I inspect the page I can still see the normal URL, but when using Python it's base64 encoded. – Jung Hana May 23 '21 at 09:16
  • Can you send one of the URLs? – Isma May 23 '21 at 10:08
  • https://ih1.redbubble.net/image.1911237172.8285/ssrco,classic_tee,mens,101010:01c5ca27c6,front_alt,square_product,600x600.jpg This is one of the product URLs that I use to download – Jung Hana May 23 '21 at 14:57
  • Yeah, but the jpeg somehow got changed to a gif behind the scenes. That problem prevented me from continuing to download the images. – Jung Hana May 28 '21 at 01:20
  • Sorry for the delay, can you help me please? I don't know why the URL was changed to a gif. – Jung Hana May 28 '21 at 06:17
  • Do you have to use BeautifulSoup? Maybe it will be easier with Selenium. – Isma May 28 '21 at 07:12
  • See my new answer for a working solution ;-) – Isma May 29 '21 at 09:32
0

The problem is that the images you are trying to get are loaded dynamically as you scroll down the page. `requests` only receives the initial HTML, in which the images that have not loaded yet are base64 placeholders. Instead of BeautifulSoup, you could try a browser automation tool like Selenium:

First, install it with `pip install selenium`.

Download the Google Chrome driver (ChromeDriver) and save it. The latest version of Google Chrome must be installed on your computer as well.

Here is the script to download the images. As you can see, I implemented a loop that scrolls down slowly to make sure the images are loaded before downloading them:

import os
import time
import urllib.request

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Set the path to your chromedriver.exe (Selenium 4 passes it through a Service object)
driver = webdriver.Chrome(service=Service("C:\\chromedriver.exe"))
driver.maximize_window()
# Launch the URL
driver.get("https://www.redbubble.com/shop/?gender=gender-men&iaCode=u-tees&page=2&query=dog&sortOrder=relevant&style=u-tee-regular-crew")
driver.implicitly_wait(3)

body = driver.find_element(By.TAG_NAME, 'body')

# Scroll the page slowly by sending down-arrow keys and pausing between each stroke
scroll_pause_time = 0.05
scroll_strokes = 400

for i in range(scroll_strokes):
    # Send arrow down
    body.send_keys(Keys.ARROW_DOWN)
    # Wait for the page to load
    time.sleep(scroll_pause_time)

# Find the images (the older find_elements_by_* helpers were removed in Selenium 4)
images = driver.find_elements(By.CLASS_NAME, 'styles__productImage--3ZNPD')
save_folder = 'c:/temp/images/'
# Make sure the target folder exists before saving into it
os.makedirs(save_folder, exist_ok=True)

for image in images:
    name = image.get_attribute('alt')
    link = image.get_attribute('src')
    ext = os.path.splitext(link)[1]
    urllib.request.urlretrieve(link, save_folder + name.replace('/', '').replace('?', '').replace('=', '').replace('|', '').replace('"', '') + ext)

# Close the browser
driver.quit()
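The chained `.replace()` calls above remove only a handful of characters, so a product title containing something like `:` or `*` could still produce a filename that Windows rejects. As a sketch, a single regular expression can strip the characters Windows forbids in filenames. The helper name `safe_filename` and the fallback name `image` are illustrative choices, not part of the original script.

```python
import re

# Characters that Windows does not allow in filenames, plus ASCII control characters
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(name, ext, fallback='image'):
    """Build a filesystem-friendly filename from an image's alt text."""
    cleaned = _FORBIDDEN.sub('', name)
    # Collapse runs of whitespace and trim leading/trailing spaces and dots
    cleaned = re.sub(r'\s+', ' ', cleaned).strip(' .')
    return (cleaned or fallback) + ext


print(safe_filename('Dog | "Best Friend" T-Shirt?', '.jpg'))
print(safe_filename('???', '.gif'))
```

In the download loop, the target path could then be built as `save_folder + safe_filename(name, ext)`. Two products with the same alt text would still overwrite each other, so appending a counter may be worth considering.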
Isma
  • 14,604
  • 5
  • 37
  • 51