
I would like to go through the web pages below and save the respective images using Python:

Examples (10,000 URLs in total):

https://cryptopunks.app/cryptopunks/cryptopunk0001.png
https://cryptopunks.app/cryptopunks/cryptopunk0002.png
https://cryptopunks.app/cryptopunks/cryptopunk9999.png

My goal is to use the images afterwards to train a GAN for a project and generate new images with it.

I tried adapting the code below (which loops through web pages and downloads all images) to the example URLs above, but unfortunately I cannot make it work:

from bs4 import BeautifulSoup as soup
import requests, contextlib, re, os

@contextlib.contextmanager
def get_images(url: str):
    d = soup(requests.get(url).text, 'html.parser')
    # collect [src, extension] pairs for every image linked from the listing page
    yield [[i.find('img')['src'], re.findall(r'(?<=\.)\w+$', i.find('img')['alt'])[0]]
           for i in d.find_all('a') if re.findall(r'/image/\d+', i['href'])]

n = 3  # end value
os.system('mkdir MARCO_images')  # added for automation; the folder can be named anything, as long as the same name is used when saving below
for i in range(n):
    with get_images(f'https://marco.ccr.buffalo.edu/images?page={i}&score=Clear') as links:
        print(links)
        for c, [link, ext] in enumerate(links, 1):
            with open(f'MARCO_images/MARCO_img_{i}{c}.{ext}', 'wb') as f:
                f.write(requests.get(f'https://marco.ccr.buffalo.edu{link}').content)

Could anyone please help me out?

Thanks a lot!

Sterik

2 Answers


I have gone ahead and used only requests and os; the images will be saved in the "New folder" directory (or any folder you name). However, this is a rather slow method for downloading 9,999 images, so you can use threading (threads carry out the calls to the function concurrently, which is faster).

import requests
import os
import threading

os.mkdir("New folder")


def get_images(url, index):
    r = requests.get(url)

    # write the raw response bytes to disk as a PNG
    with open(f"New folder/image_{index}.png", "wb") as img:
        img.write(r.content)


n = 10000
for i in range(1, n):
    t1 = threading.Thread(target=get_images, args=(f"https://cryptopunks.app/cryptopunks/cryptopunk{i}.png", i))
    t1.start()
    # As you know the website, you can access an image just by providing its number in the URL;
    # the loop runs from 1 to 9999 as you wanted.

On my computer it took about 7 seconds to download 9 images without threads, and only about 2 seconds for the same 9 images with threads, because the downloads run concurrently instead of waiting for each request to finish.
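
If spawning one thread per image becomes unwieldy (the loop above creates up to 10,000 threads), a bounded thread pool is an alternative. This is only a minimal sketch using concurrent.futures rather than the code above, and the MAX_WORKERS value is an assumption you would tune yourself:

import os
import requests
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # assumed pool size; tune it to what the server tolerates

os.makedirs("New folder", exist_ok=True)


def get_image(index):
    url = f"https://cryptopunks.app/cryptopunks/cryptopunk{index}.png"
    r = requests.get(url)
    r.raise_for_status()  # surface HTTP errors instead of writing error pages to disk
    with open(f"New folder/image_{index}.png", "wb") as img:
        img.write(r.content)


with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    pool.map(get_image, range(1, 10000))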

Tech123
  • I used the script and it worked well. I noticed, though, that not all files can be used: after downloading 40 images the server blocks further PNG downloads. Downloading is possible again after around 30 seconds. Is it possible to download in batches of 40 PNGs and continue after a break of about 30 seconds? – Sterik Jun 28 '22 at 12:29
  • You could do it in a number of ways: you could rerun the program for every 40 images, but that would be tedious, or you could simply use the time module to pause for 30 seconds every 40 images and then start downloading again. – Tech123 Jun 28 '22 at 16:20
  • I have posted another answer with the time module. – Tech123 Jun 28 '22 at 16:27
  • The first time-module version didn't work for me somehow; the website got too much traffic and the images were not downloaded as I thought they would be, so I solved it manually :D I tried the second option you provided too and it also didn't work. In both cases the script requests all the pictures right away, so only the first 40 (or not even that many) are downloaded as usable images. Thanks a lot though – Sterik Jun 29 '22 at 00:12
import requests
import os
import threading
import time

os.mkdir("New folder")


def get_images(url, index):
    r = requests.get(url)

    # write the raw response bytes to disk as a PNG
    with open(f"New folder/image_{index}.png", "wb") as img:
        img.write(r.content)


pause_at = 40  # where we want to pause the program for 30 seconds
n = 10000
for i in range(1, n):
    if i == pause_at:  # every 40 images the program pauses and waits for 30 seconds
        print("Pausing for 30 seconds")
        time.sleep(30)  # then it continues with the next 40 images, repeating until there are no more
        pause_at += 40  # increase the pause_at threshold by 40
    t1 = threading.Thread(target=get_images, args=(f"https://cryptopunks.app/cryptopunks/cryptopunk{i}.png", i))
    t1.start()
    # As you know the website, you can access an image just by providing its number in the URL;
    # the loop runs from 1 to 9999 as you wanted.
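
If the threaded version still trips the rate limit, as the comments suggest (the pause only delays starting new threads, while requests already in flight keep hitting the server), a purely sequential sketch may behave more predictably. The batch size of 40 and the 30-second break are taken from the comment thread, not from the site's documentation:

import os
import time
import requests

os.makedirs("New folder", exist_ok=True)

BATCH_SIZE = 40     # how many images to fetch before pausing (from the comment thread)
PAUSE_SECONDS = 30  # how long to wait between batches (from the comment thread)

for i in range(1, 10000):
    if i % BATCH_SIZE == 0:
        print(f"Downloaded {i} images, pausing for {PAUSE_SECONDS} seconds")
        time.sleep(PAUSE_SECONDS)
    r = requests.get(f"https://cryptopunks.app/cryptopunks/cryptopunk{i}.png")
    if r.ok:
        # only write the file when the request succeeded, to avoid saving error pages as PNGs
        with open(f"New folder/image_{i}.png", "wb") as img:
            img.write(r.content)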
Tech123
  • Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. – Community Jun 28 '22 at 20:47