
I have a nice URL structure to loop through:

https://marco.ccr.buffalo.edu/images?page=0&score=Clear
https://marco.ccr.buffalo.edu/images?page=1&score=Clear
https://marco.ccr.buffalo.edu/images?page=2&score=Clear
...

I want to loop through each of these pages and download the 21 images (JPEG or PNG) on each one. I've seen several Beautiful Soup examples, but I'm still struggling to get something that will download multiple images and loop through the URLs. I think I can use urllib to loop through each URL like this, but I'm not sure where the image saving comes in. Any help would be appreciated, and thanks in advance!

for i in range(0,10):
    urllib.urlretrieve('https://marco.ccr.buffalo.edu/images?page=' + str(i) + '&score=Clear')

I was trying to follow this post but I was unsuccessful: How to extract and download all images from a website using beautifulSoup?

Andre

1 Answer


You can use requests:

from bs4 import BeautifulSoup as soup
import requests, contextlib, re, os

@contextlib.contextmanager
def get_images(url: str):
    # Parse the page and yield [src, extension] pairs for every image link.
    # Raw strings are used for the regexes to avoid invalid-escape warnings.
    d = soup(requests.get(url).text, 'html.parser')
    yield [[i.find('img')['src'], re.findall(r'(?<=\.)\w+$', i.find('img')['alt'])[0]]
           for i in d.find_all('a') if re.findall(r'/image/\d+', i['href'])]

n = 3  # number of pages to scrape
os.makedirs('MARCO_images', exist_ok=True)  # added for automation purposes; folder can be named anything, as long as the same name is used when saving below
for i in range(n):
    with get_images(f'https://marco.ccr.buffalo.edu/images?page={i}&score=Clear') as links:
        print(links)
        for c, [link, ext] in enumerate(links, 1):
            # the filename combines the page index i and the per-page counter c
            with open(f'MARCO_images/MARCO_img_{i}{c}.{ext}', 'wb') as f:
                f.write(requests.get(f'https://marco.ccr.buffalo.edu{link}').content)

Now, inspecting the contents of the MARCO_images directory yields:

print(os.listdir('/Users/ajax/MARCO_images'))

Output:

['MARCO_img_1.jpg', 'MARCO_img_10.jpg', 'MARCO_img_11.jpg', 'MARCO_img_12.jpg', 'MARCO_img_13.jpg', 'MARCO_img_14.jpg', 'MARCO_img_15.jpg', 'MARCO_img_16.jpg', 'MARCO_img_17.jpg', 'MARCO_img_18.jpg', 'MARCO_img_19.jpg', 'MARCO_img_2.jpg', 'MARCO_img_20.jpg', 'MARCO_img_21.jpg', 'MARCO_img_3.jpg', 'MARCO_img_4.jpg', 'MARCO_img_5.jpg', 'MARCO_img_6.jpg', 'MARCO_img_7.jpg', 'MARCO_img_8.jpg', 'MARCO_img_9.jpg']
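As an aside, the `@contextlib.contextmanager` decorator used in `get_images` turns an ordinary generator function into a context manager: everything before the `yield` runs on entering the `with` block, the yielded value is what gets bound after `as`, and anything after the `yield` runs on exit. A minimal standalone illustration (the `events` list and `demo` function are just for this example):

```python
import contextlib

events = []

@contextlib.contextmanager
def demo():
    events.append('setup')     # runs when the with-block is entered
    yield [1, 2, 3]            # the value bound after `as`
    events.append('teardown')  # runs when the with-block exits

with demo() as items:
    assert items == [1, 2, 3]

# After the block, both phases have run: events == ['setup', 'teardown']
```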
Ajax1234
  • I was writing my response when you updated, haha. This works great! I set a directory instead of creating one, but otherwise it works great! Thank you. – Andre Jul 30 '18 at 18:18
  • @Andre Glad to help! – Ajax1234 Jul 30 '18 at 18:19
  • Is there a way to add some logic to check whether the image on that page is a JPEG or PNG, or will the type not matter if a PNG is saved as .jpg? – Andre Jul 30 '18 at 18:23
  • Seems to work, but I don't see it downloading beyond page 1 even though n is set to 3. Also, can you explain the `yield` line within your function, just to help me understand a bit? Thanks so much. – Andre Jul 30 '18 at 19:06
  • @Andre Please see my recent edit. It should work now. `yield` is part of the contextmanager implementation. – Ajax1234 Jul 30 '18 at 19:10
  • I think it's now working right. When I run n=3 I get the 63 images. The only aesthetic thing is the naming convention of the images, since there is no ongoing counter appended to the image name, but don't worry about that; I'm super appreciative that this works, thank you! – Andre Jul 30 '18 at 19:19
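Regarding the JPEG-vs-PNG question in the comments: rather than trusting the extension pulled from the `alt` text, one option (a sketch, not part of the original answer) is to derive it from the response's `Content-Type` header. The helper name and the fallback to `jpg` are choices made for this example:

```python
def ext_from_content_type(content_type: str) -> str:
    """Map an HTTP Content-Type header value to a file extension."""
    # Drop any parameters like '; charset=binary', normalize case,
    # then look up the media type; fall back to 'jpg' if unrecognized.
    media_type = content_type.split(';')[0].strip().lower()
    mapping = {'image/jpeg': 'jpg', 'image/png': 'png'}
    return mapping.get(media_type, 'jpg')
```

In the download loop this could be used as, e.g., `r = requests.get(...)` followed by `ext = ext_from_content_type(r.headers.get('Content-Type', ''))`, so each file is saved with an extension matching what the server actually sent.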