
I'm fairly new to Python. I have been trying to set up a very basic web scraper to help speed up my workday; it is supposed to download images from a section of a website and save them.

I have a list of URLs, and I am trying to use urllib.request.urlretrieve to download all the images.

The output location (savepath) updates so that it adds 1 to the current highest number in the folder.

I've tried a bunch of different ways, but urlretrieve only saves the image from the last URL in the list. Is there a way to download all the images in the URL list?

import urllib.request

to_download = ['url1', 'url2', 'url3', 'url4']

for t in to_download:
    urllib.request.urlretrieve(t, savepath)

This is the code I was trying to use to update the savepath every time

def getNextFilePath(photos):
    highest_num = 0
    for f in os.listdir(photos):
        if os.path.isfile(os.path.join(photos, f)):
            file_name = os.path.splitext(f)[0]
            try:
                file_num = int(file_name)
                if file_num > highest_num:
                    highest_num = file_num
            except ValueError:
                print('The file name "%s" is not an integer. Skipping' % file_name)

    output_file = os.path.join(photos, str(highest_num + 1))
    return output_file
Todd Burus
badger

2 Answers


Are you updating savepath? If you pass the same savepath to every loop iteration, it is likely just overwriting the same file over and over.

Hope that helps, happy coding!
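For example, a minimal sketch of that fix (the folder name and file-name pattern here are just assumptions, not from the question):

```python
import os

# placeholder URLs from the question
to_download = ['url1', 'url2', 'url3', 'url4']
save_folder = 'downloads'  # hypothetical output folder

# build a distinct save path per URL instead of reusing one savepath
save_paths = [os.path.join(save_folder, f'image_{i}.jpg')
              for i, _ in enumerate(to_download)]

# the download loop would then pass a different path each time:
# for url, path in zip(to_download, save_paths):
#     urllib.request.urlretrieve(url, path)
```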

Sam

As suggested by @vks, you need to update savepath (otherwise you save each URL onto the same file). One way to do so is to use enumerate:

from urllib import request

to_download=['https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/']

for i, url in enumerate(to_download):
    save_path = f'website_{i}.txt'
    print(save_path)
    request.urlretrieve(url, save_path)

which you may want to contract into:

from urllib import request

to_download=['https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/']

[request.urlretrieve(url, f'website_{i}.txt') for i, url in enumerate(to_download)]


FOR SECOND PART OF THE QUESTION:

Not sure what you are trying to achieve, but:

def getNextFilePath(photos):
    file_list = os.listdir(photos)
    file_list = [int(s) for s in file_list if s.isdigit()]
    print(file_list)
    max_id_file = max(file_list)
    print(f'max id:{max_id_file}')
    output_file = os.path.join(output_folder, str(max_id_file + 1))
    print(f'output file path:{output_file}')
    return output_file

This will hopefully find all files that are named with digits (IDs), find the highest ID, and return a new file name as max_id + 1.
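To illustrate that filtering step on a hypothetical directory listing (the file names below are made up):

```python
# hypothetical listing: three ID-named files plus two that get skipped
file_list = ['1', '2', '10', 'notes.txt', 'IMG_3.jpg']

# keep only names that are pure digits, as integers
ids = [int(s) for s in file_list if s.isdigit()]

# the next free file name would be max_id + 1
next_id = max(ids) + 1
```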

I guess that this will replace the save_path in your example.

Quickly modifying the above function so that it returns the max ID and not the path, the below code should be a working example using the iterator:

import os
from urllib import request
photo_folder = os.path.curdir


def getNextFilePath(photos):

    file_list = os.listdir(photos)
    print(file_list)
    file_list = [int(os.path.splitext(s)[0]) for s in file_list if os.path.splitext(s)[0].isdigit()]
    if not file_list:
        return 0
    print(file_list)
    max_id_file = max(file_list)
    #print(f'max id:{max_id_file}')
    #output_file = os.path.join(photo_folder, str(max_id_file + 1))
    #print(f'output file path:{output_file}')
    return max_id_file

def download_pic(to_download):
    start_id = getNextFilePath(photo_folder)


    for i, url in enumerate(to_download):
        # i + start_id + 1 so the first new file comes after the current max ID
        save_path = f'{i + start_id + 1}.png'
        output_file = os.path.join(photo_folder, save_path)
        print(output_file)
        request.urlretrieve(url, output_file)


You should add exception handling etc., but this seems to be working, if I understood correctly.
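For instance, a hedged sketch of that exception handling (urllib.error is the relevant module; a malformed URL or failed download is skipped instead of aborting the whole loop — this is an illustration, not production code):

```python
import os
from urllib import request, error

def download_all(urls, folder, start_id=0):
    """Download each URL to <folder>/<id>.png, skipping failures (sketch only)."""
    saved = []
    for i, url in enumerate(urls):
        output_file = os.path.join(folder, f'{i + start_id}.png')
        try:
            request.urlretrieve(url, output_file)
            saved.append(output_file)
        except (error.URLError, ValueError) as exc:
            # a malformed URL or network error is reported but not fatal
            print(f'skipping {url}: {exc}')
    return saved
```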

Je Je
  • that works, thanks. I was trying to use a piece of code (I found) to update the save name every time it runs, so as not to overwrite the files in there and to give the images unique names. Is there a way of getting this functionality with enumerate? – badger Jun 03 '20 at 22:01
  • would you mind editing your question with an example? I guess if you have a list of tuples or a dictionary (dict), yes, enumerate will work. – Je Je Jun 03 '20 at 22:13
  • I have added it in; it generates a new number based off what exists in the output folder already – badger Jun 03 '20 at 22:49