
I'm doing something simple, but after an error I ran into some "unsimple" trouble. This probably has to do with the fact that I'm a mathematician and not a coder, and hence prone to bad coding practice.

What I want to do: I have a data frame with 122 489 rows and 6 columns, where one of the columns, named 'Image FS', contains URLs to images that I want to download and save at a particular location locally. Essentially, I want to download and save 122 489 (4K) images that will later be used for a recommendation system. I used urllib.request and the code below, which worked flawlessly.

import urllib.request
import pandas as pd
import time

urlList = pd.read_csv('C:/Users/.../imageFS_20220414.csv',
                      encoding='unicode_escape', engine='python')

# Install an opener with a browser User-Agent once; it applies to every
# subsequent urlretrieve call, so it does not need to be in the loop.
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

tic = time.perf_counter()
counter = 0

for url in urlList['Image FS']:
    # The last URL segment is unique, so it doubles as the file name.
    file_name = url.split('/')[-1]
    storage_location = "E:/Test/" + file_name + ".PNG"
    urllib.request.urlretrieve(url, storage_location)
    counter += 1

toc = time.perf_counter()

print("-----------------------------------------------------------------------")
print(f"Downloaded {counter} images in {(toc - tic)/60:0.4f} minutes.")

Here is the kicker: after about two days of running and roughly 68 000 images completed, I got the error HTTPError: Internal Server Error and the program stopped. I assume this happened because of a temporary disturbance on the website's side or something like that.

Needless to say, I did not want to re-download the 68 000 images already saved, only what was left. So I simply started where the program stopped by putting urlList['Image FS'][68000:] in the for loop and running it again. The script finished, but when I look in the folder where all the images are saved, there are "only" 101 610 images.

My guess is that one can't simply "start where it left off" like I did by passing in the next index.

Questions:

  1. Was there a way to prevent this from happening with a try/except block somehow? To tell the program that if a server error happens, don't stop, but wait 5 minutes and try again, or just skip the broken iteration, if such an iteration exists? (See the sketch after this list.)
  2. Now I need to download the remaining ~21 000 images. How can I do this without restarting the entire program and waiting another four days? I assume I need to identify which images are already downloaded with a check: if the image exists in the download folder, skip to the next URL, else download it.
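
Something along these lines is what I have in mind for both questions, though I'm not sure it is the right approach. A rough, untested sketch; the retry count and wait time are arbitrary choices of mine, and I'm assuming the server error surfaces as urllib.error.HTTPError:

import os
import time
import urllib.error
import urllib.request

def download_with_retry(url, destination, retries=3, wait_seconds=300):
    # Question 2: if the file is already in the download folder, skip it.
    if os.path.exists(destination):
        return True
    # Question 1: on a server error, wait and try again instead of crashing.
    for attempt in range(retries):
        try:
            urllib.request.urlretrieve(url, destination)
            return True
        except urllib.error.HTTPError as e:
            print(f"HTTP {e.code} for {url}; waiting {wait_seconds} s before retrying...")
            time.sleep(wait_seconds)
    # All retries failed: give up on this URL and move on.
    return False

The loop above would then call download_with_retry(url, storage_location) instead of calling urllib.request.urlretrieve directly.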

Any advice that can help me save as much time as possible, or teach me how to avoid problems like these in the future, is greatly appreciated.

Parseval
  • Are your filenames unique? – gvee Apr 20 '22 at 08:55
  • Also: https://stackoverflow.com/questions/82831/how-do-i-check-whether-a-file-exists-without-exceptions/82852#82852 – gvee Apr 20 '22 at 08:56
  • Yes! All of them are unique; the names are embedded in the URLs, and the names are what make the URLs unique. – Parseval Apr 20 '22 at 08:56
  • Also: https://pypi.org/project/retry/ – gvee Apr 20 '22 at 09:00
  • It sounds as if you can check whether the filenames already exist, from the URL, and simply skip those URLs. Of course, that does not guarantee that those images were correctly downloaded, nor that the actual images haven't changed on the host. But that will always be the case. – 9769953 Apr 20 '22 at 10:38
  • @gvee - Thank you for the links, I will check them out! – Parseval Apr 20 '22 at 10:50
  • @9769953 - So let's say I want to download image A. Then I have to check image A against 122 489 images and decide whether or not to download it. Isn't this check quite expensive, since it has to be repeated about 21 000 times? – Parseval Apr 20 '22 at 10:52
  • I think you can safely hold the filenames of 68000 downloaded images in memory. Put them in a set, then test `if file_name in already_downloaded: continue` in your loop, before the `urllib` lines. – 9769953 Apr 20 '22 at 10:57
  • @9769953 - Interesting, seems like it should work. I will try this and get back if I hit any bumps. However, it needs to hold the names of 101 610 images now, since these have already been downloaded. – Parseval Apr 20 '22 at 11:04
  • Ah; I saw 68000, but there is also 101610 mentioned. Anyway, these are the same order of magnitude. For comparison, a random set of 10,000,000 strings of 20 characters each is about 1.5 GB in memory for me. And testing for the occurrence of a string in that set is near-instantaneous. – 9769953 Apr 20 '22 at 12:54
  • @9769953 - Worked well, thanks! If you add a short answer, I can accept it. – Parseval Apr 20 '22 at 19:40
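
A minimal sketch of the set-based skip suggested in the comments, assuming the downloaded files all sit directly in E:/Test/ and that the file names derived from the URLs are unique (as confirmed above):

import os
import urllib.request
import pandas as pd

urlList = pd.read_csv('C:/Users/.../imageFS_20220414.csv',
                      encoding='unicode_escape', engine='python')

# Hold the names of the files already on disk in a set;
# membership tests against a set are effectively O(1).
already_downloaded = set(os.listdir("E:/Test/"))

for url in urlList['Image FS']:
    file_name = url.split('/')[-1] + ".PNG"
    if file_name in already_downloaded:
        continue  # already saved; skip to the next URL
    urllib.request.urlretrieve(url, "E:/Test/" + file_name)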

0 Answers