I'm doing something simple, but after an error I ran into some "unsimple" trouble. This probably has to do with the fact that I'm a mathematician and not a coder, hence the bad coding practice.
What I want to do: I have a data frame with 122 489 rows and 6 columns, where one of the columns, named 'Image FS',
contains URLs to images that I want to download and save at a particular location locally. Essentially, I want to download and save 122 489 (4K) images that will later be used for a recommendation system. I used urllib.request
and the code below, which at first worked flawlessly.
import urllib.request
import pandas as pd
import time

urlList = pd.read_csv('C:/Users/.../imageFS_20220414.csv',
                      encoding='unicode_escape', engine='python')

# Install a browser-like User-Agent once, so the server doesn't reject the requests
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

tic = time.perf_counter()
counter = 0
for url in urlList['Image FS']:
    # Use the last part of the URL as the file name
    file_name = url.split('/')[-1]
    #print("Downloading file: %s" % file_name)
    storage_location = "E:/Test/" + file_name + ".PNG"
    urllib.request.urlretrieve(url, storage_location)
    counter = counter + 1
toc = time.perf_counter()

print("-----------------------------------------------------------------------")
print(f"Downloaded {counter} images in {(toc - tic)/60:0.4f} minutes.")
Here is the kicker: after about 2 days of running and roughly 68 000 completed images, I got the following error: HTTPError: Internal Server Error,
and the program stopped. I assume this happened because of a temporary disturbance on the website's side or something like that.
Needless to say, I did not want to re-download all of the first 68 000 images in the for loop, only what was left. So I simply started where the program had stopped, by putting urlList['Image FS'][68000:]
in the for loop and letting it run again. The script finished, but when I look in the folder where all the images are saved, there are "only" 101 610 images.
My guess is that one can't simply "pick up where it left off" the way I did, by slicing from the next index.
Questions:
- Was there a way to prevent this from happening using a try/except block somehow? To tell the program: if a server error happens, don't stop, but wait 5 minutes and try again, or just skip the broken iteration, if such an iteration exists? (See my attempt at a sketch after this list.)
- Now I need to download the remaining ~ 21 000 images. How can I do this without having to restart the entire program and wait another 4 days? I assume I need to identify which images are already downloaded with a check: if the image exists in the download folder, skip to the next URL, else download it. (Also sketched below.)
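For the first question, here is roughly what I have in mind. This is an untested sketch: max_retries and wait_seconds are numbers I made up, and download_with_retries is just a helper name I invented.

import time
import urllib.error
import urllib.request

def download_with_retries(url, storage_location, max_retries=3, wait_seconds=300):
    # Try to download the file; on a server error, wait 5 minutes and retry
    for attempt in range(max_retries):
        try:
            urllib.request.urlretrieve(url, storage_location)
            return True
        except urllib.error.HTTPError as e:
            print(f"HTTP {e.code} for {url}, attempt {attempt + 1} of {max_retries}")
            time.sleep(wait_seconds)
        except urllib.error.URLError as e:
            print(f"Connection problem for {url}: {e.reason}")
            time.sleep(wait_seconds)
    return False  # give up on this URL and move on to the next one

And for the second question, assuming the same file-naming scheme as in my script above, I imagine something like this inside the loop:

import os

for url in urlList['Image FS']:
    file_name = url.split('/')[-1]
    storage_location = "E:/Test/" + file_name + ".PNG"
    if os.path.exists(storage_location):
        continue  # already downloaded on a previous run, skip it
    download_with_retries(url, storage_location)

Is this a sensible approach, or is there a better/standard way to do it?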
Any advice that helps me save as much time as possible or teaches me how to avoid problems like these in the future is greatly appreciated.