I have a CSV file containing around 1.4 million image links which I want to download. I want to remove repeated links from the CSV and then assign a unique filename to each one (there is an ID in the image link which I am using).
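For reference, the links look roughly like the made-up example below; all that matters for the code later on is that each URL contains an img=<name>.jpg& part, which is what the regex relies on to get the ID.

import re

# made-up URL with the same shape as the real links (img=<name>.jpg&...)
url = 'https://example.com/fetch?img=abc123.jpg&size=large'
name = re.search(r'img=(.*?)&', url, re.I).group(1)
print(name)  # abc123.jpg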
Some of the images have been downloaded and I have saved their links in a text file.
import os

completed_file = 'downloaded_links.txt'
if os.path.isfile(completed_file):
    with open(completed_file) as f:
        downloaded = f.read().split('\n')[:-1]
else:
    downloaded = []
import csv

main_file_name = 'all_images.csv'
with open(main_file_name) as f:
    a = [{k: v for k, v in row.items()} for row in csv.DictReader(f, skipinitialspace=True)]
This is the loop where I am filtering the links:
from random import randint
import re

h = []              # list of filtered dicts
seen = set()        # unique names
seen_links = set()  # unique links

for i in a:
    if i['image_url'] in downloaded:
        continue
    if i['image_url'] in seen_links:
        continue
    seen_links.add(i['image_url'])
    my_name = re.search(r'img=(.*?)&', i['image_url'], re.I).groups()[0]
    while my_name in seen:
        temp = my_name.split('.jpg')
        my_name = temp[0] + str(randint(1, 9)) + '.jpg'
    seen.add(my_name)
    di = {'name': my_name, 'image_url': i['image_url']}
    h.append(di)
The loop does exactly what I want (skip already downloaded links and assign unique filenames to the new ones), but it is taking more than 3 hours to finish. What can I do to speed it up, or how can I rewrite the logic so that it runs faster?
This is how I write to downloaded_links.txt:
with open(completed_file, 'w') as f:  # downloaded is the list containing downloaded links
    for i in downloaded:
        f.write(f'{i}\n')