
I have a CSV file containing around 1.4 million image links which I want to download. I want to remove repeated links from the CSV, then assign a unique filename to each one (there is an ID in the image link which I use for this).

Some of the images have been downloaded and I have saved their links in a text file.

import csv
import os

completed_file = 'downloaded_links.txt'  # links that have already been downloaded, one per line
if os.path.isfile(completed_file):
    with open(completed_file) as f:
        downloaded = f.read().split('\n')[:-1]
else:
    downloaded = []

main_file_name = 'all_images.csv'
with open(main_file_name) as f:
    a = [{k: v for k, v in row.items()} for row in csv.DictReader(f, skipinitialspace=True)]  # one dict per csv row

This is the loop where I am filtering the links

from random import randint
import re

h = []  # list of filtered dicts
seen = set()  # unique names
seen_links = set()  #unique links
for i in a:
    if i['image_url'] in downloaded:
        continue
    if i['image_url'] in seen_links:
        continue
    seen_links.add(i['image_url'])
    my_name = re.search(r'img=(.*?)&', i['image_url'], re.I).groups()[0]
    while my_name in seen:
        temp = my_name.split('.jpg')
        my_name = temp[0] + str(randint(1, 9)) + '.jpg'
    seen.add(my_name)
    di = {'name': my_name, 'image_url': i['image_url']}
    h.append(di)

The loop does exactly what I want (skip already downloaded links and assign unique filenames to the new ones), but it is taking more than 3 hours to do so. What can I do to speed it up, or how can I rewrite the logic so it runs faster?

This is how I write to downloaded_links.txt

with open(completed_file, 'w') as f:  #downloaded is the list containing downloaded links
    for i in downloaded:
        f.write(f'{i}\n')
goku
    Please include an example of the files as text. Always [Provide a Minimal, Reproducible Example (e.g. code, data, errors, current output, expected output), as text](https://stackoverflow.com/help/minimal-reproducible-example). – Trenton McKinney Aug 14 '20 at 18:32
  • 2
    As an aside, `{k: v for k, v in row.items()}` is a wordy way of writing `dict(row)`, which fundamentally makes your whole list comprehension an overengineered way of writing `a = list(csv.DictReader(f, skipinitialspace=True))` – juanpa.arrivillaga Aug 14 '20 at 18:38
  • `my_name = temp[0] + str(randint(1, 9)) + '.jpg'` seems to me like it would not be a good way to go about things. As the number of new names you have to generate increases, the probability of generating a unique name will fall. Can you give an example of the sort of data you are working with? – juanpa.arrivillaga Aug 14 '20 at 18:41
    Also, this: `if i['image_url'] in downloaded:` will be very slow. `downloaded` should be a set, not a list (a sketch of this change follows the comment thread). – juanpa.arrivillaga Aug 14 '20 at 18:42
  • @juanpa.arrivillaga well the link itself contains an ID for the image (98% of the time it is unique, the while loop is to make sure the 2% gets handled). do you have another way of doing it? – goku Aug 14 '20 at 18:44
    @TrentonMcKinney I added the code which I used to write downloaded_links.txt, hope it answers your comment – goku Aug 14 '20 at 18:47
  • See [How can you profile a Python script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script), the first thing you should do when trying to optimize your code. – martineau Aug 14 '20 at 18:48
  • @juanpa.arrivillaga downloaded list is 100% non repeated elements – goku Aug 14 '20 at 18:48
  • @goku so what? What does that have to do with anything I said? Do you understand *why I said you should use a set rather than a list*? How big is `downloaded`? – juanpa.arrivillaga Aug 14 '20 at 18:50
  • @juanpa.arrivillaga it's around 1.2 mil – goku Aug 14 '20 at 18:53
  • @goku *well that definitely explains it*. **Use a `set`**. – juanpa.arrivillaga Aug 14 '20 at 18:55
  • @juanpa.arrivillaga I will use that (convert my list to set) and see how that affects the speed. – goku Aug 14 '20 at 19:00
  • it will vastly improve the performance – juanpa.arrivillaga Aug 14 '20 at 19:01
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/219837/discussion-between-goku-and-juanpa-arrivillaga). – goku Aug 14 '20 at 19:02
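
As the comments point out, the change that matters most is a one-time conversion before the loop; the loop body itself does not need to change. A minimal sketch, assuming `downloaded` has been read from downloaded_links.txt as a list, as in the question:

downloaded = set(downloaded)  # one-time conversion; each "in" check becomes an O(1) hash lookup

With a list, `i['image_url'] in downloaded` scans up to ~1.2 million entries on every one of the ~1.4 million iterations; with a set, each check is a single hash lookup, which is why the comments expect this change alone to account for most of the speed-up.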

1 Answer


As mentioned in the comments, you are better off using sets instead of lists.

import os
import re

os.system(f"sed -i '1d' {main_file_name}")  # delete the csv header in place (works with GNU sed on Linux)
a = set(line.strip().split(',')[1] for line in open(main_file_name))  # assuming image_url is the second column
downloaded = set(line.strip() for line in open(completed_file))
# "a" is already a set, so duplicate links are removed automatically
# compile the pattern once instead of passing the raw regex to re.search on every iteration
pattern = re.compile(r'img=(.*?)&', re.I)
for i in a:
    if i in downloaded:
        continue
    my_name = pattern.search(i).groups()[0]
    ...
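
If the rest of the question's logic is kept, the loop might continue along these lines. This is a sketch only, reusing the rare-collision handling from the question; the column layout and URL format are assumed to match the code above:

from random import randint

h = []        # filtered records still to download
seen = set()  # filenames already assigned
for i in a:
    if i in downloaded:
        continue
    my_name = pattern.search(i).groups()[0]
    while my_name in seen:  # same collision handling as in the question
        my_name = my_name.split('.jpg')[0] + str(randint(1, 9)) + '.jpg'
    seen.add(my_name)
    h.append({'name': my_name, 'image_url': i})

Since `a` is built as a set of URLs, the separate `seen_links` set from the question is no longer needed.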
wishmaster