
I have a CSV file containing around 1.4 million image links which I want to download. I want to remove repeated links from the CSV, then assign a unique filename to each one (there is an ID in the image link which I use for this).

Some of the images have been downloaded and I have saved their links in a text file.

import csv
import os

completed_file = 'downloaded_links.txt'  # links that have already been downloaded, one per line
if os.path.isfile(completed_file):
    with open(completed_file) as f:
        downloaded = f.read().split('\n')[:-1]
else:
    downloaded = []

main_file_name = 'all_images.csv'
with open(main_file_name) as f:
    a = [{k: v for k, v in row.items()} for row in csv.DictReader(f, skipinitialspace=True)]  # one dict per csv row

This is the loop where I am filtering the links

from random import randint
import re

h = []  # list of filtered dicts
seen = set()  # unique names
seen_links = set()  #unique links
for i in a:
    if i['image_url'] in downloaded:
        continue
    if i['image_url'] in seen_links:
        continue
    seen_links.add(i['image_url'])
    my_name = re.search(r'img=(.*?)&', i['image_url'], re.I).groups()[0]
    while my_name in seen:
        temp = my_name.split('.jpg')
        my_name = temp[0] + str(randint(1, 9)) + '.jpg'
    seen.add(my_name)
    di = {'name': my_name, 'image_url': i['image_url']}
    h.append(di)

The loop does exactly what I want (skip already downloaded links and assign unique filenames to the new ones), but it is taking more than 3 hours to do so. What can I do to speed it up, or how can I rewrite the logic so it runs faster?

This is how I write to downloaded_links.txt

with open(completed_file, 'w') as f:  #downloaded is the list containing downloaded links
    for i in downloaded:
        f.write(f'{i}\n')
goku
    Please include an example of the files as text. Always [Provide a Minimal, Reproducible Example (e.g. code, data, errors, current output, expected output), as text](https://stackoverflow.com/help/minimal-reproducible-example). – Trenton McKinney Aug 14 '20 at 18:32
  • 2
    As an aside, `{k: v for k, v in row.items()}` is a wordy way of writing `dict(row)`, which fundamentally makes your whole list comprehension an overengineered way of writing `a = list(csv.DictReader(f, skipinitialspace=True))` – juanpa.arrivillaga Aug 14 '20 at 18:38
  • `my_name = temp[0] + str(randint(1, 9)) + '.jpg'` seems to me like it would not be a good way to go about things. As the number of new names you have to generate increases, the probability of generating a unique name will fall. Can you give an example of the sort of data you are working with? – juanpa.arrivillaga Aug 14 '20 at 18:41
    Also, this: `if i['image_url'] in downloaded:` will be very slow. `downloaded` should be a set, not a list (a sketch of this change follows the comment thread). – juanpa.arrivillaga Aug 14 '20 at 18:42
  • @juanpa.arrivillaga well the link itself contains an ID for the image (98% of the time it is unique, the while loop is to make sure the 2% gets handled). do you have another way of doing it? – goku Aug 14 '20 at 18:44
    @TrentonMcKinney I added the code which I used to write downloaded_links.txt, hope it answers your comment – goku Aug 14 '20 at 18:47
  • See [How can you profile a Python script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script), the first thing you should do when trying to optimize your code. – martineau Aug 14 '20 at 18:48
  • @juanpa.arrivillaga downloaded list is 100% non repeated elements – goku Aug 14 '20 at 18:48
  • @goku so what? What does that have to do with anything I said? Do you understand *why I said you should use a set rather than a list*? How big is `downloaded`? – juanpa.arrivillaga Aug 14 '20 at 18:50
  • @juanpa.arrivillaga it's around 1.2 mil – goku Aug 14 '20 at 18:53
  • @goku *well that definitely explains it*. **Use a `set`**. – juanpa.arrivillaga Aug 14 '20 at 18:55
  • @juanpa.arrivillaga I will use that (convert my list to set) and see how that affects the speed. – goku Aug 14 '20 at 19:00
  • it will vastly improve the performance – juanpa.arrivillaga Aug 14 '20 at 19:01
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/219837/discussion-between-goku-and-juanpa-arrivillaga). – goku Aug 14 '20 at 19:02
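
As the comments point out, the change that matters most is a one-time conversion before the loop; the loop body itself does not need to change. A minimal sketch, assuming `downloaded` has been read from downloaded_links.txt as a list, as in the question:

downloaded = set(downloaded)  # one-time conversion; each "in" check becomes an O(1) hash lookup

With a list, `i['image_url'] in downloaded` scans up to ~1.2 million entries on every one of the ~1.4 million iterations; with a set, each check is a single hash lookup, which is why the comments expect this change alone to account for most of the speed-up.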

1 Answer


As mentioned in the comments, you are better off using sets instead of lists.

import os
import re

os.system(f"sed -i '1d' {main_file_name}")  # delete the csv header in place (works with GNU sed on Linux)
a = set(line.strip().split(',')[1] for line in open(main_file_name))  # assuming image_url is the second column
downloaded = set(line.strip() for line in open(completed_file))
# "a" is already a set, so duplicate links are removed automatically
# compile the pattern once instead of passing the raw regex to re.search on every iteration
pattern = re.compile(r'img=(.*?)&', re.I)
for i in a:
    if i in downloaded:
        continue
    my_name = pattern.search(i).groups()[0]
    ...
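
If the rest of the question's logic is kept, the loop might continue along these lines. This is a sketch only, reusing the rare-collision handling from the question; the column layout and URL format are assumed to match the code above:

from random import randint

h = []        # filtered records still to download
seen = set()  # filenames already assigned
for i in a:
    if i in downloaded:
        continue
    my_name = pattern.search(i).groups()[0]
    while my_name in seen:  # same collision handling as in the question
        my_name = my_name.split('.jpg')[0] + str(randint(1, 9)) + '.jpg'
    seen.add(my_name)
    h.append({'name': my_name, 'image_url': i})

Since `a` is built as a set of URLs, the separate `seen_links` set from the question is no longer needed.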
wishmaster