
This code generates 13k URLs and then checks whether each response is 200. I have to check each one because there is no other way to know whether the URL exists. If the response is 200, it proceeds to download an image.

It also checks whether the image already exists in the folder. It works, but it takes hours to finish.

I'm using requests, shutil, os and tqdm. I'm new to Python; while researching I found asyncio and aiohttp and watched a couple of tutorials, but I didn't manage to make them work.

import requests
from tqdm import tqdm

def downloader(urls):
    for url in tqdm(urls):
        r = requests.get(url, headers=headers, stream=True)
        if r.status_code == 200:
            name, path = get_name_path(url)
            if not check_dupe(name):  # skip images already in the folder
                save_file(path, r)

folder_path = create_dir()
urls = generate_links()
downloader(urls)
  • Lazy solution: Use a `multiprocessing.dummy.Pool` (`.dummy` makes it backed by threads) and factor out the contents of the loop into a function that you can call the `Pool`'s `.imap_unordered` with. Since the work is largely I/O bound, threading will work even on the GIL-constrained CPython reference interpreter (see the sketch after these comments). – ShadowRanger Aug 30 '21 at 18:46
  • `get_name_path(i)` and `check_dupe(name)` are likely to be cheap. Do them before you do anything "remote". Once you have a candidate, call `requests.head()` to determine the status rather than `requests.get()`. If you have a candidate to save and a 200 response you can then `get()` the body/image to `save_file(path, r)` – JonSG Aug 30 '21 at 19:26
  • Related? [How to speed up API requests?](/q/34512646/4518341) – wjandrea Aug 30 '21 at 20:56
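
Putting the two suggestions above together, here is a minimal, untested sketch of the thread-pool approach. It reuses the headers, get_name_path, check_dupe and save_file helpers from the question's code; fetch_one and downloader_threaded are names made up for this sketch.

from multiprocessing.dummy import Pool  # thread-backed Pool, fine for I/O-bound work
import requests
from tqdm import tqdm

def fetch_one(url):
    # Cheap local checks first, then a HEAD request before the full download.
    name, path = get_name_path(url)
    if check_dupe(name):
        return
    # Note: requests.head() does not follow redirects by default;
    # pass allow_redirects=True if these URLs redirect.
    if requests.head(url, headers=headers).status_code != 200:
        return
    r = requests.get(url, headers=headers, stream=True)
    if r.status_code == 200:
        save_file(path, r)

def downloader_threaded(urls, workers=10):
    with Pool(workers) as pool:
        # imap_unordered yields results as each task finishes, so tqdm keeps moving
        for _ in tqdm(pool.imap_unordered(fetch_one, urls), total=len(urls)):
            pass

Since the work is I/O-bound, the thread count can go well beyond the CPU count; starting around 10-20 workers and watching for server-side rate limiting is a reasonable first try.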

1 Answer

You can also use Python Ray.

You can follow these steps. First, choose a number of workers, for example 10:

num_workers = 10

Then distribute the URLs into that many lists, one per worker; you can use NumPy's np.array_split function for that:

import numpy as np
distributed_urls = np.array_split(urls, num_workers)

Start Ray with:

import ray
ray.init(num_cpus=num_workers)

Define the download work as a Ray remote function:

@ray.remote(max_calls=1)
def worker(urls_ls):
    downloader(urls=urls_ls)
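
(max_calls=1 simply tells Ray to start a fresh worker process for each call, which is mainly useful to release resources between tasks; it is optional here.) Then submit one remote task per chunk of URLs: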

all_workers = []
for index in range(num_workers):
    all_workers.append(worker.remote(distributed_urls[index]))

ray.get(all_workers)
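
ray.get blocks until every remote task has finished; since downloader returns nothing, this line simply waits for all the downloads to complete.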

By doing this, you distribute the workload across 10 different workers. You can assign any number of workers depending on the resources you have available.

You can find more details here: https://ray.io/
