
I have spent some time trying to find the best and fastest way to get the HTTP status codes of a huge list of URLs, but I have made no progress.

Here is my code:

import multiprocessing
import time

import requests

def check(url):
    """Send a HEAD request to url and return the HTTP status code as a string."""
    try:
        response = requests.head(url)
    except requests.exceptions.RequestException:
        return "404"
    return str(response.status_code)

def multiprocessing_func():
    url_list = [
        # A huge list of URLs
    ]
    pool = multiprocessing.Pool()
    start = time.time()
    pool.map(check, url_list)
    done = time.time()
    print("time: {}".format(done - start))

if __name__ == "__main__":
    multiprocessing_func()

My laptop is a little bit slow, but here is what I see:

when url_list has 1 URL, it takes 6 seconds,

with 8 URLs, it takes 10 seconds,

with 32 URLs, it takes 24 seconds,

with 128 URLs, it takes 77 seconds, and so on...

Why does the time keep growing even though I am using multiprocessing?

I thought it would take roughly 6 or 7 seconds (about the same as for a single URL), since the requests should run in parallel.

What did I do wrong?

How can I do this as fast as possible (suppose I have a list of 10,000 URLs)?

Any suggestion would be appreciated.

Best regards.

shahab

2 Answers


You are going to want to use the asyncio and aiohttp modules for this. threading works as well, though it tends to be slightly slower (see the sketch after the async example below).

import asyncio
import aiohttp

async def check(url, session: aiohttp.ClientSession):
    """Fetch url with the shared session and return its HTTP status code."""
    async with session.get(url) as response:
        return response.status

async def main():
    url_list = [
        # your huge list of URLs :)
    ]
    tasks = []
    # Reuse one ClientSession for all requests so connections can be pooled.
    async with aiohttp.ClientSession() as session:
        for url in url_list:
            tasks.append(asyncio.create_task(check(url, session)))
        return await asyncio.gather(*tasks)

statuses = asyncio.run(main())

If you don't know what is happening here, then I suggest you read the asyncio and aiohttp documentation.
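To illustrate the threading route mentioned at the top of this answer, here is a minimal sketch using concurrent.futures.ThreadPoolExecutor with requests; the max_workers value of 50 is just an assumption you would tune to your machine and network.

import concurrent.futures
import requests

def check(url):
    """Send a HEAD request and return the status code, or None on failure."""
    try:
        return requests.head(url, timeout=10).status_code
    except requests.exceptions.RequestException:
        return None

url_list = [
    # your huge list of URLs
]

# Threads suit this workload because it is network-bound, not CPU-bound.
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    statuses = list(executor.map(check, url_list))

Note that executor.map returns results in the same order as url_list, which makes it easy to pair each URL with its status afterwards.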

A random coder
  • Thank you for your answer, but with this code, when I double the URLs, the execution time doubles: for 32 items it took 26 seconds and for 64 items 50 seconds. – shahab Dec 03 '20 at 21:36
  • That I can't do anything about. The way the requests are structured, your big O is O(n) (i.e. when the input doubles, the execution time doubles). – A random coder Dec 04 '20 at 23:30
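On the comment above about execution time still doubling: with thousands of URLs the bottleneck is usually how many requests are in flight at once, so it can help to set aiohttp's connection limit explicitly and to use HEAD requests (as in the original question) instead of GET. A rough sketch, where limit=200 is only an assumed example value you would need to tune:

import asyncio
import aiohttp

async def check(url, session):
    """Return the HTTP status for url, or None if the request fails."""
    try:
        async with session.head(url) as response:
            return response.status
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return None

async def check_all(url_list):
    # Cap the number of simultaneous connections; 200 is an arbitrary example.
    connector = aiohttp.TCPConnector(limit=200)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(check(url, session) for url in url_list))

# statuses = asyncio.run(check_all(url_list))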

See async Python calls with aiohttp.

Python threading is not efficient due to the GIL.

Avihay Tsayeg