
I have spent some time trying to find the best and fastest way to get the HTTP status codes of a huge list of URLs, but I have made no progress.

Here is my code:

import multiprocessing
import time

import requests

def check(url):
    """Send a HEAD request to url and return the HTTP status code as a string."""
    try:
        response = requests.head(url)
    except requests.exceptions.RequestException:
        return "404"
    return str(response.status_code)

def multiprocessing_func():
    url_list = [
        # A huge list of URLs
    ]
    pool = multiprocessing.Pool()
    start = time.time()
    pool.map(check, url_list)
    done = time.time()
    print("time: {}".format(done - start))

if __name__ == "__main__":
    multiprocessing_func()

My laptop is a little bit slow, but here is what I see:

when url_list has 1 URL, it takes 6 seconds,

with 8 URLs, it takes 10 seconds,

with 32 URLs, it takes 24 seconds,

with 128 URLs, it takes 77 seconds, and so on...

Why does the time keep growing even though I am using multiprocessing?

I thought it would take roughly 6 or 7 seconds (about the same as for a single URL), since the requests should run in parallel.

What did I do wrong?

How can I do this as fast as possible (suppose I have a list of 10,000 URLs)?

Any suggestion would be appreciated.

Best regards.

shahab

2 Answers


You are going to want to use the asyncio and aiohttp modules for this. threading works as well, though it tends to be slightly slower (see the sketch after the async example below).

import asyncio
import aiohttp

async def check(url, session: aiohttp.ClientSession):
    """Fetch url with the shared session and return its HTTP status code."""
    async with session.get(url) as response:
        return response.status

async def main():
    url_list = [
        # your huge list of URLs :)
    ]
    tasks = []
    # Reuse one ClientSession for all requests so connections can be pooled.
    async with aiohttp.ClientSession() as session:
        for url in url_list:
            tasks.append(asyncio.create_task(check(url, session)))
        return await asyncio.gather(*tasks)

statuses = asyncio.run(main())

If you don't know what is happening here, then I suggest you read the asyncio and aiohttp documentation.
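To illustrate the threading route mentioned at the top of this answer, here is a minimal sketch using concurrent.futures.ThreadPoolExecutor with requests; the max_workers value of 50 is just an assumption you would tune to your machine and network.

import concurrent.futures
import requests

def check(url):
    """Send a HEAD request and return the status code, or None on failure."""
    try:
        return requests.head(url, timeout=10).status_code
    except requests.exceptions.RequestException:
        return None

url_list = [
    # your huge list of URLs
]

# Threads suit this workload because it is network-bound, not CPU-bound.
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    statuses = list(executor.map(check, url_list))

Note that executor.map returns results in the same order as url_list, which makes it easy to pair each URL with its status afterwards.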

A random coder
  • Thank you for your answer, but with this code, when I double the URLs, the execution time doubles: for 32 items it took 26 seconds and for 64 items 50 seconds. – shahab Dec 03 '20 at 21:36
  • That I can't do anything about. The way the requests are structured, your big O is O(n) (i.e. when the input doubles, the execution time doubles). – A random coder Dec 04 '20 at 23:30
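On the comment above about execution time still doubling: with thousands of URLs the bottleneck is usually how many requests are in flight at once, so it can help to set aiohttp's connection limit explicitly and to use HEAD requests (as in the original question) instead of GET. A rough sketch, where limit=200 is only an assumed example value you would need to tune:

import asyncio
import aiohttp

async def check(url, session):
    """Return the HTTP status for url, or None if the request fails."""
    try:
        async with session.head(url) as response:
            return response.status
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return None

async def check_all(url_list):
    # Cap the number of simultaneous connections; 200 is an arbitrary example.
    connector = aiohttp.TCPConnector(limit=200)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(check(url, session) for url in url_list))

# statuses = asyncio.run(check_all(url_list))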

See async Python calls with aiohttp.

Python threading is not efficient due to the GIL.

Avihay Tsayeg