
I need to check at least 20k URLs to see whether each one is up, and save some data about it in a database.

I already know how to check whether a URL is online and how to save data to the database, but without concurrency it will take ages to check them all. What is the fastest way to check thousands of URLs?

I am following this tutorial: https://realpython.com/python-concurrency/ and it seems that the "CPU-Bound multiprocessing Version" is the fastest approach, but I want to know whether it really is, or whether there are better options.

Edit:

Based on the replies, I am updating the post with a comparison of multiprocessing and multithreading.

Example 1: Print "Hello!" 40 times

Threading

  • With 1 thread: 20.152419090270996 seconds
  • With 2 threads: 10.061403036117554 seconds
  • With 4 threads: 5.040558815002441 seconds
  • With 8 threads: 2.515489101409912 seconds

Multiprocessing with 8 cores:

  • It took 3.1343798637390137 seconds

With 8 threads, threading beats multiprocessing.

Example 2: the problem posed in my question

After several tests: with more than 12 threads, threading is faster. For example, to test 40 URLs, threading with 40 threads is about 50% faster than multiprocessing with 8 cores.
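
A minimal sketch of how this kind of comparison can be set up (the sleep-based dummy task and worker counts are placeholders, not the exact code behind the numbers above):

import time
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool


def dummy_task(_):
    # placeholder for the real work (printing "Hello!" or checking one URL)
    time.sleep(0.5)


if __name__ == "__main__":
    tasks = range(40)

    start = time.time()
    with ThreadPoolExecutor(max_workers=8) as executor:
        list(executor.map(dummy_task, tasks))
    print(f"threading (8 workers): {time.time() - start:.2f} seconds")

    start = time.time()
    with Pool(processes=8) as pool:
        pool.map(dummy_task, tasks)
    print(f"multiprocessing (8 processes): {time.time() - start:.2f} seconds")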

Thanks for your help

redunicorn

3 Answers


I think you should use a pool: pool docs

Based on some results here: mp vs threading SO

I would say always use multiprocessing. Perhaps if you expect your requests to take a long time to resolve, the context-switching benefits of threads would overcome the brute force of multiprocessing.

Something like:

import multiprocessing as mp

urls = ['google.com', 'yahoo.com']

if __name__ == "__main__":
    # fetch_data is your own function that checks one URL and returns the data to save
    with mp.Pool(mp.cpu_count()) as pool:
        results = pool.map(fetch_data, urls)

Edit: to address the comments about a fixed number of subprocesses, I've shown how to request a number of processes equal to your number of logical cores.

CircArgs
  • I would remove `processes=4` so that `Pool` can get the value from `os.cpu_count()` – F. Leone Oct 17 '19 at 11:56
  • @CircArgs you need to specify that mp can be slower than multithreading, and even a one-thread/one-process program can win, when there are no heavy computations. – Artiom Kozyrev Oct 17 '19 at 12:49
  • @Artiom Kozyrev threads share IO scheduling, which with thousands of requests will probably be a bottleneck. While processes take longer to spin up, if this data fetching is at all significant it will probably be faster, and definitely simpler, to use multiprocessing – CircArgs Oct 17 '19 at 13:01
  • @CircArgs in the described situation all the heavy load is on the server side and the mp module will not make it faster; all your processes will do is wait for answers from the servers. If the OP needs to do some computations with the data he receives from the servers, he should first receive the answers and only then use the mp module to work with the received data. – Artiom Kozyrev Oct 17 '19 at 13:05
  • Thanks for your reply, I am checking all the information that you gave me. What do you think about the other reply? Since I won't be writing to the same memory space, maybe threading will be better in this case. I am just asking; I will run some tests to compare both solutions – redunicorn Oct 17 '19 at 21:43
  • @redunicorn if you go with threads don't reinvent the wheel. For your case you can use ThreadPoolExecutor – CircArgs Oct 17 '19 at 21:56
  • @CircArgs Thanks for your suggestion, I will use it and update it with the results – redunicorn Oct 17 '19 at 22:46
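
For reference, a minimal sketch of the ThreadPoolExecutor approach mentioned in the comments (the check_url function, the use of requests, and the URL list are placeholder assumptions, not code from the answer):

from concurrent.futures import ThreadPoolExecutor

import requests


def check_url(url):
    """Placeholder: return True if the URL answers with a non-error status."""
    try:
        return requests.head(url, timeout=5).status_code < 400
    except requests.RequestException:
        return False


if __name__ == "__main__":
    urls = ["https://google.com", "https://yahoo.com"]  # your 20k URLs go here
    with ThreadPoolExecutor(max_workers=40) as executor:
        results = list(executor.map(check_url, urls))
    print(dict(zip(urls, results)))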

To say that multiprocessing is always the best choice is incorrect: multiprocessing is best only for heavy computations!

The best choice for tasks which do not require heavy computations, but only IN/OUT operations like database requests or requests to a remote web app API, is the threading module. Threading can be faster than multiprocessing since multiprocessing needs to serialize data to send it to the child process, while threads use the same memory space.

Threading module

The typical approach in this case is to create an input queue.Queue, put the tasks into it (URLs in your case), and create several workers to take tasks from the Queue:

import threading as thr
from queue import Queue


def work(input_q):
    """the function take task from input_q and print or return with some code changes (if you want)"""
    while True:
        item = input_q.get()
        if item == "STOP":
            break

        # else do some work here
        print("some result")


if __name__ == "__main__":
    input_q = Queue()
    urls = [...]
    threads_number = 8
    workers = [thr.Thread(target=work, args=(input_q,),) for i in range(threads_number)]
    # start workers here
    for w in workers:
        w.start()

    # start delivering tasks to workers 
    for task in urls:
        input_q.put(task)

    # "poison pillow" for all workers to stop them:

    for i in range(threads_number):
        input_q.put("STOP")

    # join all workers to main thread here:

    for w in workers:
        w.join()

    # show that main thread can continue

    print("Job is done.")
Artiom Kozyrev
  • Thanks for your reply! It looks very good, I think that you have more knowledge than me, but before accepting any answer I am going to compare both solutions to see which takes less time. Just one question: since all threads can only run on one core, how can I know the maximum number of threads that the server can handle? – redunicorn Oct 17 '19 at 21:48
  • @redunicorn do not worry about the cores and CPU of your PC when you do IO operations. You can run even 1000 threads if you have 1000 distinct URLs. All the load is on the servers' side. Experiment with the number of threads to choose the best number for your case. You can do the following experiment to understand what is better in a specific situation: create a function which prints Hello 5 times; try the single-thread approach (just print Hello 40 times), then create e.g. 8 threads which print Hello 5 times each, then do the same using the mp module. Probably the fastest will be single-thread or multithreaded. – Artiom Kozyrev Oct 18 '19 at 06:17
  • @redunicorn then do another experiment: write a function which prints Hello and waits 0.5 seconds (repeated 5 times in the function). The single-thread version prints Hello and waits 0.5 seconds, 40 times. The fastest will be 8 threads, the second will be 8 processes via the mp module. The 3rd experiment is to create a function which does print(Hello), then finds the sum of the squares of all numbers in a huge range like 1,000,000 or even more, and then prints it. The best one will be mp in that case. – Artiom Kozyrev Oct 18 '19 at 06:23
  • @redunicorn And several words about mp.Pool and ThreadPoolExecutor: they are simplified APIs for the simplest case, when you have several absolutely independent tasks, but when you need any communication between threads or processes you should go with threading.Thread or multiprocessing.Process, depending on whether you need IO or heavy computations. – Artiom Kozyrev Oct 18 '19 at 06:26
  • Hi @artiom-kozyrev, I updated my question with the results from my tests. The threading version is much faster, thanks for your help and explanations! Just one more quick question where you can help me: I am using MySQL to save the data and I am creating one cursor for each URL that is tested in a thread; after I test the URL I save the data and close the connection (I previously increased the max MySQL connections to 1000) – redunicorn Oct 19 '19 at 06:42
  • and I am creating exactly the same number of threads as URLs (my first test is with 5k URLs). I am getting timeout errors when connecting to MySQL (it seems to happen with more than 900 threads). Does that mean that I can only open 900 threads at most? If so, how can I know the maximum number of threads that I can open? Just by trying, or is there a more professional way to check it? I will leave the full log of my error here: https://pastebin.com/0vEFyEytcc – redunicorn Oct 19 '19 at 06:43
  • @redunicorn Unfortunately I can't open your link, I get "This page is no longer available. It has either expired, been removed by its creator, or removed by one of the Pastebin staff." The timeout errors are not a problem of your computer, but of the database, which can't handle more than the defined maximum number of simultaneous connections. You have two options: decrease the number of threads, or use threading.Semaphore to limit the number of threads which are allowed to write to the db simultaneously. – Artiom Kozyrev Oct 19 '19 at 16:12
  • @redunicorn So when your thread gets the answer from the remote server, use the `with s: <then what you want to do>` construction, where s is the semaphore; you should provide the semaphore object in the args of the function you put in a thread. The semaphore is designed to limit the number of threads which can use a resource simultaneously. If you need an example of how to use a semaphore, feel free to ask me. – Artiom Kozyrev Oct 19 '19 at 16:15
  • @artiom-kozyrev Sorry for the delay but I couldn't check it until now. About the pastebin, that's weird... maybe I deleted it by mistake... I have been checking your semaphore recommendation, and although it took me a bit to understand it (because I also faced some issues managing database connections), I think I finally got a working version: https://pastebin.com/EBJquPfN I am not sure if I did it the right way; if you could take a look I would be grateful. And one question about the semaphore, bearing in mind that I made a MySQL pool with a size of 32 – redunicorn Oct 28 '19 at 16:27
  • @artiom-kozyrev Do you think it would be good to increase the semaphore to 32? – redunicorn Oct 28 '19 at 16:28
  • @redunicorn if your threads work with URLs on different web servers, you can even go without a semaphore; a semaphore is usually used when you work with, e.g., one db which can't cope with more than n connections simultaneously – Artiom Kozyrev Oct 28 '19 at 17:10
  • @redunicorn I checked the link; if you want to put the answer to a request into the db, you should choose the number of threads for your concrete db; if your db can cope with 32 attempts simultaneously, you can choose 32 for the semaphore – Artiom Kozyrev Oct 28 '19 at 17:16
  • @artiom-kozyrev If I increase the semaphore to 32 with 100 URLs there isn't any problem. But if I increase the URLs to 4k it gives me this error: https://pastebin.com/JpE8UQ9e It seems that it can't handle the MySQL pool properly. I have tried implementing it without semaphores: https://pastebin.com/aZb0tEBy (which is similar to the semaphore version) but it gives me timeout errors: https://pastebin.com/LvAFagzq – redunicorn Oct 29 '19 at 12:21
  • @redunicorn the timeout error indicates that your db can't handle some connections; probably it is busy with other attempts. You can limit the total number of threads or create a semaphore with an appropriate thread number (to have access to the db) – Artiom Kozyrev Oct 29 '19 at 17:31
  • @redunicorn consider two points: you have 1000 distinct URLs, so you can send 1000 requests in different threads; when an answer is received you can send it to a Queue, and then an output thread takes it from the Queue and sends it to the db. I guess that the db is much faster than the remote sites. – Artiom Kozyrev Oct 29 '19 at 17:36
  • Nice answer, upvoted! Also, starting from Python 3.10, at least using the Python C API it will be possible to start multiple sub-interpreters within one process, meaning that you'll be able to use threads instead of processes; also there will be no global GIL, but one local to each sub-interpreter. – Arty May 06 '21 at 13:46
  • @Arty I use `aiohttp` for such tasks a lot now, and also mix `aiohttp` with `multiprocessing`, where child processes are used to hold separate instances of `aiohttp`, or, if I have some CPU-bound tasks, child processes are used as consumers in a producer (`aiohttp` server) / consumers setup. It is very interesting news that Python 3.10 will allow separate interpreters in different threads; it looks like an alternative to the `mp` module. I started playing around with C recently with the desire to write some extensions to Python in C, and to try Golang next, since I do backend at work and Go looks like a good choice for it – Artiom Kozyrev May 06 '21 at 16:12
  • @ArtiomKozyrev At first this feature of thread-level sub-interpreters and individual GILs will be available in the Python C API. Currently this feature is experimental and only usable if you recompile Python from source with the C define `#define EXPERIMENTAL_ISOLATED_SUBINTERPRETERS 1` enabled. But after some testing I'm sure that Python 3.10 or maybe 3.11 will have this feature in a release Python. But already right now you can download the Python sources, define `#define EXPERIMENTAL_ISOLATED_SUBINTERPRETERS 1`, compile it, and the feature will work. – Arty May 06 '21 at 16:49
  • @Arty I have not tried to write any Python extensions in C yet, but it looks like a very interesting feature. – Artiom Kozyrev May 06 '21 at 16:54
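
Following up on the semaphore discussion in the comments above, here is a minimal sketch of limiting simultaneous database writes with threading.Semaphore (check_url and save_to_db are placeholders for the real URL check and MySQL insert; 32 mirrors the connection-pool size mentioned above):

import threading
import time
from queue import Queue

DB_SEMAPHORE = threading.Semaphore(32)  # at most 32 threads hold a DB connection at once


def check_url(url):
    """Placeholder for the real URL check."""
    time.sleep(0.1)
    return {"url": url, "up": True}


def save_to_db(row):
    """Placeholder for the real MySQL insert."""
    print("saved", row)


def worker(input_q):
    while True:
        url = input_q.get()
        if url == "STOP":
            break
        row = check_url(url)
        with DB_SEMAPHORE:  # only 32 workers may talk to the database simultaneously
            save_to_db(row)


if __name__ == "__main__":
    input_q = Queue()
    urls = ["https://google.com", "https://yahoo.com"]  # your real URL list
    workers = [threading.Thread(target=worker, args=(input_q,)) for _ in range(100)]
    for w in workers:
        w.start()
    for url in urls:
        input_q.put(url)
    for _ in workers:
        input_q.put("STOP")
    for w in workers:
        w.join()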

I currently use multiprocessing with Queues; it works fast enough for what I use it for.

Similar to Artiom's solution above, I set the number of processes to 80 (currently), use the "workers" to pull the data and send it to the queues, and once finished, go through the returned results and handle them depending on the queue.
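
The answer's actual code isn't shown, but a rough sketch of that pattern might look like this (the fetch function, queue layout and worker count are assumptions):

import multiprocessing as mp


def fetch(url):
    """Placeholder for the real data-pulling function."""
    return url, "some data"


def worker(task_q, result_q):
    # pull URLs from the task queue until the sentinel arrives
    while True:
        url = task_q.get()
        if url is None:
            break
        result_q.put(fetch(url))


if __name__ == "__main__":
    task_q, result_q = mp.Queue(), mp.Queue()
    urls = ["https://google.com", "https://yahoo.com"]
    n_workers = 8  # the answer uses 80; pick what your machine and network can handle

    procs = [mp.Process(target=worker, args=(task_q, result_q)) for _ in range(n_workers)]
    for p in procs:
        p.start()
    for url in urls:
        task_q.put(url)
    for _ in procs:
        task_q.put(None)  # one sentinel per worker

    # drain one result per URL before joining so the workers can flush their queues
    results = [result_q.get() for _ in urls]
    for p in procs:
        p.join()

    for url, data in results:
        print(url, data)  # handle each result here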

Sam