# test.py

import threading
import time
import random
from itertools import count

def fib(n):
    """Return the nth Fibonacci number (naive recursion)."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

if __name__ == '__main__':
    counter = count(1)
    start_time = time.time()
    def thread_worker():
        while True:
            try:
                # Simulate downloading
                time.sleep(random.randint(5, 10))
                # Simulate some processing; takes about 0.14 ~ 0.63 seconds
                fib(n=random.randint(28, 31))
            finally:
                finished_number = next(counter)
                print('Has finished %d, the average speed is %f per second.'
                      % (finished_number, finished_number / (time.time() - start_time)))

    threads = [threading.Thread(target=thread_worker) for i in range(100)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

The above is my test script. The thread_worker function takes at most about 10.63 seconds per iteration. With 100 threads I expected a throughput of roughly 10 finishes per second, but the actual results were frustrating:

...
Has finished 839, the average speed is 1.385970 per second.
Has finished 840, the average speed is 1.386356 per second.
Has finished 841, the average speed is 1.387525 per second.
...

And if I comment out "fib(n=random.randint(28, 31))", the results are as expected:

...
Has finished 1026, the average speed is 12.982740 per second.
Has finished 1027, the average speed is 12.995230 per second.
Has finished 1028, the average speed is 13.007719 per second.
...


My question is: why is it so slow? I expected ~10 per second. How can I make it faster? The fib() function just simulates doing some processing, e.g. extracting data from a big HTML page.

redice
  • What is your question? – Marc Sep 03 '13 at 06:17
  • You can't just start an arbitrary number of threads and expect everything to work n times faster. Threads don't work like that. – Daniel Roseman Sep 03 '13 at 06:20
  • I assume your Fibonacci is there to simulate some process, but if you really wanted Fibonacci, memoization helps (and the Binet formula even more, though for big numbers you'll need longs instead of floats and run into trouble). If you ask no question we can't answer, but you could just as well put something other than Fibonacci in there to simulate something that takes a long time to execute. – usethedeathstar Sep 03 '13 at 06:21
  • You might want to check how much time the `fib` call takes. – Some programmer dude Sep 03 '13 at 06:21
  • Sorry, I have added more description to my question. – redice Sep 03 '13 at 06:29
  • If threads worked this way, everyone would run their code with a billion billion threads, and an iPhone could run the Google search backend. – user2357112 Sep 03 '13 at 06:30
  • Agree with @usethedeathstar. If you can use memoization for your use cases, it would significantly boost your performance. Of course, with memoization you are trading memory (space) for CPU (speed). – Manoj Pandey Sep 03 '13 at 06:45
  • Do you have 100 processors? That would probably speed it up some. – Crowman Sep 03 '13 at 06:45
  • [Amdahl's Law](https://en.wikipedia.org/wiki/Amdahl%27s_Law), we've heard of it. – mikołak Sep 03 '13 at 06:53
  • How can I make it faster? Does multiprocessing help? – redice Sep 03 '13 at 07:07

4 Answers

7

If I ask you to bake a cake and this takes you an hour and a half, 30 minutes for the dough and 60 minutes in the oven, then by your logic I would expect 2 cakes to take exactly the same amount of time. However, there are some things you are missing. First, if I do not tell you at the beginning to bake two cakes, you have to make the dough twice, which is now 2 times 30 minutes. So it actually takes you two hours (you are free to work on the second cake once the first is in the oven).

Now let's assume I ask you to bake four cakes. Again, I do not allow you to make the dough once and split it for four cakes; you have to make it every time. The time we would expect now is 4 times 30 minutes plus one hour for the last cake to bake. For the sake of the example, assume your wife helps you, meaning you can make the dough for two cakes in parallel. The expected time is now two hours, since each person has to bake two cakes. However, the oven you have can only fit 2 cakes at a time. So the timeline becomes: 30 minutes to make the first dough, 1 hour to bake it while you make the second dough, and after the first two cakes are done you put the next two cakes in the oven, which takes another hour. If you add up the times, you will see that it now took you 2 and a half hours.

If you take this further and I ask you for a thousand cakes, it will take you 500 and a half hours.

What has this to do with threads?

Think of making the dough as an initial computation that creates 100% CPU load. Your wife is the second core in a dual-core machine. The oven is a resource for which your program generates 50% load.

In real threading you have some overhead to start the threads (I told you to bake the cakes; you have to ask your wife for help, which takes time), and you compete for resources, i.e. memory access (you and your wife can't use the mixer at the same time). The speedup is sublinear even when the number of threads is smaller than the number of cores.

Furthermore, smart programs download their code once (make the dough once) in the main thread and then hand it to the threads; there is no need to duplicate a computation. Computing the same thing twice does not make it faster.

ted
  • Thanks a lot for your vivid explanation! – redice Sep 03 '13 at 07:18
  • Missed the point of the example program, because the Fibonacci code is perfectly scalable and in most other languages would lead to the expected linear speedup. – Voo Sep 03 '13 at 10:35
  • @Voo: I can't quite follow you. For fast Fibonacci you can either go with a direct formula (recursion-free) or you can use [memoization](http://stackoverflow.com/questions/1988804/what-is-memoization-and-how-can-i-use-it-in-python). And I would like to know of any language where you get a speedup by re-executing the same task? – ted Sep 03 '13 at 13:20
  • In any language that doesn't have a GIL or comparable the given program will have a nearly perfect linear speedup with number of cores so the explanation "because things don't scale linearly with CPU cores" while true for many things does not apply to this situation. If he wrote the same program in Java, he would see the expected result (well since his machine probably doesn't have 100 cores obviously only speedup up to N cores, but the principle is clear). That there are better ways to compute fibonacci numbers is immaterial. – Voo Sep 03 '13 at 14:03
  • @Voo: from a base perspective you are correct, but I am trying to explain that starting the threads and synchronizing them on finish takes some time. Furthermore, if both threads need to access the cache/memory, they compete for this resource. With most realistic benchmarks you usually pay a penalty for this. In short: there might be rare cases where one actually achieves a linear speedup, but usually one will stay slightly under. I agree with you that in this case most of the thread will probably run from the local cache of the CPU, thus yielding near-linear speedup. – ted Sep 03 '13 at 14:50
  • Well, there are enough embarrassingly parallel problems, and problems where you can expect at least logarithmic scaling, for this not to be academic. So I think the important part is to explain *why* this is not the case in Python, where you generally end up with slowdowns instead. – Voo Sep 03 '13 at 16:26
4

While Manoj's answer is correct, I think it needs more explanation. The Python GIL is a mutex used in CPython that essentially disables any parallel execution of Python code. It does not make threaded code slower, nor does it prevent the OS from scheduling Python threads simultaneously on all your cores. It just makes sure only one thread can execute Python bytecode at a time.

What does this mean for you? You essentially do two things:

  1. Sleep: while this call runs, no Python code is being executed; you just do nothing for 5 to 10 seconds. In the meantime any other thread can do exactly the same thing. Given that the overhead of calling time.sleep is negligible, you could have thousands of threads and it would probably still scale linearly as you expected. This is why everything works as soon as you comment out the fib line. Your average sleep time is 7.5 s, so you'd expect about 15 calculations per second.
  2. A calculation of the Fibonacci sequence: this one is the problem, because it actually executes Python code. Let's say it takes about 0.5 s per calculation. We've just seen that only one calculation can run at a time, no matter how many threads you have. Given that, you'd only get to 2 calculations per second.

Now, the measured rates are lower than 15 and 2, mainly because there is some overhead involved. First of all, you are printing data to the screen, which is almost always a surprisingly slow operation. Secondly, you're using 100 threads, which means the interpreter is constantly switching between 100 thread stacks (even while they're sleeping), and that is not a lightweight operation.

Note that threading can still be very useful, though, for example for blocking calls where the execution is not done by Python itself but by some other resource. This could be waiting for the result of a socket, a sleep like in your example, or even a calculation that is done outside of Python itself (e.g. many numpy calculations).
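Since the comments ask whether multiprocessing helps: for the CPU-bound part it does, because each worker process has its own interpreter and its own GIL. A minimal sketch (not the OP's script; the pool size of 4 and the inputs are arbitrary choices):

```python
import multiprocessing
import time


def fib(n):
    """Naive Fibonacci, standing in for CPU-bound work such as HTML parsing."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


if __name__ == '__main__':
    # Each worker process runs its own interpreter with its own GIL,
    # so these fib() calls really do execute in parallel.
    with multiprocessing.Pool(processes=4) as pool:
        start = time.time()
        results = pool.map(fib, [28, 29, 30, 31])
        print(results, 'in %.2f seconds' % (time.time() - start))
```

Note that each task has to be big enough to amortize the cost of pickling arguments and results between processes; for very small work items a pool can be slower than a plain loop.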

KillianDS
  • Wonderful!! Do you have any suggestion to speed up my script? Does multiprocessing help? – redice Sep 03 '13 at 07:19
  • @redice: It might, but it also might not. In your case, where you are doing sleeping and independent maths, it would. In the real case that you are "simulating" it may not. – Lennart Regebro Sep 03 '13 at 07:28
  • In my real case, the sleep() is downloading pages, and the fib() is extracting data from html. – redice Sep 03 '13 at 07:51
  • In that case threads will still help, because the network IO will be the largest factor there and can mostly be parallelized. – Voo Sep 03 '13 at 10:38
0

Python threads use a Global Interpreter Lock (GIL) to synchronize access to the Python interpreter's state. Compared to other threads, such as POSIX threads, the use of the GIL can make Python threads significantly slower, especially when dealing with multiple cores. This is well known. Here is a really good presentation on the subject: www.dabeaz.com/python/UnderstandingGIL.pdf
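The effect that presentation describes is easy to reproduce: two threads doing pure-Python work take roughly as long as one thread doing all of it, because the GIL serializes the bytecode. A minimal sketch (the loop count is an arbitrary choice):

```python
import threading
import time


def spin(n):
    """Pure-Python busy loop; holds the GIL for its entire duration."""
    while n > 0:
        n -= 1


N = 5_000_000

# Sequential baseline: one thread does all the work.
start = time.time()
spin(N)
spin(N)
sequential = time.time() - start

# "Parallel" version: two threads, but the GIL serializes the bytecode,
# so on CPython this typically takes about as long as the sequential run.
start = time.time()
t1 = threading.Thread(target=spin, args=(N,))
t2 = threading.Thread(target=spin, args=(N,))
t1.start()
t2.start()
t1.join()
t2.join()
threaded = time.time() - start

print('sequential: %.2fs, threaded: %.2fs' % (sequential, threaded))
```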

Manoj Pandey
  • The GIL does not by itself make threads slower. The GIL just prevents threads from running simultaneously on different cores; this is a subtle but very important distinction. – KillianDS Sep 03 '13 at 06:43
  • @KillianDS Agreed. Thank you for that clarification. My thought was that with so many threads, OP (redice) is seeing a slower speed due to GIL. Once a thread is scheduled, of course, GIL does not affect its performance! – Manoj Pandey Sep 03 '13 at 06:47
  • I wish threads did make it slower: I would rather have it actually run things simultaneously, even if slower, since the whole point of threading is I/O-bound code. I don't really care if it runs slower if it covers latency much better. There is some ridiculous stipulation in the CPython community that multithreaded improvements cannot affect single-threaded performance. Dream on. – doug65536 Dec 13 '21 at 02:33
0

You're looking for a faster solution. Memoizing results helps.

import collections.abc
import functools


class Memoized(object):
    """Decorator. Caches a function's return value each time it is called.
    If called later with the same arguments, the cached value is returned
    (not reevaluated).
    """

    def __init__(self, func):
        self.func = func
        self.cache = {}

    def __call__(self, *args):
        if not isinstance(args, collections.abc.Hashable):
            # uncacheable, e.g. a list argument;
            # better not to cache than to blow up.
            return self.func(*args)
        if args in self.cache:
            return self.cache[args]
        else:
            value = self.func(*args)
            self.cache[args] = value
            return value

    def __repr__(self):
        """Return the function's docstring."""
        return self.func.__doc__

    def __get__(self, obj, objtype):
        """Support instance methods."""
        return functools.partial(self.__call__, obj)


if __name__ == '__main__':
    @Memoized
    def fibonacci(n):
        """Return the nth fibonacci number

        :param n: value
        """
        if n in (0, 1):
            return n
        return fibonacci(n - 1) + fibonacci(n - 2)

    print(fibonacci(35))

Try to run it with and without the @Memoized decorator.

Recipe taken from http://wiki.python.org/moin/PythonDecoratorLibrary#Memoize.
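As a side note, on Python 3.2+ the standard library ships the same idea as functools.lru_cache, so the hand-rolled decorator can be replaced with a one-liner:

```python
import functools


@functools.lru_cache(maxsize=None)
def fibonacci(n):
    """Return the nth Fibonacci number, caching every result."""
    if n in (0, 1):
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


print(fibonacci(35))  # prints 9227465
```

With the cache, each value is computed exactly once, so the call finishes in milliseconds instead of seconds.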

Matthias