
I have a simple Monte-Carlo Pi computation program. I ran it on 2 different boxes (same hardware, slightly different kernel versions) and I am seeing a significant performance drop in one case (twice the time). Without threads, performance is mostly the same. Profiling the programs indicated that the slower one spends more time per futex call.

  • Is this related to any kernel parameters?
  • Can CPU flags affect futex performance? /proc/cpuinfo indicates that the CPU flags are slightly different.
  • Is this in some way related to the Python version?

Linux 3.10.0-123.20.1 (Red Hat 4.4.7-16), Python 2.6.6

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
99.69   53.229549           5  10792796   5385605 futex

Profile Output
============== 
256 function calls in 26.189 CPU seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   39   26.186    0.671   26.186    0.671 :0(acquire)

Linux 3.10.0-514.26.2 (Red Hat 4.8.5-11), Python 2.7.5

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.69   94.281979           8  11620358   5646413 futex

Profile Output
==============
259 function calls in 53.448 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 38   53.445    1.406   53.445    1.406 :0(acquire)

Test Program

import random
import math
import time
import threading
import sys
import profile

def find_pi(tid, n):
    t0 = time.time()
    in_circle = 0
    for i in range(n):
        x = random.random()
        y = random.random()

        dist = math.sqrt(pow(x, 2) + pow(y, 2))
        if dist < 1:
            in_circle += 1

    pi = 4.0 * (float(in_circle)/float(n))
    print 'Pi=%s - thread(%s) time=%.3f sec' % (pi, tid, time.time() - t0)
    return pi

def main():
    if len(sys.argv) > 1:
        n = int(sys.argv[1])
    else:
        n = 6000000

    t0 = time.time()
    threads = []
    num_threads = 5
    print 'n =', n
    for tid in range(num_threads):
        t = threading.Thread(target=find_pi, args=(tid, n))
        threads.append(t)
        t.start()

    for t in threads:
        t.join()

#main()
profile.run('main()')
#profile.run('find_pi(1, 6000000)')
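For contrast, the same estimator can be sketched with multiprocessing (not part of the original question): each worker is a separate interpreter process, so there is no GIL handoff and hence no futex contention on the interpreter lock. The function names here are illustrative.

```python
import multiprocessing
import random

def count_in_circle(n, queue):
    # Re-seed per process so each worker draws a different sample stream.
    random.seed()
    in_circle = 0
    for _ in range(n):
        x = random.random()
        y = random.random()
        if x * x + y * y < 1.0:
            in_circle += 1
    queue.put(in_circle)

def find_pi_mp(n, num_procs=5):
    # Split n samples evenly across num_procs worker processes and
    # combine their hit counts through a Queue.
    queue = multiprocessing.Queue()
    per_proc = n // num_procs
    procs = [multiprocessing.Process(target=count_in_circle,
                                     args=(per_proc, queue))
             for _ in range(num_procs)]
    for p in procs:
        p.start()
    total = sum(queue.get() for _ in procs)
    for p in procs:
        p.join()
    return 4.0 * total / (per_proc * num_procs)

if __name__ == '__main__':
    print('Pi=%s' % find_pi_mp(1000000))
```

An strace of this variant should show far fewer futex calls than the threaded version, since the workers never compete for a shared lock.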
    problem is that GIL prevents pure python threads to execute simultaneously. If you want performance gain use multiprocessing instead. – Jean-François Fabre Nov 01 '17 at 09:24
    Here I am comparing same python program in 2 identical hardware and slightly different kernel and one of them is taking twice the time and that time seems to be spend in futex. – Anoop Nov 01 '17 at 09:28
  • How did you run your benchmark? Did you run it many times? – cdarke Nov 01 '17 at 09:34
  • I'm just saying that multithreading won't help you much here. So comparing 2 underperformant programs is pointless. Migrate to multiprocessing, you'll see improvement. – Jean-François Fabre Nov 01 '17 at 09:36
    Goal is not to optimize python program, but to understand the reason for futex performance drop between 2 python/kernel versions. – Anoop Nov 01 '17 at 09:52
  • @cdarke - I ran it multiple times and every time, the one running 2.7 is taking twice the time. – Anoop Nov 01 '17 at 09:54
  • Is `num_threads` in your test run really `1`? If so, can you share your measurements in case your abandon `thread` and just call `find_pi` directly from the `main`? It is interesting both whether there still is 2x drop and how the time without thread compares to the time with thread. – SergGr Nov 03 '17 at 13:59
  • No I tested by changing num_threads=1 to 5. When I pasted the code, looks like it was set to 1. I will change it. There is no visible drop when num_threads=1 – Anoop Nov 04 '17 at 04:28
    Your test doesn't seem fair as you use python 2.6 vs 2.7. What if the same version of python is used in both cases? Will the drop be the same? – Oleg Kuralenko Nov 04 '17 at 11:28
  • Your first issue is that your testing isn't well conceived. You need to compare equal things with **one** varying parameter to measure its impact. For instance, trying on Windows your program with Python2.6 and Python 2.7, I obtain the same order of magnitude (43s for Python 2.6 and 42s for Python 2.7), which indicates Python is not the cause of this lag. You just have to test the 2 other remaining components of the test. By the way, if one box is running more "stuff" than the other, that may also cause this lag. – Adonis Nov 06 '17 at 11:31
  • I tried `n = 6000000` and `num_threads = 5` on python 2.7 and python.26, only get 60~70 calls of `futex`, what is your setup exactly? – georgexsh Nov 06 '17 at 22:54
  • I tried on 8 core 8 GB boxes (n=6000000 and num_threads=5). Agreed that interpreter is different. I felt time taken for each futex call is more in the 2.7 variant. Hence I was trying to understand if any Kernel/HW parameter is resulting in increased futex time. – Anoop Nov 07 '17 at 12:10
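Since the comments point at the GIL as the source of the futex traffic, one diagnostic (a sketch, not from the original thread) is to lengthen the interval between forced GIL handoffs and watch whether the futex count in strace drops. The specific values below are illustrative.

```python
import sys

# Python 2 counts bytecode instructions between GIL release checks
# (sys.setcheckinterval, default 100); Python 3.2+ uses a wall-clock
# interval (sys.setswitchinterval, default 0.005 s). Lengthening either
# reduces how often threads fight over the GIL, and therefore how often
# the interpreter lock falls back to a futex system call.
if hasattr(sys, 'setswitchinterval'):
    sys.setswitchinterval(0.05)
else:
    sys.setcheckinterval(10000)
```

If the per-futex cost really differs between the two kernels, a longer switch interval should narrow the gap between the two boxes without changing single-threaded performance.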

3 Answers


It looks very likely that this is due to changes in the kernel code between those two versions. There was a bug in the kernel's futex code that caused some processes to deadlock; fixing that bug may have caused this hit to performance. The changelog for 3.10.0-514 (for CentOS) mentions a number of changes to [kernel] futex.

soundstripe
  • This could be the possible reason. I tried with same program in Java with locks and observed similar increase in time for futex calls. – Anoop Nov 08 '17 at 11:06

I'm not familiar with kernel and CPU flags, so I cannot tell you whether a CPU or kernel flag will affect the results.

This doesn't answer all of your questions, but out of curiosity I tested your code on CentOS 7.4.1708 (Linux 3.10.0-693.2.2.el7.x86_64) with different Python versions (2.6.6, 2.7.5, 3.6.3).

Python version 2.6.6

Profile Output
==============
256 function calls in 19.838 CPU seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    39   19.019    0.488   19.019    0.488 :0(acquire)
    18    0.000    0.000    0.000    0.000 :0(allocate_lock)
    13    0.000    0.000    0.000    0.000 :0(append)
...

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.98    6.319220          55    114693      2293 futex
  1.03    0.068830           1     55485           madvise
  0.10    0.006869          95        72           munmap
...

Python version 2.7.5

Profile Output
==============
247 function calls in 23.293 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    34   22.717    0.668   22.717    0.668 :0(acquire)
    18    0.047    0.003    0.047    0.003 :0(allocate_lock)
    13    0.000    0.000    0.000    0.000 :0(append)
...

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.54    7.360687         196     37613       667 futex
  0.04    0.002798           4       629       492 open
  0.01    0.000918           4       235       203 stat
...

Python version 3.6.3

Profile Output
==============
213 function calls in 17.818 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5    0.000    0.000    0.000    0.000 :0(__enter__)
     5    0.000    0.000    0.000    0.000 :0(__exit__)
    25   15.923    0.637   15.923    0.637 :0(acquire)
...

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 83.71    0.032639         244       134        38 futex 
  1.90    0.000742           1       849           clock_gettime
  1.74    0.000680           4       160           mmap
...

I executed this multiple times and got almost the same results every time, so I picked one run at random. Python 2.6.6 is slightly faster than 2.7.5, and 3.6.3 is a little faster than 2.6.6.

The strace results are almost the same for 2.6.6 and 2.7.5, but 3.6.3 looks quite different.

So, in your question,

  • Is this related to any kernel parameters?
  • Can CPU flags affect futex performance? /proc/cpuinfo indicates that cpu flags are slightly different.

I don't know.

  • Is this someway related to python version?

YES.
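One way to make that comparison concrete (a sketch; the burn() workload is illustrative, standing in for the Pi estimator) is to time the same amount of work single-threaded and split across threads on each box. If the single-threaded times match but the threaded ratio differs, the gap is in GIL/futex behaviour rather than raw interpreter speed.

```python
import threading
import time

def burn(n):
    # A pure-Python CPU-bound loop, so the GIL is the only shared resource.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(fn):
    t0 = time.time()
    fn()
    return time.time() - t0

def single(n):
    # All n iterations in the calling thread: no GIL contention at all.
    return timed(lambda: burn(n))

def threaded(n, num_threads):
    # The same n iterations split across num_threads threads, which must
    # hand the GIL back and forth (each handoff can hit a futex).
    def run():
        ts = [threading.Thread(target=burn, args=(n // num_threads,))
              for _ in range(num_threads)]
        for t in ts:
            t.start()
        for t in ts:
            t.join()
    return timed(run)
```

Running single(n) and threaded(n, 5) on both boxes, under both interpreters, would separate the two variables the comments argue about.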

changhwan

I don't think you can get a strict answer.

A futex is a kernel facility; see the futex(2) man page.

tl;dr - for instance, threads are scheduled by the kernel, and when a high-priority thread is blocked by a low-priority one you get what is called priority inversion. So the observed drop can be due to kernel flags. Another point is getting the time - that call goes to the kernel to read the realtime clock.

On the other hand, you had only one thread started, so that should not be a problem. Your threads are not interfering with each other, so there should not be anything like lock contention. I do see acquire being called, but the time spent shows it is the join() at the end waiting for the thread(s).

Can you execute the test, let's say, 50 times and provide statistics? That would take an hour, but a one-minute test can be influenced by almost anything.
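A repetition harness along these lines could collect those statistics (a sketch; the command list is a placeholder for however the test program is launched):

```python
import subprocess
import sys
import time

def time_runs(cmd, repeats):
    # Run the external benchmark `repeats` times, recording the wall-clock
    # duration of each run.
    samples = []
    for _ in range(repeats):
        t0 = time.time()
        subprocess.call(cmd)
        samples.append(time.time() - t0)
    return samples

def mean_stdev(samples):
    # Population mean and standard deviation of the collected timings.
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, var ** 0.5
```

Comparing mean and standard deviation across the two boxes (rather than single runs) would rule out one-off scheduler or load effects.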

BTW, you missed these imports:

import random
import math
Michał Zaborowski
  • I was testing the program by changing num_threads from 1 to 5. When I pasted the code, looks like it was set to 1. I have updated it to actual value 5. There is no visible drop when num_threads=1. Futex seems to be called by python Global Interpreter Lock. – Anoop Nov 04 '17 at 04:30
  • Yes, I've tried to go with sub processes, but that is not a part of your question. Anyway - I think that 2.6 - 2.7 has differences in implementation. Probably 2.7 is safer, and slower. Besides trying with bigger number of experiments - maybe 3.X is worth to check? – Michał Zaborowski Nov 07 '17 at 08:04