
I am trying to parallelize a piece of code that looks like the following:

%%cython --compile-args=-fopenmp --link-args=-fopenmp -a
# (for the serial build the plain "%%cython -a" magic was used instead)
cimport openmp
cimport cython
import numpy as np
cimport numpy as np
from cpython cimport array
import array
import cython.parallel as cp
from cython.parallel import parallel, prange
"""stuff"""

@cython.boundscheck(False)  # Deactivate bounds checking                                                                  
@cython.wraparound(False)   # Deactivate negative indexing.                                                               
@cython.cdivision(True)     # Deactivate division by 0 checking.
cdef function(arr1, arr2):
    len_out = 3*length  # length and lookup are defined in the omitted setup ("stuff")
    cdef np.ndarray[np.float64_t, ndim=2] output_arr
    output_arr = np.zeros((len_out, 6))
    cdef np.ndarray[np.float64_t, ndim=1] current_arr1 = arr1
    cdef np.ndarray[np.float64_t, ndim=1] current_arr2 = arr2
    cdef np.ndarray[np.int_t, ndim=2] lkupcoord = np.array(lookup)
    """
    stuff
    """
    #for k in range(length):
    for k in prange(0, length, 1, nogil=True, schedule="static", chunksize=200, num_threads=2):
     
        #Block 1
        if "check some conditions":
            "do computation"
            output_arr[k,0] = val
            .
            .
            .
            output_arr[k,5] = val
        else:
            "do some computation"
            output_arr[k,0] = val
            .
            .
            .
            output_arr[k,5] = val
        #Block 2

        if "check some conditions":
            "do computation"
            output_arr[k+length,0] = val
            .
            .
            .
            output_arr[k+length,5] = val
        elif "check some other conditions":
            "do some computation"
            output_arr[k+2*length,0] = val
            .
            .
            .
            output_arr[k+2*length,5] = val

        #Block 3

        if "check some conditions":
            "do computation"
            output_arr[k,0] = val
            .
            .
            .
            output_arr[k,5] = val
        elif "check some other conditions":
            "do some computation"
            output_arr[k,0] = val
            .
            .
            .
            output_arr[k,5] = val
    return output_arr

Here length is approximately 2000. Using prange is the same as, or even slightly slower than, using a regular range, and I am not quite sure why. I don't get any Python/GIL errors as far as I can tell, and the code runs.

I suspect that length is simply not large enough to benefit from prange, so it is not worth the extra overhead?
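
A rough back-of-the-envelope check (using the timings reported further down): ~0.0015 s for a loop of ~2000 iterations is roughly 0.75 µs of work per iteration, so a chunk of 200 iterations is only about 150 µs of work. That is not far from the cost of waking up and synchronizing an OpenMP thread team, so even in the best case there is little room for a 2-thread speed-up.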

I was also wondering whether Block 1, Block 2 and Block 3 could be launched on separate threads (by wrapping them into different functions) and then combined into output_arr? Would that in principle give me a 3x speed-up? A rough sketch of what I mean is just below.
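
For concreteness, a minimal sketch of that idea, assuming the three blocks were refactored into nogil helpers; block1/block2/block3/run_blocks are hypothetical names, not my actual code:

cimport openmp
from cython.parallel import parallel

cdef void block1(double[:, ::1] out, Py_ssize_t length) nogil:
    pass  # placeholder: loop over k, write out[k, :]

cdef void block2(double[:, ::1] out, Py_ssize_t length) nogil:
    pass  # placeholder: write out[k + length, :] / out[k + 2*length, :]

cdef void block3(double[:, ::1] out, Py_ssize_t length) nogil:
    pass  # placeholder

def run_blocks(double[:, ::1] out, Py_ssize_t length):
    cdef int tid
    with nogil, parallel(num_threads=3):
        tid = openmp.omp_get_thread_num()  # assigned inside parallel(), so thread-private
        if tid == 0:
            block1(out, length)
        elif tid == 1:
            block2(out, length)
        else:
            block3(out, length)

Each thread would only write to its own disjoint rows (as in the k / k+length / k+2*length layout above), so no synchronization would be needed.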

I am willing to provide the exact code, some sample input/output, and the annotated Cython output if that helps. Any help speeding this up would be greatly appreciated.

I am running on a cluster as an interactive job (srun) inside a Jupyter notebook. The resources requested at the time of the job were 1 node with 2 cores and 16 GB of memory. The node info (scontrol show node) is:

NodeName=node443 Arch=x86_64 CoresPerSocket=14
   CPUAlloc=28 CPUErr=0 CPUTot=28 CPULoad=27.09
   AvailableFeatures=node443,centos7
   ActiveFeatures=node443,centos7
   Gres=(null)
   NodeAddr=node443 NodeHostName=node443 
   OS=Linux 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 
   RealMemory=128000 AllocMem=64440 FreeMem=21369 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=sched_mit_binz 
   BootTime=2021-08-11T14:47:16 SlurmdStartTime=2021-08-11T14:47:54
   CfgTRES=cpu=28,mem=125G,billing=28
   AllocTRES=cpu=28,mem=64440M
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
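
To double-check how many cores the notebook process can actually use inside that allocation, a quick sanity check like the following (not part of the timed code) should work:

import os, multiprocessing
print("CPUs on the node:", multiprocessing.cpu_count())
print("CPUs this process may use:", len(os.sched_getaffinity(0)))  # Linux only
print("SLURM_CPUS_ON_NODE:", os.environ.get("SLURM_CPUS_ON_NODE"))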

arr1 and arr2 are 1D np.ndarrays with entries that are either 0 or 1; their sizes are 1700 and 60 entries respectively.

Time with prange, average time per call: 0.00308809373805772
Time without prange, average time per call: 0.0014715276634957217

Timed using from timeit import default_timer as timer.
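
Roughly, the timing loop looked like this (a sketch; n_calls and the argument arrays are placeholders):

from timeit import default_timer as timer

n_calls = 1000  # placeholder
start = timer()
for _ in range(n_calls):
    out = function(arr1, arr2)
print("average time per call:", (timer() - start) / n_calls)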

Updated code: these are the computations being performed inside the loop over k (taken from the actual serial version, so the variable names differ slightly from the sketch above).

# fabs, log and exp are the C functions (e.g. from libc.math cimport fabs, log, exp)
for k in range(lenpop):
    q_c = current_c[k]
    i,j = lkupcoord[k]
    i_m = current_m[i]
    j_m = current_m[j]
    if i_m == j_m and i_m == 1.0:
        e = 0.0 #some float
    else:
        e = 0.0 #some float
    
    
    if fabs(i-j) == 1.0:
        bb = 0.0
        
    else:
        bb = 1.0
    if q_c ==0 and bb == 1.0:
        dq = +1.0
        q_n = q_c + dq
        r =  kc
        output_arr[k+2*s, 0] = k
        output_arr[k+2*s, 1] = i
        output_arr[k+2*s, 2] = j
        output_arr[k+2*s,3]  = dq
        output_arr[k+2*s, 4] = r
        output_arr[k+2*s, 5] = 0

    elif q_c == 1 and bb == 1.0:
        
        dq = -1.0
        q_n = q_c + dq
        Qc = (mu-(N-1))/Mc
        Qn = (mu+dq-(N-1))/Mc
        fudge = (
                Mc*(Qn*log(Qn + 0.0001) + (1-Qn)*log(1-Qn + 0.0001))
                - Mc*(Qc*log(Qc + 0.0001) + (1-Qc)*log(1-Qc + 0.0001))
            )
        dF = ((100*(Qn-0.5)**4 - 25*(Qn-0.5)**2 + e*q_n)
              - (100*(Qc-0.5)**4 - 25*(Qc-0.5)**2 + e*q_c)
              - fudge)  # some quartic polynomial in general

        
        r =  kc*exp(-dF)
        
        output_arr[k+2*s, 0] = k
        output_arr[k+2*s, 1] = i
        output_arr[k+2*s, 2] = j
        output_arr[k+2*s, 3] = dq
        output_arr[k+2*s, 4] = r
        output_arr[k+2*s, 5] = 0
    if i_m == 1:
        
        d_im = -1
        if bb==0.0:
            r = 0.5*c2*i_m + q_c*cy1*i_m*(1-j_m)
        else:
            r = q_c*cy1*i_m*(1-j_m) 
            
        output_arr[k+2*s+1, 0] = k
        output_arr[k+2*s+1, 1] = i
        output_arr[k+2*s+1, 2] = j
        output_arr[k+2*s+1, 3] = d_im
        output_arr[k+2*s+1, 4] = r
        output_arr[k+2*s+1, 5] = 1
    
    elif i_m == 0:
        
        d_im = +1
        if bb==0.0:
            r =  0.5*c2*(1-i_m) + q_c*cx1*(1-i_m)*j_m
        else:
            r = q_c*cx1*(1-i_m)*j_m
        output_arr[k+2*s+1, 0] = k
        output_arr[k+2*s+1, 1] = i
        output_arr[k+2*s+1, 2] = j
        output_arr[k+2*s+1, 3] = d_im
        output_arr[k+2*s+1, 4] = r
        output_arr[k+2*s+1, 5] = 1
    

   
    if j_m == 1:
        
        d_jm = -1
        if bb==0.0:
            r = 0.5*c2*j_m + q_c*cy1*j_m*(1-i_m)
        else:
            r = q_c*cy1*j_m*(1-i_m)
        output_arr[k+2*s+2,0] = k
        output_arr[k+2*s+2,1] = i
        output_arr[k+2*s+2,2] = j
        output_arr[k+2*s+2,3] = d_jm
        output_arr[k+2*s+2,4] = r
        output_arr[k+2*s+2,5] = 2
    
    elif j_m == 0:
        
        d_jm = +1
        if bb==0.0:
            r = 0.5*c2*(1-j_m) + q_c*cx1*(1-j_m)*i_m
        else:
            r = q_c*cx1*(1-j_m)*i_m
        output_arr[k+2*s+2,0] = k
        output_arr[k+2*s+2,1] = i
        output_arr[k+2*s+2,2] = j
        output_arr[k+2*s+2,3] = d_jm
        output_arr[k+2*s+2,4] = r
        output_arr[k+2*s+2,5] = 2
    
    
    s=s+1

output_arr = output_arr[~np.all(output_arr == 0, axis=1)]

return output_arr
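
In the prange version I do not carry s=s+1 across iterations; since s advances once per k, the row offsets are derived directly from k, roughly like this (assuming s starts at 0):

row = 3*k  # k + 2*s with s == k when s starts at 0
output_arr[row,     0] = k   # first record for this k (the q_c block)
output_arr[row + 1, 0] = k   # second record (the i_m block)
output_arr[row + 2, 0] = k   # third record (the j_m block)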

A sample call: function([1,0,0,1], [0,1,0,1])

  • Using multiple threads is not always faster. There are plenty of reasons, but it is hard to tell which one without a minimal *reproducible* example. One reason is that the computation takes a very short time and the time to create threads is significant compared to the computation time. Another is that the computation is totally *memory bound* (the RAM is a shared resource, so it does not scale with the number of threads used). – Jérôme Richard Jan 11 '22 at 23:49
  • @JérômeRichard Do you know some way to output some information so I can test this? I can then add the working example + output to this question. – jcp Jan 12 '22 at 00:01
  • A computing time as well as the size of the input array (for a case that is typically an issue) should help a lot. If possible, you could provide the exact RAM+processor information (although it is a bit technical), which should help a bit more. For example, processor i5-9600KF & 2 DIMMs at 3200MHz. This can be used to know the maximum theoretical memory throughput. You can easily get this information in the task manager on Windows. For the rest, a more complete example code is required. Most of the time, a lack of speed-up is actually due to race conditions (ie. sneaky bugs) in the code. – Jérôme Richard Jan 12 '22 at 00:25
  • @JérômeRichard I updated some of the information in the question, including information about the architecture, the timings, and the input and output sizes. I will try to prepare a more complete working example and update later. Does this information help a little bit? – jcp Jan 12 '22 at 01:19
  • 2,000 elements doesn't sound like a very large collection. Given that you're looking at execution times of under 0.001 seconds anyway, adding threads just causes more overhead than time saved - if any time is saved at all (we can't tell from a lack of information about the computation). Threads aren't a magical sauce that makes everything faster - in fact, if the resource you're looking to share is CPU, you need to use processes, not threads for any substantial gain (and the overhead is even larger). Threads will only help if your limiting factor is I/O or similar. – Grismar Jan 12 '22 at 01:21
  • @Grismar Thank you for the comment, that does make sense. Does multithreading help if the function is called multiple times? In the course of this simulation the function is called approximately 2,000,000 times (which also makes it a good target for optimisation; it is the most expensive part). – jcp Jan 12 '22 at 01:27
  • Multi-threading creates multiple threads, but all of them run on the same core - so unless these threads are waiting for something (like a file write, a web request, activity on a device, etc.) there is really no performance improvement to be expected from adding threading (just overhead from switching between them). Multiprocessing may help if you have spare cores, but a process is a heavy thing to create, so this only helps if there is substantial work to be done (like a costly computation or a large amount of data to process) – Grismar Jan 12 '22 at 01:31
  • @grismar that's completely wrong. Multi-threading uses different cores. The limitation on Python is the GIL (which prevents multiple threads manipulating Python objects simultaneously) – DavidW Jan 12 '22 at 11:16
  • From the provided information, it looks like this does not come from the time to create threads, since you create only 2 threads and this time is generally much smaller than 1-3 ms on Linux for 2 threads (I expect no more than 100 us per thread). I do not think the saturation of the memory can explain the timings alone either, because it should not be 2 times slower with 2 threads. That being said, you are talking about 2 cores while the hardware info states 2 sockets of 14 cores (`CoresPerSocket=14`, `Sockets=2`). Is this correct? – Jérôme Richard Jan 12 '22 at 12:06
  • @DavidW - you're definitely right that I should have said all of them run within the same GIL, which means that only a single thread can ever hold control of the Python interpreter. You're right that the active thread could still be using multiple cores - but multithreading doesn't allow execution of multiple threads in parallel on separate cores. You need separate processes (each with their own GIL) for that. Sorry for not phrasing it sufficiently clearly - but the point is the same. If you want processes to be executed in parallel, independently, on separate cores, threads won't do it. – Grismar Jan 13 '22 at 00:07
  • This should not be relevant here anyway since the OP used `nogil=True` in the hot parallel loop, isn't it? – Jérôme Richard Jan 13 '22 at 00:25
  • @JérômeRichard Indeed, the GIL was released. Also, yes, the number of cores is 14*2 from what I know about the cluster. The consensus in the comments seems to converge on the original explanation you offered. Would parallelising Block 1, Block 2 and Block 3 give a performance boost in this case, instead of using prange()? – jcp Jan 13 '22 at 23:59
  • From the above code, I think the parallel computation can be improved. But as I said before, it is not possible to tell much about how to do that without a reproducible code. The previously provided causes are not sufficient to fully explain the observed effects. There are many other possibilities that can explain the slowdown (NUMA effects, false sharing, bugs like race conditions), but testing each possibility on a semi-black-box code would take too much time. – Jérôme Richard Jan 14 '22 at 00:25
  • @JérômeRichard I can however list out the computations being performed inside the k loop; the other functions and parts of the overall code would just be too long to quote in the post. I don't know if this final bit of additional information is helpful/insightful or not. – jcp Jan 14 '22 at 01:27
  • It is still not reproducible because several variables are undefined, but this is much better to understand what is going on. The main computational part is the block that compute `log`/`exp` functions and divisions. Assuming this part is executed quite often and many terms like `Qc` and `Qn` are constant in the loop, then the parallel version can be slower due to a problem described in [this post](https://stackoverflow.com/a/70718662/12939557) I just answered (note that Cython and Numba share this problem). You can try to *precompute constants* as much as possible and check if this is better. – Jérôme Richard Jan 15 '22 at 03:47
  • Note that you cannot directly use `s=s+1` in the parallel loop (I do not know if this was the case) unless `s` is explicitly initialized differently in each thread. You should write something like `s = s_init + k` so to avoid a loop carried dependency. – Jérôme Richard Jan 15 '22 at 04:09
  • @JérômeRichard Thanks for your comments, I will try to put together a minimal reproducible example over the weekend. The issue is that there are many extraneous computations and the code is fairly long, so I have to go through it and erase a bunch of stuff. Regarding the `s=s+1` issue: this is from the last working serial version; in the parallel version I write it using `k + ..`. Thanks for the suggestion about precomputing constants. Would parallelising over the blocks instead of over k help with this issue at all? – jcp Jan 15 '22 at 05:23
  • I did try reading the answer to that other post, but I am embarrassed to admit I don't understand it very well. Although it does seem unintuitive to me that computing something that in principle only needs to be computed if certain conditions are met would be faster than computing it only when needed inside the loop. – jcp Jan 15 '22 at 05:47
  • Ok. I am not sure to understand what you mean by "parallelising over the blocks". If this means filtering the input N times (with N the number of blocks) and for each performing a parallel loop, then I expect better performance, but I do not think this will solve the problem. Working on chunks like in the linked answer should help (assuming this is the actual problem). – Jérôme Richard Jan 15 '22 at 13:28
  • The issue with constant propagation is a very low-level one. It should not theoretically occur on very good compilers; thus, people like you should theoretically not have to care about this. However, mainstream compilers are not perfect, so such complex behavior can unfortunately appear. I found this issue several years ago and I do not expect it to be fixed soon (since it is pretty complex to completely fix). Still, I am not totally sure this is the actual problem. – Jérôme Richard Jan 15 '22 at 13:33
  • @JérômeRichard Moving the computations outside the loop gives 0.0009197398246941845 (using prange) and 0.0008909936871728182 (without prange), so I guess this disproves the constant propagation hypothesis? – jcp Jan 29 '22 at 05:23
  • @jcp Well, I think removing the computation in the `q_c == 1 and bb == 1.0` part is enough to make the algorithm almost completely memory-bound. If so, the parallel version should not be significantly faster assuming one core can saturate the memory. Regarding the small timings and the time to create additional threads, a small speed up is certainly mitigated with the added overheads. So I do not think it disproves the hypothesis. Looking at the assembly code should help to check this hypothesis. – Jérôme Richard Feb 01 '22 at 20:39

0 Answers