Parallel C wrapper for Cython code

Question

As DavidW recommended in this topic, I'm trying to write a C wrapper function that uses OpenMP in order to multithread Cython code.

Here is what I have:

The C file "paral.h":

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>


void paral(void (*func)(int,int), int nthreads){
    int t;
    #pragma omp parallel for
    for (t = 0; t < nthreads; t++){
        (*func)(t, nthreads);
    }
}

The test.pyx file:

import time
import random
cimport cython
from libc.stdlib cimport malloc, realloc, free

ctypedef void (*func)(int,int)

cdef extern from "paral.h":
    void paral(func function, int nthreads) nogil

cdef double *a = <double *> malloc ( 1000000 * sizeof(double) )
cdef double *b = <double *> malloc ( 1000000 * sizeof(double) )
cdef double *c = <double *> malloc ( 1000000 * sizeof(double) )

cdef int i
for i in range(1000000):
    a[i] = random.random()
    b[i] = random.random()

cdef void sum_ab(int thread, int nthreads):
    cdef int start, stop, i
    start = thread * (1000000 / nthreads)
    stop = start + (1000000 / nthreads)
    for i in range(start, stop):
        c[i] = a[i] + b[i]

t0 = time.clock()
with nogil:
    paral(sum_ab,4)
print(time.clock()-t0)

t0 = time.clock()
with nogil:
    paral(sum_ab,1)
print(time.clock()-t0)
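A side note on the chunking in sum_ab: start and stop are derived from 1000000 / nthreads, which covers the whole array only when nthreads divides 1000000 evenly (true for the 1 and 4 used here). A minimal pure-Python sketch of bounds that also distribute a remainder is shown below; chunk_bounds is a hypothetical helper name, not part of the code above.

```python
def chunk_bounds(thread, nthreads, n=1000000):
    """Return (start, stop) for one thread's slice of range(n).

    Hypothetical helper: the first `extra` threads each take one
    additional element so that no index is left unprocessed.
    """
    base, extra = divmod(n, nthreads)
    start = thread * base + min(thread, extra)
    stop = start + base + (1 if thread < extra else 0)
    return start, stop


# With 3 threads the slices are contiguous and end exactly at n.
bounds = [chunk_bounds(t, 3) for t in range(3)]
print(bounds)
```

The same arithmetic could be written with C ints inside sum_ab, using // for the integer division.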

I'm using Visual Studio, so in setup.py I have added:

extra_compile_args=["/openmp"],
extra_link_args=["/openmp"]
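For comparison, here is a small sketch of how the OpenMP flags could be chosen per platform. It assumes MSVC on Windows and a GCC-style toolchain elsewhere; openmp_args is a hypothetical helper, and the assumption that MSVC needs the switch only at compile time should be checked against the actual build output.

```python
import sys


def openmp_args(platform=sys.platform):
    """Return (compile_args, link_args) enabling OpenMP (hypothetical helper)."""
    if platform == "win32":
        # Assumption: MSVC. /openmp is a compiler switch; no linker flag is assumed here.
        return ["/openmp"], []
    # Assumption: GCC-style toolchain, where the flag is given at compile and link time.
    return ["-fopenmp"], ["-fopenmp"]


compile_args, link_args = openmp_args()
# These lists would then be passed to the extension definition in setup.py as
# extra_compile_args=compile_args and extra_link_args=link_args.
print(compile_args, link_args)
```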

Results: the 4-threaded run is slightly slower than the 1-threaded run. Does anyone know what I'm doing wrong here?

Edit:

In response to Zulan:

To check that the time measured by time.clock() is correct, I made the execution last a few seconds so that I could compare the time reported by time.clock() with the time I measure with a stopwatch. Something like this:

print("start timer 1")

t1 = time.clock()
for i in range(10000):
    with nogil:
        paral(sum_ab,4)
t2 = time.clock()

print(t2-t1)
print("start timer 2")

t1 = time.clock()
for i in range(10000):
    with nogil:
        paral(sum_ab,1)
t2 = time.clock()

print(t2-t1)
print("stop")

Results with time.clock() are 15.0 s 4-threaded and 14.5 s 1-threaded, and I see no noticeable difference from what I measure with the stopwatch.
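Since what time.clock() measures depends on the platform, and the function was removed in Python 3.8, the two quantities can be measured explicitly. The sketch below assumes Python 3.3 or later: time.perf_counter() gives elapsed wall-clock time, while time.process_time() gives CPU time summed over the whole process. The measure helper and the stand-in workload are illustrative only.

```python
import time


def measure(fn, repeat=1):
    """Run fn() `repeat` times; return (wall_seconds, cpu_seconds)."""
    wall0 = time.perf_counter()   # elapsed real time
    cpu0 = time.process_time()    # user + system CPU time of the process
    for _ in range(repeat):
        fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0


# Stand-in workload; in test.pyx this would be the paral(sum_ab, n) call.
wall, cpu = measure(lambda: sum(range(100000)), repeat=10)
print("wall: %.4f s, cpu: %.4f s" % (wall, cpu))
```

For a multithreaded run that scales well, the CPU time would be expected to exceed the wall-clock time, which is why mixing up the two can hide a real speedup.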

Edit 2: I think I've figured out what is happening here. I read that in some cases memory bandwidth can be saturated. If I replace:

c[i] = a[i] + b[i]

with a more complex operation, for example:

c[i] = a[i]**b[i]

Now I get a significant speedup between the single-threaded and multi-threaded versions (nearly 2x).

However, I'm still 2x slower than a classic prange loop! I see no reason why prange is that much faster. Maybe I need to change the C code...
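The bandwidth hypothesis can be sanity-checked with rough arithmetic on the timings reported in the first edit. The sketch below assumes each call streams all three arrays through main memory exactly once (a and b read, c written) and ignores caching and any extra traffic caused by writes, so it is only an order-of-magnitude estimate.

```python
N = 1_000_000            # elements per array
BYTES_PER_DOUBLE = 8
ARRAYS = 3               # a and b are read, c is written
CALLS = 10_000           # iterations of the timing loop


def effective_bandwidth_gb_s(seconds):
    """Estimated memory traffic in GB/s for the whole timing loop."""
    total_bytes = N * BYTES_PER_DOUBLE * ARRAYS * CALLS
    return total_bytes / seconds / 1e9


for label, seconds in (("4 threads", 15.0), ("1 thread", 14.5)):
    print("%s: %.1f GB/s" % (label, effective_bandwidth_gb_s(seconds)))
```

Both cases come out at roughly 16 GB/s. If that is close to what the machine's memory can sustain, adding threads to c[i] = a[i] + b[i] would not be expected to help, which would be consistent with the speedup appearing only for the more expensive a[i]**b[i].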

Possible duplicate of [OpenMP time and clock() calculates two different results](https://stackoverflow.com/questions/10673732/openmp-time-and-clock-calculates-two-different-results) — Zulan, Aug 11 '18 at 22:27
I have "manually" measured the time by iterating a large number of times; the results are the same. — lync maloe, Aug 11 '18 at 23:13
Don't be so vague. Tell us exactly what you have done and what you observed, with actual numbers. — Zulan, Aug 12 '18 at 07:33
Thanks for the update. I recommend taking a physical stop-watch with the 15s threaded version and compare the results. Then check again with the linked question (particularly Hristo's answer). — Zulan, Aug 12 '18 at 16:21
Sorry if I haven't been clear, but that's what I have done. And I see no appreciable difference between time.clock() and the physical measurement: 15.0 s 4-threaded and 14.5 s 1-threaded with time.clock(), and 16 s for both with a physical stopwatch. — lync maloe, Aug 12 '18 at 17:06
My apologies, this is such a common question so I was a bit too focused. I overlooked you are using Windows (where `clock` is actually wall-clock not cumulative CPU-time). Sorry I am not able to help more for your platform. — Zulan, Aug 12 '18 at 19:21
