As DavidW recommended in this topic, I'm trying to write a C wrapper function that uses OpenMP in order to multithread Cython code.
Here is what I have:
The C file "paral.h":
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    void paral(void (*func)(int,int), int nthreads){
        int t;
        #pragma omp parallel for
        for (t = 0; t < nthreads; t++){
            (*func)(t, nthreads);
        }
    }
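To make the dispatch explicit, here is a minimal sketch of how the wrapper hands one (t, nthreads) pair to the callback per loop iteration. The names `record_call`, `each_chunk_ran_once`, `calls` and `MAX_CHUNKS` are hypothetical helpers for this illustration only, and `paral` is repeated (with the pragma guarded so it also builds without OpenMP) just to keep the sketch self-contained.

```c
#include <assert.h>

#define MAX_CHUNKS 64

/* Hypothetical bookkeeping for this sketch: how often each chunk index ran. */
int calls[MAX_CHUNKS];

/* Same wrapper as in paral.h, repeated so the sketch is self-contained. */
void paral(void (*func)(int, int), int nthreads) {
    int t;
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (t = 0; t < nthreads; t++) {
        (*func)(t, nthreads);
    }
}

/* Callback with the same signature as sum_ab: records that chunk `thread` ran. */
void record_call(int thread, int nthreads) {
    assert(thread >= 0 && thread < nthreads && nthreads <= MAX_CHUNKS);
    calls[thread] += 1; /* each index is touched by exactly one iteration */
}

/* Returns 1 if every chunk index in [0, nthreads) was invoked exactly once. */
int each_chunk_ran_once(int nthreads) {
    int t, ok = 1;
    for (t = 0; t < nthreads; t++) {
        calls[t] = 0;
    }
    paral(record_call, nthreads);
    for (t = 0; t < nthreads; t++) {
        if (calls[t] != 1) {
            ok = 0;
        }
    }
    return ok;
}
```

Note that `nthreads` only sets the number of loop iterations. The pragma has no `num_threads` clause, so the size of the thread team is left to the OpenMP runtime (for example via `OMP_NUM_THREADS`). Adding `num_threads(nthreads)` would request a team of that size; whether that changes the timings here is untested.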
The test.pyx file:
    import time
    import random
    cimport cython
    from libc.stdlib cimport malloc, realloc, free

    ctypedef void (*func)(int,int)

    cdef extern from "paral.h":
        void paral(func function, int nthreads) nogil

    cdef double *a = <double *> malloc ( 1000000 * sizeof(double) )
    cdef double *b = <double *> malloc ( 1000000 * sizeof(double) )
    cdef double *c = <double *> malloc ( 1000000 * sizeof(double) )

    cdef int i
    for i in range(1000000):
        a[i] = random.random()
        b[i] = random.random()

    cdef void sum_ab(int thread, int nthreads):
        cdef int start, stop, i
        # integer division; plain / is true division under language_level=3
        start = thread * (1000000 // nthreads)
        stop = start + (1000000 // nthreads)
        for i in range(start, stop):
            c[i] = a[i] + b[i]

    t0 = time.clock()
    with nogil:
        paral(sum_ab,4)
    print(time.clock()-t0)

    t0 = time.clock()
    with nogil:
        paral(sum_ab,1)
    print(time.clock()-t0)
I'm using Visual Studio, so in setup.py I have added:
    extra_compile_args=["/openmp"],
    extra_link_args=["/openmp"]
Result: the 4-threaded run is slightly slower than the 1-threaded run. Does anyone know what I'm doing wrong here?
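A side note on the chunking in sum_ab: with 1000000 elements and 4 or 1 chunks the division is exact, but for a count that does not divide evenly (3 chunks, say) the last element would never be processed. The sketch below is one possible way to compute the bounds so that the leftover elements are spread over the first chunks. `chunk_bounds` is a hypothetical helper name, not part of the code above.

```c
/* Computes the half-open range [*start, *stop) for chunk `thread` out of
   `nthreads`, spreading the n % nthreads leftover elements over the first
   chunks so that every index in [0, n) is covered exactly once. */
void chunk_bounds(int n, int thread, int nthreads, int *start, int *stop) {
    int base = n / nthreads;   /* elements every chunk gets */
    int extra = n % nthreads;  /* leftover elements to hand out */
    *start = thread * base + (thread < extra ? thread : extra);
    *stop = *start + base + (thread < extra ? 1 : 0);
}
```

For n = 10 and 4 chunks this gives the ranges [0, 3), [3, 6), [6, 8) and [8, 10).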
Edit:
In response to Zultan.
To check that the time measured by time.clock() is correct, I made the run last a few seconds, so that I could compare the time.clock() result with the time I measure with a stopwatch. Something like this:
    print("start timer 1")
    t1 = time.clock()
    for i in range(10000):
        with nogil:
            paral(sum_ab,4)
    t2 = time.clock()
    print(t2-t1)

    print("start timer 2")
    t1 = time.clock()
    for i in range(10000):
        with nogil:
            paral(sum_ab,1)
    t2 = time.clock()
    print(t2-t1)

    print("stop")
With time.clock(), the results are 15.0 s for the 4-threaded run and 14.5 s for the 1-threaded run, and I see no noticeable difference from what I measure with the stopwatch.
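A back-of-the-envelope way to read these timings is as memory traffic. The sketch below assumes each element costs two 8-byte loads and one 8-byte store, and that the three arrays (about 24 MB together) do not stay in cache between calls. Both are assumptions, not measurements, and `est_bandwidth_gb_s` is a hypothetical helper name.

```c
/* Estimated memory traffic in GB/s (1 GB = 1e9 bytes) for an element-wise
   c[i] = a[i] + b[i] over doubles, assuming every access goes to main memory. */
double est_bandwidth_gb_s(double n_elems, double n_calls, double seconds) {
    const double bytes_per_elem = 3.0 * sizeof(double); /* load a, load b, store c */
    return n_elems * n_calls * bytes_per_elem / seconds / 1e9;
}
```

With 1e6 elements, 1e4 calls and 14.5 s this comes out at roughly 16.5 GB/s, and at about 16 GB/s for 15.0 s. If that is close to what the machine's memory can deliver, extra threads would have little left to gain on such a simple loop.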
Edit 2: I think I've figured out what is happening here. I read that in some cases memory bandwidth can be saturated. If I replace:
    c[i] = a[i] + b[i]
with a more complex operation, for example:
    c[i] = a[i]**b[i]
I now get a significant speedup of the multithreaded version over the single-threaded one (nearly 2x).
However, I'm still 2x slower than a classic prange loop! I see no reason why prange is that much faster. Maybe I need to change the C code...
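One structural difference that might be worth testing: as far as I understand, prange puts the OpenMP work-sharing loop directly on the element loop, whereas the wrapper above parallelises a loop over chunk indices and reaches the real work through a function pointer. The sketch below shows the direct form in plain C. It is a guess at where the gap could come from, not a measured explanation, and `sum_direct` is a hypothetical name. The body uses the simple sum, but any element-wise expression could go in its place.

```c
/* Element-wise sum with the OpenMP work-sharing loop placed directly on the
   element loop, instead of on a loop over chunk indices. Without OpenMP
   enabled, this compiles as an ordinary serial loop. */
void sum_direct(const double *a, const double *b, double *c, int n) {
    int i;
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Here the runtime splits the n iterations across however many threads it starts, so there is no manual start/stop arithmetic and no indirect call per chunk.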