Large kernel wait using numpy.fft with multiprocessing

Question

I would like to compute a set of FFTs in parallel using numpy.fft.fft and multiprocessing. Unfortunately, running the FFTs in parallel results in a large amount of time spent in the kernel (high sys time).

Here is a minimal example that reproduces the problem:

# fft_test.py
import numpy as np
import multiprocessing
from argparse import ArgumentParser


def f(i):
    x = np.empty(1000000)
    np.fft.fft(x)
    return i


def __main__():
    ap = ArgumentParser('fft_test')
    ap.add_argument('--single_core', '-s', action='store_true', help='use only a single core')
    args = ap.parse_args()

    # Show the configuration
    print("number of cores: %d" % multiprocessing.cpu_count())
    np.__config__.show()

    # Execute using a single core
    if args.single_core:
        for i in range(multiprocessing.cpu_count()):
            f(i)
            print(i, end=' ')
    # Execute using all cores
    else:
        # The context manager shuts the worker processes down once map() has returned
        with multiprocessing.Pool() as pool:
            for i in pool.map(f, range(multiprocessing.cpu_count())):
                print(i, end=' ')


if __name__ == '__main__':
    __main__()

Running `time python fft_test.py` gives me the following results:

number of cores: 48
openblas_info:
    library_dirs = ['/home/till/anaconda2/envs/sonalytic/lib']
    define_macros = [('HAVE_CBLAS', None)]
    libraries = ['openblas', 'openblas']
    language = c
openblas_lapack_info:
    library_dirs = ['/home/till/anaconda2/envs/sonalytic/lib']
    define_macros = [('HAVE_CBLAS', None)]
    libraries = ['openblas', 'openblas']
    language = c
blas_opt_info:
    library_dirs = ['/home/till/anaconda2/envs/sonalytic/lib']
    define_macros = [('HAVE_CBLAS', None)]
    libraries = ['openblas', 'openblas']
    language = c
blas_mkl_info:
  NOT AVAILABLE
lapack_opt_info:
    library_dirs = ['/home/till/anaconda2/envs/sonalytic/lib']
    define_macros = [('HAVE_CBLAS', None)]
    libraries = ['openblas', 'openblas']
    language = c

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 

real    0m7.422s
user    0m9.830s
sys     1m26.603s

Running with a single core, i.e. `python fft_test.py -s`, gives:

real    1m0.345s
user    0m56.558s
sys     0m2.959s

Any idea what might cause the large kernel wait?
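
One way to narrow this down might be to measure the user/sys split from inside Python, for example around the allocation and the FFT call separately. The following is only a minimal sketch: the helper name `cpu_times` is hypothetical, and it relies on `os.times()`, which reports CPU time for the calling process only, so inside a pool it would have to be called within the worker function.

```python
import os


def cpu_times(func, *args):
    """Call func(*args) and return (result, user_seconds, system_seconds).

    Hypothetical helper: the times cover only the calling process, so
    inside a multiprocessing worker it measures that worker alone.
    """
    before = os.times()
    result = func(*args)
    after = os.times()
    return result, after.user - before.user, after.system - before.system


# Usage with a stand-in workload; in fft_test.py one could wrap
# np.empty and np.fft.fft in the same way to see which call
# accounts for the system time.
result, user, system = cpu_times(sum, range(100000))
print("user: %.3fs, sys: %.3fs" % (user, system))
```

Comparing the numbers for the allocation and for the transform, in both the single-core and the pooled run, would show which step the kernel time belongs to.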

Do you have the paid version of Anaconda with Intel MKL-compiled FFTs that are automatically multithreaded? — Ahmed Fasih, Oct 05 '16 at 13:41
I'm not using MKL because it would show in the config, *I think* (see http://stackoverflow.com/questions/22645423/make-sure-numpy-is-using-mkl-library-on-mac-pro). To make sure I'm running single-threaded, I have also set `MKL_NUM_THREADS=1`, `OMP_NUM_THREADS=1`, and `NUMEXPR_NUM_THREADS=1`. — Till Hoffmann, Oct 05 '16 at 14:03
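
The thread limits mentioned in the comment above can also be set from within the script, as sketched below. This is an illustration under assumptions, not a confirmed fix: `OPENBLAS_NUM_THREADS` is added here because the configuration output shows OpenBLAS, the helper name `limit_threads` is hypothetical, and the variables are assumed to take effect only if set before numpy is first imported. Whether they influence `numpy.fft` at all is not established by the question.

```python
import os

# Environment variables read by common threading backends.
# OPENBLAS_NUM_THREADS is an addition to the three named in the comment,
# chosen because np.__config__.show() reports an OpenBLAS build.
THREAD_VARS = (
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OMP_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def limit_threads(n=1):
    """Set each thread-count variable to n and return the resulting values."""
    for name in THREAD_VARS:
        os.environ[name] = str(n)
    return {name: os.environ[name] for name in THREAD_VARS}


settings = limit_threads(1)
print(settings)

# Assumption: numpy should be imported only after this point,
# e.g. `import numpy as np`, so the libraries see the limits at load time.
```

Worker processes started by multiprocessing inherit the parent's environment, so the values set here would also apply inside the pool.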
