2

This has to deal with a similar problem here: Calling BLAS / LAPACK directly using the SciPy interface and Cython but is different because I'm using the actual code in the SciPy example here _test_dgemm: https://github.com/scipy/scipy/blob/master/scipy/linalg/cython_blas.pyx which is extremely fast (5x faster than numpy.dot when an output matrix is input or 20x faster if not). It produces no results if Mx1 1xN vectors are passed. It produces the same values as numpy.dot with matrices passed. I've minimized the code since no answers have been posted for clarity. Here's the dgemm.pyx.:

import numpy as np
cimport numpy as np
from scipy.linalg.cython_blas cimport dgemm
from cython cimport boundscheck

@boundscheck(False)
cpdef int fast_dgemm(double[:,::1] a, double[:,::1] b, double[:,::1] c, double alpha=1.0, double beta=0.0) nogil except -1:

    cdef:
        char *transa = 'n'
        char *transb = 'n'
        int m, n, k, lda, ldb, ldc
        double *a0=&a[0,0]
        double *b0=&b[0,0]
        double *c0=&c[0,0]

    ldb = (&a[1,0]) - a0 if a.shape[0] > 1 else 1
    lda = (&b[1,0]) - b0 if b.shape[0] > 1 else 1

    k = b.shape[0]
    if k != a.shape[1]:
        with gil:
            raise ValueError("Shape mismatch in input arrays.")
    m = b.shape[1]
    n = a.shape[0]
    if n != c.shape[0] or m != c.shape[1]:
        with gil:
            raise ValueError("Output array does not have the correct shape.")
    ldc = (&c[1,0]) - c0 if c.shape[0] > 1 else 1
    dgemm(transa, transb, &m, &n, &k, &alpha, b0, &lda, a0,
               &ldb, &beta, c0, &ldc)
    return 0

Here's a sample test script:

import numpy as np;

a=np.random.randn(1000);
b=np.random.randn(1000);
a.resize(len(a),1);
a=np.array(a, order='c');
b.resize(1,len(b)); 
b=np.array(b, order='c');
c = np.empty((a.shape[0],b.shape[1]), float, order='c'); 

from dgemm import _test_dgemm; 
_test_dgemm(a,b,c);

And if you want to play with it on Windows with Python 3.5 x64 here's the setup.py to build it via the command prompt typing python setup.py build_ext --inplace --compiler=msvc

from Cython.Distutils import build_ext
import numpy as np
import os

try:
    from setuptools import setup
    from setuptools import Extension
except ImportError:
    from distutils.core import setup
    from distutils.extension import Extension

module = 'dgemm'

ext_modules = [Extension(module, sources=[module + '.pyx'],
              include_dirs=['C://Program Files (x86)//Windows Kits//10//Include//10.0.10240.0//ucrt','C://Program Files (x86)//Microsoft Visual Studio 14.0//VC//include','C://Program Files (x86)//Windows Kits//8.1//Include//shared'],
              library_dirs=['C://Program Files (x86)//Windows Kits//8.1//bin//x64', 'C://Windows//System32', 'C://Program Files (x86)//Microsoft Visual Studio 14.0//VC//lib//amd64', 'C://Program Files (x86)//Windows Kits//8.1//Lib//winv6.3//um//x64', 'C://Program Files (x86)//Windows Kits//10//Lib//10.0.10240.0//ucrt//x64'],
              extra_compile_args=['/Ot', '/favor:INTEL64', '/EHsc', '/GA'],
              language='c++')]

setup(
    name = module,
    ext_modules = ext_modules,
    cmdclass = {'build_ext': build_ext},
    include_dirs = [np.get_include(), os.path.join(np.get_include(), 'numpy')]
    )

Any help is much appreciated!

ead
  • 32,758
  • 6
  • 90
  • 153
Matt
  • 2,602
  • 13
  • 36
  • Huh, I thought you had to `cdef extern from "blas.h" nogil:` before using `dgemm`. Plus, I don't get why other examples use function pointers – Breno Feb 21 '23 at 16:30

1 Answers1

3

If I see it right, you try to use fortran-routines for arrays with c-memory-layout.

Even if it is obviously known to you, I would like first to elaborate on the row-major-order (c-memory-layout) and the column-major-order (fortran-memory-layout), in order to deduce my answer.

So if we have a 2x3 matrix (i.e. 2 rows and 3 columns) A, and store it in some continuous memory we get:

row-major-order(A) = A11, A12, A13, A21, A22, A23
col-major-order(A) = A11, A21, A12, A22, A13, A33

That means if we get a continuous memory, which represents a matrix in the row-major-order, and interpret it as a matrix in column-major-order we will get quite a different matrix!

However, we we take a look at the transposed matrix A^t we can easily see:

row-major-order(A) = col-major-order(A^t)
col-major-order(A) = row-major-order(A^t)

That means, if we would like to get the matrix C in row-major-order as result, the blas-routine should write the transposed matrix C in column-major-order (after all this we cannot change) into this very memory. However, C^t=(AB)^t=B^t*A^t and B^t an A^t are the original matrices reinterpreted in column-major-order.

Now, let A be a n x k-matrix and B a k x m-matrix, the call of dgemm routine should be as follows:

dgemm(transa, transb, &m, &n, &k, &alpha, b0, &m, a0, &k, &beta, c0, &m)

As you can see, you switched some n and m in your code.

Matt
  • 2,602
  • 13
  • 36
ead
  • 32,758
  • 6
  • 90
  • 153
  • Thank you - that worked for the vectors. I posted an edit to match the syntax. The C format is intentional as it is being called from NumPy although the underlying libraries are Fortran. Great explanation by the way. – Matt Jul 20 '17 at 21:22