
While evaluating ways to speed up our Python code, I came across this Stack Overflow post: Comparing Python, Numpy, Numba and C++ for matrix multiplication

I was quite impressed with Numba's performance and implemented some of our functions in Numba. Unfortunately, the speedup only showed up for very small matrices; for large matrices the code became very slow compared to our previous scipy sparse implementation. I thought this made sense, but I nevertheless repeated the test from the original post (code below).

When using a 1000 x 1000 matrix, according to that post even the Python implementation should take roughly 0.01 s. Here are my results, though:

python : 769.6387 seconds
numpy : 0.0660 seconds
numba : 3.0779 seconds
scipy : 0.0030 seconds

What am I doing wrong to get such different results from the original post? I copied the functions and did not change anything. I tried both Python 3.5.1 (64 bit) and Python 2.7.10 (32 bit), and a colleague tried the same code with the same results. These are the results for a 100 x 100 matrix:

python : 0.6916 seconds
numpy : 0.0035 seconds
numba : 0.0015 seconds
scipy : 0.0035 seconds

Did I make some obvious mistake?

import numpy as np
import numba as nb
import scipy.sparse
import time


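# Context manager that reports the wall-clock time of the enclosed block.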
class benchmark(object):
    def __init__(self, name):
        self.name = name

    def __enter__(self):
        self.start = time.time()

    def __exit__(self, ty, val, tb):
        end = time.time()
        print("%s : %0.4f seconds" % (self.name, end-self.start))
        return False


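# Naive matrix multiplication: three nested pure-Python loops,
# i.e. m * n * p scalar multiply-adds.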
def dot_py(A, B):
    m, n = A.shape
    p = B.shape[1]

    C = np.zeros((m, p))

    for i in range(0, m):
        for j in range(0, p):
            for k in range(0, n):
                C[i, j] += A[i, k] * B[k, j]
    return C


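# Dense matrix product; np.dot delegates to BLAS when NumPy is built against one.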
def dot_np(A, B):
    C = np.dot(A,B)
    return C


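# For scipy.sparse matrices, the * operator performs matrix multiplication.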
def dot_scipy(A, B):
    C = A * B
    return C

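# Compile the pure-Python loop with Numba in nopython mode; the explicit
# float64 signature forces compilation up front instead of on the first call.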
dot_nb = nb.jit(nb.float64[:,:](nb.float64[:,:], nb.float64[:,:]), nopython=True)(dot_py)

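# Random sparse matrices with 1% nonzero entries, plus dense copies
# for the dense routines.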
dim_x = 1000
dim_y = 1000
a = scipy.sparse.rand(dim_x, dim_y, density=0.01)
b = scipy.sparse.rand(dim_x, dim_y, density=0.01)
a_full = a.toarray()
b_full = b.toarray()

print("starting test")

with benchmark("python"):
    dot_py(a_full, b_full)

with benchmark("numpy"):
    dot_np(a_full, b_full)

with benchmark("numba"):
    dot_nb(a_full, b_full)

with benchmark("scipy"):
    dot_scipy(a, b)

print("finishing test")

Edit:

For anyone seeing this at a later time: these are the results I got when using sparse n x n matrices (1% of elements are nonzero): runtime of nxn matrix multiplication

  • Probably running vendor BLAS; you can check this by running `np.__config__.show()` (see the snippet after these comments) and looking for anything like MKL, OpenBLAS, ATLAS, etc. Edit: You are also running sparse matrices, which will change things. SciPy actually has sparse BLAS, I think, while everything else will convert those matrices to dense. – Daniel Jun 23 '16 at 14:43
  • It seems I do not have any special BLAS installed. Wouldn't that impact scipy, numpy and maybe numba only, though? Or would the three nested loops in the Python function be optimized as well? – JanM Jun 23 '16 at 14:45
  • It would only affect scipy and numpy. – Daniel Jun 23 '16 at 15:02
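For reference, checking the BLAS backend takes a single call (this prints NumPy's build configuration and is not specific to this benchmark):

import numpy as np

# Look for sections such as mkl_info, openblas_info or atlas_info.
# If none appear, NumPy is using its unoptimized fallback routines.
np.__config__.show()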

1 Answer


In the linked Stack Overflow question where you got the code from, m = n = 3 and only p is variable, whereas here you are using m = n = p = 1000, which makes a huge difference in the timings.
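To put rough numbers on that: dot_py performs m * n * p multiply-adds, so the cost scales with the product of all three dimensions (the value of p below is purely illustrative, since p varied in the linked question):

m, n = 3, 3
p = 100000          # illustrative value; p was the variable dimension there
print(m * n * p)    # 900000 multiply-adds

m = n = p = 1000    # the dimensions used in this question
print(m * n * p)    # 1000000000 multiply-adds, roughly a thousandfold more work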

  • No problem. If this solves your issue, please accept the answer as the correct one. – JoshAdel Jun 23 '16 at 14:58
  • Actually, that kind of makes the initial comparisons somewhat worthless. It's easy to make a long zip index efficient. You can probably just loop over `np.vdot` and get the same performance at that point. – Daniel Jun 23 '16 at 15:03
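As an illustration of that last comment (a sketch only, not code from the original post; `dot_vdot` is a hypothetical name): keep the two outer loops in Python but let NumPy handle each inner dot product, which removes the innermost loop over k.

import numpy as np

def dot_vdot(A, B):
    # Same product as dot_py, but the k loop is replaced by a
    # vectorized inner product per (i, j) pair.
    m, n = A.shape
    p = B.shape[1]
    C = np.empty((m, p))
    for i in range(m):
        for j in range(p):
            C[i, j] = np.vdot(A[i, :], B[:, j])
    return C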