
I have two machines, my own and a server. The task is to compute cosine distances between many vectors and one given vector. After normalization, cosine distance reduces to just a dot product:

cosine(u,v) = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
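
A quick sanity check of this equivalence on unit-norm vectors (a minimal sketch; scipy's reference implementation is used only for comparison):

import numpy as np
from scipy.spatial.distance import cosine

u = np.random.randn(300); u /= np.linalg.norm(u)
v = np.random.randn(300); v /= np.linalg.norm(v)
# for unit-norm vectors, cosine distance is exactly 1 - dot product
assert np.isclose(cosine(u, v), 1.0 - np.dot(u, v))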

So I decided to normalize my vectors and got a 3x speedup. BUT there is one little catch: some of the vectors can be all-NaN vectors:

nullvect = [np.nan]*300

This is an essential part of my model. Consider the following experiment comparing scipy.spatial.distance.cdist, np.dot, and a dot product called directly from the BLAS library:

import numpy as np

from scipy.spatial.distance import cdist

# import BLAS version
import scipy.linalg
# NumPy somehow does not want to use the optimized version, so call BLAS directly
gemv = scipy.linalg.get_blas_funcs("gemv")
dot = lambda x, y: gemv(1, x, y)  # computes 1 * x @ y via gemv

# generate data
vectors = np.random.randn(1000,300)
# normalize and write contiguously
vectors = np.ascontiguousarray([v/np.linalg.norm(v) for v in vectors])
# all-NaN placeholder vector
nullvect = [np.nan]*300
# replace roughly half of the vectors with NaN vectors
vecs = []
for vector in vectors:
    if np.random.randint(2) == 1:
        vecs += [nullvect]
    else:
        vecs += [vector]        
vecs = np.ascontiguousarray(vecs)
# normalize the target vector as well
vec = np.random.randn(300)
vec = vec/np.linalg.norm(vec)
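
# quick check (a sketch) that the BLAS wrapper agrees with numpy on clean data
assert np.allclose(dot(vectors, vec), vectors.dot(vec))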

import time

def eval_func():
    # cosine distances via scipy
    start = time.time()
    dists = cdist([vec], vecs, metric='cosine').ravel()
    end = time.time()
    print("Overall computation took " + str((end-start)*1000) + " ms")
    # np.dot distances
    start = time.time()
    dists = 1 - vecs.dot(vec.T)
    end = time.time()
    print("Overall computation took " + str((end-start)*1000) + " ms")
    # BLAS distances
    start = time.time()
    dists = 1 - dot(vecs, vec.T)
    end = time.time()
    print("Overall computation took " + str((end-start)*1000) + " ms")

eval_func()
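
Aside: for calls this short, timeit over many iterations gives steadier numbers than a single time.time() pair; a minimal sketch, reusing vecs and vec from above:

import timeit

n = 1000
# average over many repetitions to smooth out scheduler and cache noise
t = timeit.timeit(lambda: vecs.dot(vec), number=n)
print("np.dot: %.4f ms per call" % (t / n * 1000))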

On my local machine everything is OK; the plain dot product is much more than 3x faster:

Overall computation took 5.55992126465 ms
Overall computation took 0.340938568115 ms
Overall computation took 11.9209289551 ms

But on the server something is wrong: np.dot is complete trash!

Overall computation took 5.462169647216797 ms
Overall computation took 81.59875869750977 ms
Overall computation took 15.769004821777344 ms

Both computers run Ubuntu 16.04; numpy 1.14 was installed via pip after OpenBLAS had been installed via apt-get.

On both computers the picture is the same:

>>> np.__config__.show()
lapack_mkl_info:
  NOT AVAILABLE
blas_opt_info:
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
    libraries = ['openblas', 'openblas']
lapack_opt_info:
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
    libraries = ['openblas', 'openblas']
blis_info:
  NOT AVAILABLE
openblas_lapack_info:
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
    libraries = ['openblas', 'openblas']
blas_mkl_info:
  NOT AVAILABLE
openblas_info:
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
    libraries = ['openblas', 'openblas']


# update-alternatives --config libblas.so.3
There are 2 choices for the alternative libblas.so.3 (providing /usr/lib/libblas.so.3).

  Selection    Path                                 Priority   Status
------------------------------------------------------------
* 0            /usr/lib/openblas-base/libblas.so.3   40        auto mode
  1            /usr/lib/libblas/libblas.so.3         10        manual mode
  2            /usr/lib/openblas-base/libblas.so.3   40        manual mode

The server runs Python 3.6.2; my local machine runs 3.5.2. Without NaNs in the data everything is OK.

The server is a Xeon with 8 processors; my computer is an i7 laptop with 4 cores (so the server should definitely beat me on parallelism).

EDIT: If I use vectors instead of vecs (no NaNs), the np.dot result is still very erratic: it can be either bigger or smaller than the cosine timing, e.g. 1 ms on one run and 7 ms on the next, and vice versa. The results for the two other methods are stable.

Some runs without nans (vectors):

Overall computation took5.519390106201172 ms
Overall computation took1.5783309936523438 ms
Overall computation took8.22305679321289 ms
>>> eval_func()
Overall computation took5.586147308349609 ms
Overall computation took2.3889541625976562 ms
Overall computation took8.22305679321289 ms
>>> eval_func()
Overall computation took5.342960357666016 ms
Overall computation took8.533000946044922 ms
Overall computation took8.122920989990234 ms
>>> eval_func()
Overall computation took5.633831024169922 ms
Overall computation took1.8835067749023438 ms
Overall computation took8.115768432617188 ms
>>> eval_func()
Overall computation took5.548000335693359 ms
Overall computation took1.7952919006347656 ms
Overall computation took8.14199447631836 ms
>>> eval_func()
Overall computation took5.550384521484375 ms
Overall computation took2.02178955078125 ms
Overall computation took8.759498596191406 ms
>>> eval_func()
Overall computation took5.5217742919921875 ms
Overall computation took13.997554779052734 ms
Overall computation took8.320808410644531 ms
>>> eval_func()
Overall computation took5.490779876708984 ms
Overall computation took2.6559829711914062 ms
Overall computation took9.52005386352539 ms
>>> eval_func()
Overall computation took5.395412445068359 ms
Overall computation took2.0003318786621094 ms
Overall computation took8.37564468383789 ms
>>> eval_func()
Overall computation took5.440711975097656 ms
Overall computation took2.224445343017578 ms
Overall computation took8.237361907958984 ms

Some runs with nans (vecs):

>>> eval_func()
Overall computation took5.4454803466796875 ms
Overall computation took26.26657485961914 ms
Overall computation took7.8868865966796875 ms
>>> eval_func()
Overall computation took5.366802215576172 ms
Overall computation took19.97232437133789 ms
Overall computation took7.729530334472656 ms
>>> eval_func()
Overall computation took5.433559417724609 ms
Overall computation took19.736289978027344 ms
Overall computation took7.9345703125 ms
>>> eval_func()
Overall computation took5.447864532470703 ms
Overall computation took18.61095428466797 ms
Overall computation took8.14676284790039 ms
>>> eval_func()
Overall computation took5.567073822021484 ms
Overall computation took1.926422119140625 ms
Overall computation took7.729530334472656 ms
>>> eval_func()
Overall computation took5.640983581542969 ms
Overall computation took20.401477813720703 ms
Overall computation took7.8868865966796875 ms

EDIT 2:

Setting

import os
os.environ['OMP_NUM_THREADS']='1'

didn't help.
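
(One caveat I know of: OpenBLAS typically reads the thread-count variables when the library is first loaded, so they have to be set before numpy is imported; a minimal sketch:)

import os
# must be set before numpy (and thus OpenBLAS) is loaded
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'

import numpy as np  # imported only after the environment is configured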

Maybe I should switch to some other library?


EDIT3:

On both machines

>>> import inspect
>>> inspect.getmodule(np.dot)
<module 'numpy.core.multiarray' from '/usr/local/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so'>

However, it looks like this can differ between setups.


EDIT 4: Biggest surprise: np.inner() seems to work OK (it is equivalent to np.dot for matrix-vector multiplication, though usually a bit slower):

def eval_func():
    # cosine distances via scipy
    start = time.time()
    dists = cdist([vec], vecs, metric='cosine').ravel()
    end = time.time()
    print("Overall computation took " + str((end-start)*1000) + " ms")
    # np.dot distances
    start = time.time()
    dists = 1 - vecs.dot(vec.T)
    end = time.time()
    print("Overall computation took " + str((end-start)*1000) + " ms")
    # BLAS distances
    start = time.time()
    dists = 1 - dot(vecs, vec.T)
    end = time.time()
    print("Overall computation took " + str((end-start)*1000) + " ms")

    # np.inner distances
    start = time.time()
    dists = 1 - np.inner(vecs, vec.T)
    end = time.time()
    print("Overall computation took " + str((end-start)*1000) + " ms")

eval_func()


>>> eval_func()
Overall computation took5.500316619873047 ms
Overall computation took37.69874572753906 ms
Overall computation took8.03232192993164 ms
Overall computation took0.6771087646484375 ms
>>> eval_func()
Overall computation took5.517005920410156 ms
Overall computation took1.811981201171875 ms
Overall computation took7.5435638427734375 ms
Overall computation took0.6580352783203125 ms
>>> eval_func()
Overall computation took5.688667297363281 ms
Overall computation took38.02061080932617 ms
Overall computation took7.672309875488281 ms
Overall computation took0.5245208740234375 ms
>>> eval_func()
Overall computation took5.614757537841797 ms
Overall computation took38.6500358581543 ms
Overall computation took8.053779602050781 ms
Overall computation took0.6699562072753906 ms
>>> eval_func()
Overall computation took5.481243133544922 ms
Overall computation took39.615631103515625 ms
Overall computation took7.724761962890625 ms
Overall computation took0.5125999450683594 ms
>>> eval_func()
Overall computation took5.562305450439453 ms
Overall computation took2.0051002502441406 ms
Overall computation took7.450580596923828 ms
Overall computation took0.5030632019042969 ms
>>> eval_func()
Overall computation took5.252361297607422 ms
Overall computation took3.883838653564453 ms
Overall computation took7.433891296386719 ms
Overall computation took0.705718994140625 ms
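
A quick sanity check (a sketch) that the two paths agree numerically; equal_nan=True makes the NaN rows compare equal:

assert np.allclose(vecs.dot(vec), np.inner(vecs, vec), equal_nan=True)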

EDIT 5: So I replaced cdist with np.inner in my program and got a slowdown! Here is a comparison of timings in ms between cdist (first) and np.inner (second); the last tuple is the shape of the vectors array:

2018-02-07 23:12:37,609 0.6237030029296875 20.6756591796875 (1381, 300)
2018-02-07 23:12:37,612 0.6270408630371094 0.7233619689941406 (1381, 300)
2018-02-07 23:12:37,614 2.070188522338867 4.981040954589844 (2879, 300)
2018-02-07 23:12:37,630 19.373416900634766 28.1679630279541 (2387, 300)
2018-02-07 23:12:37,641 0.6058216094970703 28.65886688232422 (1007, 300)
2018-02-07 23:12:37,642 1.2252330780029297 12.749433517456055 (2879, 300)
2018-02-07 23:12:37,649 29.929637908935547 76.58910751342773 (9151, 300)
2018-02-07 23:12:37,649 1.0759830474853516 15.654802322387695 (2387, 300)
2018-02-07 23:12:37,653 0.5512237548828125 6.318569183349609 (1007, 300)
2018-02-07 23:12:37,664 0.4935264587402344 6.627321243286133 (1007, 300)
2018-02-07 23:12:37,670 1.275777816772461 11.78598403930664 (2879, 300)
2018-02-07 23:12:37,679 0.47206878662109375 10.010242462158203 (1007, 300)
2018-02-07 23:12:37,679 0.46634674072265625 16.28565788269043 (922, 300)
2018-02-07 23:12:37,683 0.7443428039550781 29.844284057617188 (1381, 300)
2018-02-07 23:12:37,686 26.117801666259766 36.93413734436035 (9929, 300)
2018-02-07 23:12:37,693 1.2125968933105469 13.280630111694336 (2879, 300)
2018-02-07 23:12:37,698 0.4277229309082031 6.974458694458008 (922, 300)
2018-02-07 23:12:37,698 0.5869865417480469 8.502006530761719 (1381, 300)
2018-02-07 23:12:37,708 0.48828125 22.032976150512695 (1007, 300)
2018-02-07 23:12:37,710 0.4360675811767578 6.644725799560547 (922, 300)
2018-02-07 23:12:37,713 0.6110668182373047 8.195638656616211 (1381, 300)
2018-02-07 23:12:37,719 1.1529922485351562 12.442350387573242 (2879, 300)
2018-02-07 23:12:37,720 0.4718303680419922 5.584001541137695 (1007, 300)
2018-02-07 23:12:37,725 0.46443939208984375 13.801336288452148 (922, 300)
2018-02-07 23:12:37,729 0.5846023559570312 2.321958541870117 (1381, 300)
2018-02-07 23:12:37,737 0.4661083221435547 4.40669059753418 (922, 300)
2018-02-07 23:12:37,746 0.09751319885253906 0.033855438232421875 (4, 300)
2018-02-07 23:12:37,749 0.4220008850097656 6.46662712097168 (922, 300)
2018-02-07 23:12:37,750 0.07700920104980469 0.030040740966796875 (4, 300)
2018-02-07 23:12:37,751 0.16117095947265625 26.17502212524414 (62, 300)
2018-02-07 23:12:37,751 0.14591217041015625 19.942522048950195 (37, 300)
2018-02-07 23:12:37,752 0.0743865966796875 0.029325485229492188 (4, 300)
2018-02-07 23:12:37,753 0.1418590545654297 1.1310577392578125 (37, 300)
2018-02-07 23:12:37,753 0.07486343383789062 0.019788742065429688 (4, 300)
2018-02-07 23:12:37,754 0.10466575622558594 0.9295940399169922 (37, 300)
2018-02-07 23:12:37,755 0.07486343383789062 0.029802322387695312 (4, 300)
2018-02-07 23:12:37,755 0.10275840759277344 0.8199214935302734 (37, 300)
2018-02-07 23:12:37,757 0.10943412780761719 0.7965564727783203 (37, 300)
2018-02-07 23:12:37,758 0.11110305786132812 0.8292198181152344 (37, 300)
2018-02-07 23:12:37,758 0.07224082946777344 0.029325485229492188 (4, 300)
2018-02-07 23:12:37,765 8.567333221435547 22.03059196472168 (9929, 300)
2018-02-07 23:12:37,771 0.12874603271484375 17.62080192565918 (62, 300)
2018-02-07 23:12:37,776 0.12159347534179688 2.850055694580078 (62, 300)
2018-02-07 23:12:37,782 4.766225814819336 44.742584228515625 (9151, 300)
2018-02-07 23:12:37,783 0.14066696166992188 4.610538482666016 (62, 300)
2018-02-07 23:12:37,786 0.12326240539550781 0.9648799896240234 (62, 300)
2018-02-07 23:12:37,788 0.12063980102539062 0.9310245513916016 (62, 300)
2018-02-07 23:12:37,814 0.26988983154296875 26.521682739257812 (428, 300)

I just don't understand what's going on...


EDIT 6: I reinstalled the docker image with the ATLAS library, but the newly installed numpy still reports in np.__config__.show() that it uses OpenBLAS, which is not even present on the system.


EDIT 7: I rebuilt numpy against ATLAS. Now the results for np.dot and np.inner are finally stable, though only about twice as fast as scipy's cdist. So it seems to be an OpenBLAS issue on the Amazon EC2 server; there the results are entirely stochastic.

I've checked the situation outside docker; it's the same.
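
For reference, one workaround I'm considering (my own sketch, untested on the server): run the BLAS call only over the NaN-free rows and fill the NaN rows in afterwards, so the optimized kernel never sees a NaN:

def nan_safe_dists(vecs, vec):
    # rows containing any NaN (here the null vectors are all-NaN)
    nan_rows = np.isnan(vecs).any(axis=1)
    dists = np.full(len(vecs), np.nan)
    # fancy indexing copies the clean rows, so BLAS only sees finite data
    dists[~nan_rows] = 1.0 - vecs[~nan_rows].dot(vec)
    return dists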

  • 1. Is the server dedicated to just you? If it was heavily tasked with other services at the time, it may not have been. 2. Do you have access to all cores on the server? i.e., can you ACTUALLY use the maximum parallelism for `np.dot`? – Chrispresso Feb 07 '18 at 21:06
  • Server is on Amazon. `top` shows no other processes; the benchmark with the awful `np.dot` varies a lot (it can be 33 ms, 80 ms, ...), but is always much bigger than the other two. Even on a single core, a dot product can never be slower than cosine, which is 3x more work. – Slowpoke Feb 07 '18 at 21:09
  • @ZWiki I am running inside a docker container, but I can see the output of the top command from the server shell itself. – Slowpoke Feb 07 '18 at 21:10
  • @ZWiki Maybe the cloud server is indeed somewhat virtual and OpenBLAS is somehow sensitive to the real workload? How can I make it work normally then? – Slowpoke Feb 07 '18 at 21:21
  • Please measure the performance over at least 1000 iterations; your results can be misleading. Measuring performance with time.time() on a function taking only a few ms is not recommended... – max9111 Feb 08 '18 at 11:44

0 Answers