
I have a sparse csc_matrix named eventPropMatrix, with dtype float64 and shape (13000, 7), to which I am applying the following distance calculation. Here

eventPropMatrix.getrow(i).todense()==[[0. 0. 0. 0. 0. 0. 0.]]

eventPropMatrix.getrow(j).todense()==[[0. 0. 0. 0. 0. 0. 0.]]

import warnings
import scipy.spatial.distance

with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    epsim = scipy.spatial.distance.correlation(eventPropMatrix.getrow(i).todense(), eventPropMatrix.getrow(j).todense())
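
The surrounding loop is not shown in the question, but presumably the call above runs once per unordered pair of rows, roughly as in this hypothetical sketch:

# Hypothetical reconstruction of the surrounding loop (not shown in the question).
n = eventPropMatrix.shape[0]        # 13000 rows
for i in range(n):
    for j in range(i + 1, n):       # 13000 * 12999 / 2, about 84.5 million (i, j) pairs
        ...                         # the correlation call shown above

That is tens of millions of Python-level calls, each of which also builds two dense rows via getrow(...).todense(), so the run time is dominated by interpreter and object-creation overhead rather than by the arithmetic itself.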

For reference, scipy.spatial.distance.correlation is implemented as follows:

def correlation(u, v, w=None, centered=True):
    """
    Compute the correlation distance between two 1-D arrays.

    The correlation distance between `u` and `v`, is
    defined as

    .. math::

        1 - \\frac{(u - \\bar{u}) \\cdot (v - \\bar{v})}
                  {{||(u - \\bar{u})||}_2 {||(v - \\bar{v})||}_2}

    where :math:`\\bar{u}` is the mean of the elements of `u`
    and :math:`x \\cdot y` is the dot product of :math:`x` and :math:`y`.

    Parameters
    ----------
    u : (N,) array_like
        Input array.
    v : (N,) array_like
        Input array.
    w : (N,) array_like, optional
        The weights for each value in `u` and `v`. Default is None,
        which gives each value a weight of 1.0

    Returns
    -------
    correlation : double
        The correlation distance between 1-D array `u` and `v`.

    """
    u = _validate_vector(u)
    v = _validate_vector(v)
    if w is not None:
        w = _validate_weights(w)
    if centered:
        umu = np.average(u, weights=w)
        vmu = np.average(v, weights=w)
        u = u - umu
        v = v - vmu
    uv = np.average(u * v, weights=w)
    uu = np.average(np.square(u), weights=w)
    vv = np.average(np.square(v), weights=w)
    dist = 1.0 - uv / np.sqrt(uu * vv)
    return dist

Here I get "nan" as the return value most of the time, because uu = 0.0 and vv = 0.0.
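
This is easy to reproduce directly; a minimal sketch, assuming two all-zero rows like the ones shown above:

import numpy as np
from scipy.spatial.distance import correlation

u = np.zeros(7)
v = np.zeros(7)
# Both vectors are constant, so uu == vv == 0.0 and the division produces nan,
# along with the RuntimeWarning that the question suppresses.
print(correlation(u, v))  # nan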

My problem is that for the 13,000 rows this calculation takes too much time: it has been running for the last 15+ hours (i5 8th Gen, 4-core processor, 12 GB RAM, Ubuntu). Is there any way around this huge calculation? I am considering Cythonizing the code into C and then compiling and running it. Will this help, and if so, how do I do it?
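
Before reaching for Cython, note that most of the run time is likely Python-level overhead: one getrow(...).todense() pair and one correlation() call per row pair, roughly 84.5 million times. A minimal vectorized sketch, assuming the goal is the full pairwise correlation-distance matrix for eventPropMatrix and that it fits in memory:

from scipy.spatial.distance import pdist, squareform

# Densify once: a 13000 x 7 float64 array is well under 1 MB.
X = eventPropMatrix.toarray()

# One C-level call computes all pairwise correlation distances, replacing the
# per-pair Python calls. The condensed result has n*(n-1)/2 float64 entries,
# roughly 0.7 GB for n = 13000.
dists = pdist(X, metric="correlation")

# Optional: expand to a full 13000 x 13000 matrix (about 1.35 GB).
dist_matrix = squareform(dists)

All-zero rows still yield nan entries (with the same RuntimeWarning as before), so they would need to be filtered out or handled separately, just as in the loop version.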

afghani
  • Cython probably won't solve it for you. It might make it a bit faster, but I doubt by much if you're waiting over 15 hours for your algorithm to complete. The problem is with the algorithm. – Todd Feb 18 '20 at 08:23
  • Maybe you can get insight into what section of your code is consuming the most time with cProfile - here's another thread on profiling: https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script#582337 – Todd Feb 18 '20 at 08:27
  • Do you have some package limitation? E.g. from sklearn import metrics; PropMatrix_TMP = PropMatrix - np.mean(PropMatrix, axis=1).reshape(-1,1); res = metrics.pairwise.cosine_distances(PropMatrix_TMP, PropMatrix_TMP) does the job in about 3 s. Is that efficient enough? There are very likely further speedups possible... – max9111 Feb 18 '20 at 13:23
  • Please refer to the **Events** class in BaseData.py (lines 226 to 280): http://sujitpal.blogspot.com/2013/02/my-solution-to-kaggle-event.html – afghani Feb 18 '20 at 13:57

0 Answers