6

I wanted to create a distance proximity matrix for 10060 records/ points, where each record/point has 23 attributes using euclidean distance as metric. I wrote code using nested for loops to calculate distance between each point(leading to (n(n-1))/2) computations). It took a long time(about 8 minutes). When I used cdist it took so much lesser time(just 3 seconds !!!). When I looked at the source code, the cdist also uses nested for loops and moreover it makes n^2 computations(which is greater than the number of comparisons my logic does). What is making cdist execute faster and give correct output as well ? Please help me understand. Thanks in advance.

Sushodhan
  • 400
  • 1
  • 6
  • 16

1 Answers1

10

Where did you read the source code? The python code calls (if you follow it all the way down in the default metric='euclidean' case) the c code

static NPY_INLINE int
cdist_seuclidean(const double *XA, const double *XB, const double *var,
                 double *dm, const npy_intp num_rowsA, const npy_intp num_rowsB,
                 const npy_intp num_cols)
{
    npy_intp i, j;

    for (i = 0; i < num_rowsA; ++i) {
        const double *u = XA + (num_cols * i);
        for (j = 0; j < num_rowsB; ++j, ++dm) {
            const double *v = XB + (num_cols * j);
            *dm = seuclidean_distance(var, u, v, num_cols);
        }
    }
    return 0;
}

where seuclidean_distance is

static NPY_INLINE double
seuclidean_distance(const double *var, const double *u, const double *v,
                    const npy_intp n)
{
    double s = 0.0;
    npy_intp i;

    for (i = 0; i < n; ++i) {
        const double d = u[i] - v[i];
        s += (d * d) / var[i];
    }
    return sqrt(s);
}

So it's actually a triple loop, but this is highly optimised C code. Python for loops are slow, they take up a lot of overhead and should never be used with numpy arrays because scipy/numpy can take advantage of the underlying memory data held within an ndarray object in ways that python can't.

FHTMitchell
  • 11,793
  • 2
  • 35
  • 47
  • I found the code at this location "https://github.com/scipy/scipy/blob/v1.1.0/scipy/spatial/distance.py#L2285-L2673". And yes I also understand that its a triple loop(was just neglecting the fact that any how I would need a third loop inside to calculate the distance). But I am not able to understand, that the number of computations I am doing is same in the code(moreover my code does less computations), but still it works much faster than my code, why is it so ? – Sushodhan Aug 01 '18 at 10:50
  • 1
    Because it is in C. Using python loops on numpy objects is very slow and if you ever find yourself doing it, you are doing it wrong. I can't help you more unless you paste your code into your question. You've pasted the python code but that loop is not actually what is being run unless you pass a function, the actual algorithm is in the lines `cdist_fn = getattr(_distance_wrap, "cdist_%s_%s_wrap" % (metric_name, typ));` `cdist_fn(XA, XB, dm, **kwargs)` which calls c code. – FHTMitchell Aug 01 '18 at 10:52
  • Okay. I think this might be the only case, that it is written in C, that is why it is so fast. I found source code in python so got confused. Thanks for pointing that out. – Sushodhan Aug 01 '18 at 10:56
  • 1
    @SushodhanVaishampayan https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html has some examples of translating Python code to C (via Cython) and the resulting speedups… – Sam Mason Feb 04 '19 at 14:56
  • This is not an answer. I have revised the code and when calling with a custom function there is no C code involved and it is still blazingly fast (at least for **pdist**). The question is still unanswered. – Adrian Oct 26 '21 at 08:29