I wanted to create a distance proximity matrix for 10060 records/ points, where each record/point has 23 attributes using euclidean distance as metric. I wrote code using nested for loops to calculate distance between each point(leading to (n(n-1))/2) computations). It took a long time(about 8 minutes). When I used cdist it took so much lesser time(just 3 seconds !!!). When I looked at the source code, the cdist also uses nested for loops and moreover it makes n^2 computations(which is greater than the number of comparisons my logic does). What is making cdist execute faster and give correct output as well ? Please help me understand. Thanks in advance.
Asked
Active
Viewed 5,602 times
1 Answers
10
Where did you read the source code? The python code calls (if you follow it all the way down in the default metric='euclidean'
case) the c code
static NPY_INLINE int
cdist_seuclidean(const double *XA, const double *XB, const double *var,
double *dm, const npy_intp num_rowsA, const npy_intp num_rowsB,
const npy_intp num_cols)
{
npy_intp i, j;
for (i = 0; i < num_rowsA; ++i) {
const double *u = XA + (num_cols * i);
for (j = 0; j < num_rowsB; ++j, ++dm) {
const double *v = XB + (num_cols * j);
*dm = seuclidean_distance(var, u, v, num_cols);
}
}
return 0;
}
where seuclidean_distance
is
static NPY_INLINE double
seuclidean_distance(const double *var, const double *u, const double *v,
const npy_intp n)
{
double s = 0.0;
npy_intp i;
for (i = 0; i < n; ++i) {
const double d = u[i] - v[i];
s += (d * d) / var[i];
}
return sqrt(s);
}
So it's actually a triple loop, but this is highly optimised C code. Python for
loops are slow, they take up a lot of overhead and should never be used with numpy arrays because scipy/numpy can take advantage of the underlying memory data held within an ndarray
object in ways that python can't.

FHTMitchell
- 11,793
- 2
- 35
- 47
-
I found the code at this location "https://github.com/scipy/scipy/blob/v1.1.0/scipy/spatial/distance.py#L2285-L2673". And yes I also understand that its a triple loop(was just neglecting the fact that any how I would need a third loop inside to calculate the distance). But I am not able to understand, that the number of computations I am doing is same in the code(moreover my code does less computations), but still it works much faster than my code, why is it so ? – Sushodhan Aug 01 '18 at 10:50
-
1Because it is in C. Using python loops on numpy objects is very slow and if you ever find yourself doing it, you are doing it wrong. I can't help you more unless you paste your code into your question. You've pasted the python code but that loop is not actually what is being run unless you pass a function, the actual algorithm is in the lines `cdist_fn = getattr(_distance_wrap, "cdist_%s_%s_wrap" % (metric_name, typ));` `cdist_fn(XA, XB, dm, **kwargs)` which calls c code. – FHTMitchell Aug 01 '18 at 10:52
-
Okay. I think this might be the only case, that it is written in C, that is why it is so fast. I found source code in python so got confused. Thanks for pointing that out. – Sushodhan Aug 01 '18 at 10:56
-
1@SushodhanVaishampayan https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html has some examples of translating Python code to C (via Cython) and the resulting speedups… – Sam Mason Feb 04 '19 at 14:56
-
This is not an answer. I have revised the code and when calling with a custom function there is no C code involved and it is still blazingly fast (at least for **pdist**). The question is still unanswered. – Adrian Oct 26 '21 at 08:29