5

My Python program was too slow, so I profiled it and found that most of the time was being spent in a function that computes the distance between two points (a point is a list of 3 Python floats):

import math

def get_dist(pt0, pt1):
    val = 0
    for i in range(3):
        val += (pt0[i] - pt1[i]) ** 2
    val = math.sqrt(val)
    return val

To analyze why this function was so slow, I wrote two test programs, one in Python and one in C++, that do the same computation: they compute the distance between 1 million pairs of points. (The test code in Python and C++ is below.)

The Python computation takes 2 seconds, while C++ takes 0.02 seconds. A 100x difference!

Why is Python code so much slower than C++ code for such simple math computations? How do I speed it up to match the C++ performance?

The Python code used for testing:

import math, random, time

num = 1000000

# Generate random points and numbers

pt_list = []
rand_list = []

for i in range(num):
    pt = []
    for j in range(3):
        pt.append(random.random())
    pt_list.append(pt)
    rand_list.append(random.randint(0, num - 1))

# Compute

beg_time = time.clock()
dist = 0

for i in range(num):
    pt0 = pt_list[i]
    ri  = rand_list[i]
    pt1 = pt_list[ri]

    val = 0
    for j in range(3):
        val += (pt0[j] - pt1[j]) ** 2
    val = math.sqrt(val)

    dist += val

end_time = time.clock()
elap_time = (end_time - beg_time)

print elap_time
print dist

The C++ code used for testing:

#include <cstdlib>
#include <iostream>
#include <ctime>
#include <cmath>

struct Point
{
    double v[3];
};

int num = 1000000;

int main()
{
    // Allocate memory
    Point** pt_list = new Point*[num];
    int* rand_list = new int[num];

    // Generate random points and numbers
    for ( int i = 0; i < num; ++i )
    {
        Point* pt = new Point;

        for ( int j = 0; j < 3; ++j )
        {
            const double r = (double) rand() / (double) RAND_MAX;
            pt->v[j] = r;
        }

        pt_list[i] = pt;
        rand_list[i] = rand() % num;
    }

    // Compute

    clock_t beg_time = clock();
    double dist = 0;
    for ( int i = 0; i < num; ++i )
    {
        const Point* pt0 = pt_list[i];
        int r = rand_list[i];
        const Point* pt1 = pt_list[r];

        double val = 0;
        for ( int j = 0; j < 3; ++j )
        {
            const double d = pt0->v[j] - pt1->v[j];
            val += ( d * d );
        }

        val = sqrt(val);
        dist += val;
    }
    clock_t end_time = clock();
    double sec_time = (end_time - beg_time) / (double) CLOCKS_PER_SEC;

    std::cout << sec_time << std::endl;
    std::cout << dist << std::endl;

    return 0;
}
Ashwin Nanjappa
  • Because compiled code is always going to beat out a byte-code interpreted dynamic language? Use `numpy` for computation across such large datasets. – Martijn Pieters Apr 25 '13 at 09:02
  • Not an answer to your question, but have you considered using numpy? –  Apr 25 '13 at 09:02
  • @FredrikPihl Yes, that helped. It reduced it to 1.34s. Still 100x slower than C code. – Ashwin Nanjappa Apr 25 '13 at 09:10
  • @MartijnPieters Agree. But 100x difference for such simple code? – Ashwin Nanjappa Apr 25 '13 at 09:13
  • @Ashwin: You are not exactly using Python's strengths here, nor is your code the most efficient. Using local scope vs. global scope makes a difference; unrolling the loop and avoiding attribute dereferencing can help too. – Martijn Pieters Apr 25 '13 at 09:15
  • You could also try running this code with pypy (edit: for me, pypy was 6.5x faster than cpython). – John La Rooy Apr 25 '13 at 09:19
  • @Evert I cannot use numpy for representing the points. My actual program gets the points from another module and it needs to use many other methods of this module that take this point type as input. – Ashwin Nanjappa Apr 25 '13 at 09:23
  • Have you tried profiling your code, using e.g. [line_profiler](http://pythonhosted.org/line_profiler/), to see what takes longest? – Jan Schejbal Apr 25 '13 at 09:26
  • @JanSchejbal I profiled my actual program using cProfile and that is how I found that get_dist is taking all the time. – Ashwin Nanjappa Apr 25 '13 at 09:31
  • @gnibbler My actual program is built using lots of Python modules and third party libraries. I doubt I could use anything other than CPython right now. – Ashwin Nanjappa Apr 25 '13 at 09:33
  • @Ashwin - You should be able to use Cython; Cython is fully compatible with CPython code. It allows you to throw some types at your code, and produces a ".pyd" (CPython-API DLL). Just write the resource-hungry code in Cython, and import it with the rest of your code. You can even use vectors and other C functions when using Cython. The ones using your code don't need Cython - just read about it: http://www.cython.org/ – b0bz Apr 25 '13 at 09:41
  • @JHolta This is news to me! I did not know about Cython. I will have to see if I can use it with my actual program. Thanks! – Ashwin Nanjappa Apr 25 '13 at 09:51
  • I'd expect 100x difference *specifically* for "simple math computations" when comparing C vs. CPython. If a module produces millions of 3D points without using numpy, write a wrapper to get numpy arrays. Cython won't help you achieve C performance if you keep the loops in pure Python (`get_dist()` implemented in Cython would be almost instantaneous compared with `for i in xrange(num)` overhead (14ms for num=1000000 on my machine)). Cython interoperates with numpy arrays very well. You can use Cython if you can't express your computations as vectorized numpy operations. – jfs Apr 25 '13 at 10:03
  • @J.F.Sebastian Your summary is correct. I'm a newbie to numpy & Cython. I will explore these now. – Ashwin Nanjappa Apr 25 '13 at 10:10
  • @jfs many algorithms are simply not "vectorizable". For a fun datapoint, I have a program written both in Python & C/MPFR (software FP!). The MPFR version with 237-bit mantissa (octuple precision) is ~25% faster than Python's HW floating point! – m4r35n357 Aug 16 '23 at 10:35

5 Answers

6

A sequence of optimizations:

The original code, with small changes

import math, random, time

num = 1000000

# Generate random points and numbers

# Change #1: Sometimes it's good not to have too much randomness.
# This is one of those cases.
# Changing the code shouldn't change the results.
# Using a fixed seed ensures that the changes are valid.
# The final 'print dist' should yield the same result regardless of optimizations.
# Note: There's nothing magical about this seed.
# I randomly picked a commit hash from a git log.
random.seed (0x7126434a2ea2a259e9f4196cbb343b1e6d4c2fc8)
pt_list = []
rand_list = []

for i in range(num):
    pt = []
    for j in range(3):
        pt.append(random.random())
    pt_list.append(pt)

# Change #2: rand_list is computed in a separate loop.
# This ensures that upcoming optimizations will get the same results as
# this unoptimized version.
for i in range(num):
    rand_list.append(random.randint(0, num - 1))

# Compute

beg_time = time.clock()
dist = 0

for i in range(num):
    pt0 = pt_list[i]
    ri  = rand_list[i]
    pt1 = pt_list[ri]

    val = 0
    for j in range(3):
        val += (pt0[j] - pt1[j]) ** 2
    val = math.sqrt(val)

    dist += val

end_time = time.clock()
elap_time = (end_time - beg_time)

print elap_time
print dist


Optimization #1: Put the code in a function.

The first optimization is to embed all of the code except the import in a function. This simple change offers a 36% performance boost on my computer: names inside a function are local variables, which CPython resolves much faster than globals.
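
A minimal sketch of that wrapping: the body is the same computation moved into a function, and the list comprehensions make the same sequence of random calls as the two loops above, so the seeded results are unchanged.

import math, random, time

def main(num=1000000):
    random.seed(0x7126434a2ea2a259e9f4196cbb343b1e6d4c2fc8)

    # Same data as before; inside main(), these are fast local names.
    pt_list = [[random.random() for j in range(3)] for i in range(num)]
    rand_list = [random.randint(0, num - 1) for i in range(num)]

    beg_time = time.clock()
    dist = 0
    for i in range(num):
        pt0 = pt_list[i]
        pt1 = pt_list[rand_list[i]]
        val = 0
        for j in range(3):
            val += (pt0[j] - pt1[j]) ** 2
        dist += math.sqrt(val)
    end_time = time.clock()

    print end_time - beg_time
    print dist

main()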


Optimization #2: Eschew the ** operator.

You don't use pow(d,2) in your C code because everyone knows that is suboptimal in C. It's suboptimal in Python too: the ** operator dispatches through a general-purpose power routine, and that dispatch takes time. You know you want d*d, so write it directly. Here's the computation loop with that optimization:

for i in range(num):
    pt0 = pt_list[i]
    ri  = rand_list[i]
    pt1 = pt_list[ri]

    val = 0 
    for j in range(3):
        d = pt0[j] - pt1[j]
        val += d*d 
    val = math.sqrt(val)

    dist += val 
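
To see the operator overhead in isolation, here is a quick timeit check (a sketch; the absolute numbers vary with machine and Python version, but d*d consistently wins on CPython 2):

import timeit

# Time d**2 versus d*d on a plain float, ten million evaluations each.
print timeit.timeit('d ** 2', setup='d = 0.5', number=10000000)
print timeit.timeit('d * d', setup='d = 0.5', number=10000000)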


Optimization #3: Be pythonic.

Your Python code looks a whole lot like your C code. You aren't taking advantage of the language.

import math, random, time, itertools

def main (num=1000000) :
    # This small optimization speeds things up by a couple percent.
    sqrt = math.sqrt

    # Generate random points and numbers

    random.seed (0x7126434a2ea2a259e9f4196cbb343b1e6d4c2fc8)

    def random_point () :
        return [random.random(), random.random(), random.random()]

    def random_index () :
       return random.randint(0, num-1)

    # Big optimization:
    # Build both lists with list comprehensions instead of growing them
    # with repeated append calls in explicit loops.
    # Also, rand_pts stores each random partner point directly, so the
    # timed loop below never needs to index through rand_list at all.
    pt_list = [random_point() for i in xrange(num)]
    rand_pts = [pt_list[random_index()] for i in xrange(num)]


    # Compute

    beg_time = time.clock()
    dist = 0 

    # Don't loop over a range. That's too C-like.
    # Instead loop over some iterable, preferably one that doesn't create the
    # collection over which the iteration is to occur.
    # This is particularly important when the collection is large.
    for (pt0, pt1) in itertools.izip (pt_list, rand_pts) :

        # Small optimization: inner loop inlined,
        # intermediate variable 'val' eliminated.
        d0 = pt0[0]-pt1[0]
        d1 = pt0[1]-pt1[1]
        d2 = pt0[2]-pt1[2]

        dist += sqrt(d0*d0 + d1*d1 + d2*d2)

    end_time = time.clock()
    elap_time = (end_time - beg_time)

    print elap_time
    print dist


Update

Optimization #4: Use numpy

The following takes about 1/40th the time of the original version in the timed section of the code. Not quite as fast as C, but close.

Note the commented-out "Mondo slow" computation; it takes about ten times as long as the original version. There is an overhead cost to using numpy: the setup in the code that follows takes quite a bit longer than the setup in my non-numpy optimization #3.

Bottom line: You need to take care when using numpy, and the setup costs might be significant.

import numpy, random, time

def main (num=1000000) :

    # Generate random points and numbers

    random.seed (0x7126434a2ea2a259e9f4196cbb343b1e6d4c2fc8)

    def random_point () :
        return [random.random(), random.random(), random.random()]

    def random_index () :
       return random.randint(0, num-1)

    pt_list = numpy.array([random_point() for i in xrange(num)])
    rand_pts = pt_list[[random_index() for i in xrange(num)],:]

    # Compute

    beg_time = time.clock()

    # Mondo slow.
    # dist = numpy.sum (
    #            numpy.apply_along_axis (
    #                numpy.linalg.norm, 1, pt_list - rand_pts))

    # Mondo fast.
    dist = numpy.sum ((numpy.sum ((pt_list-rand_pts)**2, axis=1))**0.5)

    end_time = time.clock()
    elap_time = (end_time - beg_time)

    print elap_time
    print dist
David Hammen
4

Some general hints:

Move all your code into a main() function and use the normal

if __name__ == "__main__":
    main()

construct. It greatly improves speed because names inside a function are local variables, which CPython looks up much faster than globals. See Why does Python code run faster in a function? for an explanation of why.
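
A quick self-contained sketch of the scoping effect (timings are illustrative and machine-dependent):

import time

N = 10000000

# At module scope, 'total' and 'i' live in the module's globals
# dictionary, so every access in the loop is a dict lookup.
t0 = time.clock()
total = 0
for i in xrange(N):
    total += i
print "global scope: %.2fs" % (time.clock() - t0)

def run():
    # Inside a function, the same names are locals, which CPython
    # accesses by index into a fixed-size array; much cheaper.
    total = 0
    for i in xrange(N):
        total += i

t0 = time.clock()
run()
print "function scope: %.2fs" % (time.clock() - t0)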

Don't use range() for large counts, since in Python 2 it builds the complete list in memory; use xrange(), which produces values lazily, instead.

Fredrik Pihl
  • Yes, that helped. It reduced it to 1.34s. Still ~100x slower than C code. – Ashwin Nanjappa Apr 25 '13 at 09:11
  • @FredrikPihl *A point is a list of 3 Python floats.* Using `xrange` will slow it down because of the generator overhead in this case of small numbers – jamylak Apr 25 '13 at 09:12
  • I was thinking more of the line `for i in range(num)`, where num is 1000000. If it is Python 3, then yes, range is the same as xrange. Moving code into main() still holds, though. – Fredrik Pihl Apr 25 '13 at 09:14
  • @FredrikPihl Thanks for this tip, which I was not aware of. But sadly, this doesn't help my problem since get_dist is part of a larger program. See my updated question. – Ashwin Nanjappa Apr 25 '13 at 09:26
3

Python is not a fast language: it is not compiled to machine code but runs in the Python virtual machine, and everything is an object, so you don't have static typing as in C. That alone slows it down a lot. Anyway, that's not my area, so I won't say too much about it.

You should consider PyPy, Cython, or maybe even writing a Python extension in C.

I ran the code in PyPy and it took 250 ms, which may be what you're looking for. I also wrote a quick test for Cython and managed to get it down to 500 ms.

So when speed is really important, your best bet is PyPy or Cython.

b0bz
  • I do not see why this was down-voted. Cython is a great way of optimising the code when speed is needed. It's easy to handle and simple to maintain. – b0bz Apr 25 '13 at 09:34
  • I cannot use anything other than CPython right now. My actual program is built using lots of third party libraries. – Ashwin Nanjappa Apr 25 '13 at 09:35
  • @Ashwin: Cython is not a separate implementation. Cython works *together* with CPython; you use a subset of Python that is then compiled to optimized C, and that extension is then run *within CPython*. – Martijn Pieters Apr 25 '13 at 10:15
  • @MartijnPieters I realize that now :-) I had assumed it yet another Python implementation like Jython and IronPython. – Ashwin Nanjappa Apr 25 '13 at 10:17
2

You can't expect to match C++ performance in Python; however, you can tweak the Python code a bit to make it faster:

def get_dist(pt0, pt1):
    val = 0
    for i in range(3):
        val += (pt0[i] - pt1[i]) ** 2
    val = math.sqrt(val)
    return val

The for loop in this code and your C++ for loop are completely different: in Python 2, range(3) first creates the list [0, 1, 2] and the loop then iterates over it, while the C++ version simply increments a variable. If you want to speed up the Python version, the best way is to write the loop body out explicitly, sparing the overhead of the Python for loop.

def get_dist(pt0, pt1, sqrt=math.sqrt):  # cache the function at definition time
    return sqrt((pt0[0] - pt1[0]) ** 2 +
                (pt0[1] - pt1[1]) ** 2 +
                (pt0[2] - pt1[2]) ** 2)

And that's probably as fast as you can get for that particular function without using numpy. There are other things you can improve in your main code too.

jamylak
  • *And that's probably as fast as you can get*. Using `numpy` will make this a *lot* faster still without moving wholesale to C or C++. – Martijn Pieters Apr 25 '13 at 09:13
  • @MartijnPieters Oh right, I should mention "without using numpy" thanks – jamylak Apr 25 '13 at 09:13
  • Also, you could 'cache' the `sqrt` function with `get_dist(pt0, pt1, sqrt=math.sqrt): return sqrt((pt0[0] - pt1[0]) ** 2 + (pt0[1] - pt1[1]) ** 2 + (pt0[2] - pt1[2]) ** 2)` – Martijn Pieters Apr 25 '13 at 09:16
  • This [link](http://stackoverflow.com/questions/327002/which-is-faster-in-python-x-5-or-math-sqrtx) says math.sqrt() is faster – Fredrik Pihl Apr 25 '13 at 09:16
  • Oh didn't know that, I'll change it – jamylak Apr 25 '13 at 09:17
  • Unrolling the loop, helped a little bit. The time dropped from 1.34s to 1.18s. – Ashwin Nanjappa Apr 25 '13 at 09:30
  • @MartijnPieters It would be great if you could add an answer showing the optimized code using numpy for get_dist – Ashwin Nanjappa Apr 25 '13 at 09:53
  • @MartijnPieters I added `pt_list = numpy.array(pt_list)` before doing the distance computation. Code got further slower: 4.6s – Ashwin Nanjappa Apr 25 '13 at 10:29
  • @Ashwin: Are you letting `numpy` do the calculations? Did you specify the array dimensionality and dtypes? *Just* using the array to hold the values is not going to do anything for you as you then have to translate between numpy values and Python values for all rows. – Martijn Pieters Apr 25 '13 at 10:32
  • Using numpy at this level will not help. Just the opposite: It will increase computation times, by quite a bit. There's a significant overhead involved in converting Python lists into numpy ndarrays. To take advantage of numpy you need to make it operate on multidimensional arrays rather than vectors. – David Hammen Apr 25 '13 at 21:16
0

This page is getting really confusing and most of the answers are actually in comments, so here’s a quick overview of possible optimizations:

  • jamylak's answer: optimize your Python code:

    def get_dist(pt0, pt1, sqrt=math.sqrt):  # cache the function at definition time
        return sqrt((pt0[0] - pt1[0]) ** 2 +
                    (pt0[1] - pt1[1]) ** 2 +
                    (pt0[2] - pt1[2]) ** 2)
    
  • Use the numpy module for the calculations (see the sketch after this list)

  • Run your code with PyPy instead of CPython
  • Compile the time-critical code with Cython
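
For the numpy route, a minimal vectorized sketch (the names pt_arr, idx, and diff are illustrative; the setup mirrors the question's):

import numpy

num = 1000000
numpy.random.seed(42)  # any fixed seed, just for reproducibility

# All points as one (num, 3) array, plus a random partner index per point.
pt_arr = numpy.random.rand(num, 3)
idx = numpy.random.randint(0, num, size=num)

# One vectorized pass replaces the million-iteration Python loop.
diff = pt_arr - pt_arr[idx]
dist = numpy.sqrt((diff ** 2).sum(axis=1)).sum()
print dist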
Chronial
  • You missed the 'move the code into a function' tip. Running the supplied code on my system: 4.8s; moving the otherwise unchanged code into main(): 3.5s; running it with pypy: 0.65s. numpy probably helps more but would need code changes. – Duncan Apr 25 '13 at 10:27
  • At some point it becomes quicker to just learn C or Fortran ;) – m4r35n357 Aug 16 '23 at 10:40