
I noticed that Python's built-in sum function is roughly 3x faster than a for loop when summing a list of 1 000 000 integers:

import timeit

def sum1():
    s = 0
    for i in range(1000000):
        s += i
    return s

def sum2():
    return sum(range(1000000))

print 'For Loop Sum:', timeit.timeit(sum1, number=10)
print 'Built-in Sum:', timeit.timeit(sum2, number=10)

# Prints:
# For Loop Sum: 0.751425027847
# Built-in Sum: 0.266746997833

Why is that? How is sum implemented?

Michal
  • `sum` is implemented in C inside the Python interpreter, while your for loop has to be interpreted; it's normal that it's slower. – Matteo Italia Jul 04 '14 at 17:47
  • In *CPython* built-in functions are much faster than the pure-Python translation. This is why a good way to optimize, *for CPython*, is to let built-in functions do as much work as possible. Note that this changes completely using other implementations, such as PyPy. – Bakuriu Jul 04 '14 at 17:49
  • What about using numpy? You of course would need to make the array first, so for a one-time use, I think it's a bit (a bunch) slower; but if you've already got the array handy, I think `arr.sum()` is faster. – dwanderson Jul 04 '14 at 18:31
  • @dwanderson I don't know whether it would be slower even on a one-time use. Getting the value from a number is quite easy and efficient, so creating the array will probably take less time than summing the numbers (which also requires performing additions). Then computing the sum should take *much* less time, so it might be faster. However numpy has one big problem: it uses fixed-size integers, so with long arrays it can easily overflow, or you have to use an array of `object`s, which would hurt performance a lot. – Bakuriu Jul 04 '14 at 19:08
  • @Bakuriu: see my answer. It is *much* faster at least with large data. – DrV Jul 04 '14 at 19:09
  • @DrV dwanderson and I were speaking of dealing with the summation of a *python list* (which can be arbitrary, `range` is just one case [and a useless one too]) converting the list first to an array and then calling `.sum()`. In your answer you don't profile this case, which is what matters. It's not always possible to use a numpy array from the start for different reasons, so you must take into account the time to convert from python list to array. – Bakuriu Jul 04 '14 at 19:14
  • @Bakuriu: I did not mention that case, because it is awfully slow. I edited it into my answer. It is quite funny it is so slow, as the data is exactly in the same form in memory. A simple memory allocation and copy should not take that much time. I guess `numpy` cannot access the raw form of `list` but is forced to use slow access methods. Pity, this has bitten me too many times. – DrV Jul 04 '14 at 19:18
  • @DrV Note that integers are *objects* in python. So the list is actually an array of pointers to `struct`s that contain a field that has the integer value, so there's a lot of dereferencing to be done. Also it should be *much* faster in python2 with lists of small `int`s instead of `long`s. In python3 all integers are `long`s and thus a conversion is necessary to obtain the actual value, which might be costly. In any case: creating a numpy array from a list of integer is *not* a simple copy of a C array. – Bakuriu Jul 04 '14 at 19:22

4 Answers


The speed difference is actually greater than 3 times, but you slow down both versions by first creating a huge in-memory list of 1 million integers. Separate that out of the timing trials:

>>> import timeit
>>> def sum1(lst):
...     s = 0
...     for i in lst:
...         s += i
...     return s
... 
>>> def sum2(lst):
...     return sum(lst)
... 
>>> values = range(1000000)
>>> timeit.timeit('f(lst)', 'from __main__ import sum1 as f, values as lst', number=100)
3.457869052886963
>>> timeit.timeit('f(lst)', 'from __main__ import sum2 as f, values as lst', number=100)
0.6696369647979736

The speed difference has risen to over 5 times now.
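As an aside, the same measurement can be written without the import-string setup by handing callables to timeit directly (a minor variation of the code above, shown in Python 3 syntax):

```python
import timeit

values = list(range(1000000))

def sum1(lst):
    # Interpreted-bytecode loop
    s = 0
    for i in lst:
        s += i
    return s

# timeit also accepts callables, so no setup string is needed
print('for loop :', timeit.timeit(lambda: sum1(values), number=10))
print('built-in :', timeit.timeit(lambda: sum(values), number=10))
```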

A for loop is executed as interpreted Python bytecode. sum() loops entirely in C code. The speed difference between interpreted bytecode and C code is large.

In addition, the C code makes sure not to create new Python objects if it can keep the sum in C types instead; this works for int and float results.
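The fast path only changes speed, not behaviour: sum() produces the same result whether it keeps the running total in a C long/double (exact int/float items) or falls back to the generic object loop for anything else. A small illustration (Python 3 syntax, not from the original answer):

```python
from fractions import Fraction

print(sum([1, 2, 3]))             # 6    -- int fast path in C
print(sum([0.5, 0.25, 0.25]))     # 1.0  -- float fast path in C
print(sum([Fraction(1, 3)] * 3))  # 1    -- generic path, object additions
```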

The Python version, disassembled, does this:

>>> import dis
>>> def sum1():
...     s = 0
...     for i in range(1000000):
...         s += i
...     return s
... 
>>> dis.dis(sum1)
  2           0 LOAD_CONST               1 (0)
              3 STORE_FAST               0 (s)

  3           6 SETUP_LOOP              30 (to 39)
              9 LOAD_GLOBAL              0 (range)
             12 LOAD_CONST               2 (1000000)
             15 CALL_FUNCTION            1
             18 GET_ITER            
        >>   19 FOR_ITER                16 (to 38)
             22 STORE_FAST               1 (i)

  4          25 LOAD_FAST                0 (s)
             28 LOAD_FAST                1 (i)
             31 INPLACE_ADD         
             32 STORE_FAST               0 (s)
             35 JUMP_ABSOLUTE           19
        >>   38 POP_BLOCK           

  5     >>   39 LOAD_FAST                0 (s)
             42 RETURN_VALUE        

Apart from the interpreter loop being slower than C, the INPLACE_ADD has to create a new integer object for each result past 256 (CPython caches the small int objects from -5 through 256 as singletons).
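The small-int cache is a CPython implementation detail, but you can observe it directly; constructing the ints at runtime with int() sidesteps the compiler's constant folding that the comments below discuss:

```python
# CPython caches ints from -5 through 256 as singletons; int('...')
# forces the values to be created at runtime, bypassing constant folding.
a, b = int('256'), int('256')
print(a is b)   # True: both names point at the cached singleton

c, d = int('257'), int('257')
print(c is d)   # False: 257 is outside the cache, so two distinct objects
```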

You can see the C implementation in the Python mercurial code repository, but it explicitly states in the comments:

/* Fast addition by keeping temporary sums in C instead of new Python objects.
   Assumes all inputs are the same type.  If the assumption fails, default
   to the more general routine.
*/
Martijn Pieters
  • Unless you have 64-bit C longs, the sum will pretty quickly outgrow what the special case for ints can handle. – user2357112 Jul 04 '14 at 18:05
  • +1 for the mention of the singletons. Back when I started, I made the mistake of checking number equality with `is` and was surprised that sometimes it worked. Now though, I'm seeing that, say, `65536 is 65536` returns `True`, `1<<16 is 1<<16` returns `False`, and `1<<8 is 1<<8` returns `True`, so I guess it goes to 256? And special-cases hardcoded numbers? I'm actually more confused now... – dwanderson Jul 04 '14 at 18:06
  • @user2357112: Check the source; the loop for the integer case starts with `long i_result = PyInt_AS_LONG(result)`. The moment it overflows, it switches back to using Python `long` objects. – Martijn Pieters Jul 04 '14 at 18:07
  • @dwanderson: Python *also* creates constants in compiled bytecode for any immutable literal values you use in code (even in the interactive interpreter). That extends to many calculated constants too, but not for `<<` bitwise shifting. – Martijn Pieters Jul 04 '14 at 18:07
  • @MartijnPieters: I know. I'm saying that the int special case probably doesn't have a major impact on performance here, since we exit it about 6% of the way through the list. – user2357112 Jul 04 '14 at 18:08
  • Ahh ok, thanks! I'll need to go get more familiar with `dis`, since it seems useful to know about how the bytecode is working under the hood – dwanderson Jul 04 '14 at 18:09
  • @dwanderson: you see this in the disassembly in this answer too; `LOAD_CONST` is used for the `1000000` value for the loop. – Martijn Pieters Jul 04 '14 at 18:09
  • @user2357112: no, `sum(range(1000000))` fits well within my `sys.maxint` value. – Martijn Pieters Jul 04 '14 at 18:11
  • @user2357112: for a 32-bit platform you are correct, you hit `sys.maxint` after about 9%. – Martijn Pieters Jul 04 '14 at 18:12

As dwanderson suggested, numpy is one alternative, and indeed a good one if you want to do some maths. See this benchmark:

import numpy as np

r = range(1000000)       # 12.5 ms
s = sum(r)               # 7.9 ms

ar = np.arange(1000000)  # 0.5 ms
sn = np.sum(ar)          # 0.6 ms  ('as' is a reserved word, so use another name)

So both creating the sequence and summing it are much faster with numpy. This is mostly because numpy.array is designed for exactly this kind of numeric work and stores its data far more compactly than a list.

However, if we start from an existing Python list, numpy is far less attractive, because converting a list into a numpy.array is sluggish:

r = range(1000000)
ar = np.array(r)         # 102 ms
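If you do have to convert an existing list, np.fromiter is often noticeably quicker than np.array, since it fills a preallocated typed buffer in one pass instead of inspecting each element's type (a sketch; exact timings will vary by machine and numpy version):

```python
import numpy as np

lst = list(range(1000000))

# np.array(lst) must type-inspect every element; np.fromiter fills a
# preallocated int64 buffer directly, which is usually faster.
ar = np.fromiter(lst, dtype=np.int64, count=len(lst))
print(ar.sum())  # 499999500000
```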
DrV

However, if the loop is just summing the consecutive integers 0, 1, 2, ..., you can use the closed-form arithmetic-series formula instead. The sum should be 499999500000 for range(1000000):

import timeit

def sum1():
    s = 0
    for i in range(1000000):
        s += i
    #print s    
    return s

def sum2():
    return sum(range(1000000))

def sum3():
    s = range(1000000)
    s = ((s[1]+s[-1])/2) * (len(s)-1)
    #print(s)
    return s

print 'For Loop Sum:', timeit.timeit(sum1, number=10)
print 'Built-in Sum:', timeit.timeit(sum2, number=10)
print 'Fast Sum:', timeit.timeit(sum3, number=10)

#prints
#For Loop Sum: 1.8420711
#Built-in Sum: 1.1081646
#Fast Sum: 0.3191561
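The sum3 trick above is just the arithmetic-series formula: the sum of 0 through n-1 is n(n-1)/2. It can be checked against the built-in directly (Python 3 syntax, with floor division to stay in integers):

```python
def fast_sum(n):
    # Sum of 0 + 1 + ... + (n - 1) via the arithmetic-series formula
    return n * (n - 1) // 2

print(fast_sum(1000000))  # 499999500000
assert fast_sum(1000000) == sum(range(1000000))
```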
Arif Berg

You can see the source code in Python/bltinmodule.c. It has special cases for ints and floats, but since the sum overflows to longs pretty quickly, that probably doesn't have a major performance impact here. The general-case logic is pretty similar to what you'd write in Python, just in C. The speedup is most likely due to the fact that it doesn't have to go through all the bytecode interpreting and error handling overhead:

static PyObject*
builtin_sum(PyObject *self, PyObject *args)
{
    PyObject *seq;
    PyObject *result = NULL;
    PyObject *temp, *item, *iter;

    if (!PyArg_UnpackTuple(args, "sum", 1, 2, &seq, &result))
        return NULL;

    iter = PyObject_GetIter(seq);
    if (iter == NULL)
        return NULL;

    if (result == NULL) {
        result = PyInt_FromLong(0);
        if (result == NULL) {
            Py_DECREF(iter);
            return NULL;
        }
    } else {
        /* reject string values for 'start' parameter */
        if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
            PyErr_SetString(PyExc_TypeError,
                "sum() can't sum strings [use ''.join(seq) instead]");
            Py_DECREF(iter);
            return NULL;
        }
        Py_INCREF(result);
    }

#ifndef SLOW_SUM
    /* Fast addition by keeping temporary sums in C instead of new Python objects.
       Assumes all inputs are the same type.  If the assumption fails, default
       to the more general routine.
    */
    if (PyInt_CheckExact(result)) {
        long i_result = PyInt_AS_LONG(result);
        Py_DECREF(result);
        result = NULL;
        while(result == NULL) {
            item = PyIter_Next(iter);
            if (item == NULL) {
                Py_DECREF(iter);
                if (PyErr_Occurred())
                    return NULL;
                return PyInt_FromLong(i_result);
            }
            if (PyInt_CheckExact(item)) {
                long b = PyInt_AS_LONG(item);
                long x = i_result + b;
                if ((x^i_result) >= 0 || (x^b) >= 0) {
                    i_result = x;
                    Py_DECREF(item);
                    continue;
                }
            }
            /* Either overflowed or is not an int. Restore real objects and process normally */
            result = PyInt_FromLong(i_result);
            temp = PyNumber_Add(result, item);
            Py_DECREF(result);
            Py_DECREF(item);
            result = temp;
            if (result == NULL) {
                Py_DECREF(iter);
                return NULL;
            }
        }
    }

    if (PyFloat_CheckExact(result)) {
        double f_result = PyFloat_AS_DOUBLE(result);
        Py_DECREF(result);
        result = NULL;
        while(result == NULL) {
            item = PyIter_Next(iter);
            if (item == NULL) {
                Py_DECREF(iter);
                if (PyErr_Occurred())
                    return NULL;
                return PyFloat_FromDouble(f_result);
            }
            if (PyFloat_CheckExact(item)) {
                PyFPE_START_PROTECT("add", Py_DECREF(item); Py_DECREF(iter); return 0)
                f_result += PyFloat_AS_DOUBLE(item);
                PyFPE_END_PROTECT(f_result)
                Py_DECREF(item);
                continue;
            }
            if (PyInt_CheckExact(item)) {
                PyFPE_START_PROTECT("add", Py_DECREF(item); Py_DECREF(iter); return 0)
                f_result += (double)PyInt_AS_LONG(item);
                PyFPE_END_PROTECT(f_result)
                Py_DECREF(item);
                continue;
            }
            result = PyFloat_FromDouble(f_result);
            temp = PyNumber_Add(result, item);
            Py_DECREF(result);
            Py_DECREF(item);
            result = temp;
            if (result == NULL) {
                Py_DECREF(iter);
                return NULL;
            }
        }
    }
#endif

    for(;;) {
        item = PyIter_Next(iter);
        if (item == NULL) {
            /* error, or end-of-sequence */
            if (PyErr_Occurred()) {
                Py_DECREF(result);
                result = NULL;
            }
            break;
        }
        /* It's tempting to use PyNumber_InPlaceAdd instead of
           PyNumber_Add here, to avoid quadratic running time
           when doing 'sum(list_of_lists, [])'.  However, this
           would produce a change in behaviour: a snippet like

             empty = []
             sum([[x] for x in range(10)], empty)

           would change the value of empty. */
        temp = PyNumber_Add(result, item);
        Py_DECREF(result);
        Py_DECREF(item);
        result = temp;
        if (result == NULL)
            break;
    }
    Py_DECREF(iter);
    return result;
}
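For reference, the general-case loop at the bottom behaves essentially like the following pure-Python sketch (my paraphrase, not CPython source). Note that it uses plain + rather than +=, matching the PyNumber_Add comment above, so a mutable start value is never modified in place:

```python
def py_sum(iterable, start=0):
    # Mirrors builtin_sum()'s general path: reject str starts, then
    # repeatedly apply plain addition (never in-place add).
    if isinstance(start, str):
        raise TypeError("sum() can't sum strings [use ''.join(seq) instead]")
    result = start
    for item in iterable:
        result = result + item
    return result

empty = []
print(py_sum([[x] for x in range(3)], empty))  # [0, 1, 2]
print(empty)                                   # still [] -- not mutated
```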
user2357112