
I wrote a simple script that tests the speed, and this is what I found out. The for loop was actually the fastest in my case, which really surprised me; check out the results below (I was calculating the sum of squares). Is that because it holds the list in memory, or is that intended? Can anyone explain this?

from functools import reduce
import datetime


def time_it(func, numbers, *args):
    # Run func `numbers` times on args[0] and print the elapsed time.
    start_t = datetime.datetime.now()
    for i in range(numbers):
        func(args[0])
    print(datetime.datetime.now() - start_t)

def square_sum1(numbers):
    return reduce(lambda sum, next: sum+next**2, numbers, 0)


def square_sum2(numbers):
    a = 0
    for i in numbers:
        i = i**2
        a += i
    return a

def square_sum3(numbers):
    sqrt = lambda x: x**2
    return sum(map(sqrt, numbers))

def square_sum4(numbers):
    return(sum([i**2 for i in numbers]))


time_it(square_sum1, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum2, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum3, 100000, [1, 2, 5, 3, 1, 2, 5, 3])
time_it(square_sum4, 100000, [1, 2, 5, 3, 1, 2, 5, 3])

0:00:00.302000 #Reduce
0:00:00.144000 #For loop
0:00:00.318000 #Map
0:00:00.290000 #List comprehension

Update: when I tried longer loops, these are the results.

time_it(square_sum1, 100, range(1000))
time_it(square_sum2, 100, range(1000))
time_it(square_sum3, 100, range(1000))
time_it(square_sum4, 100, range(1000))

0:00:00.068992 #Reduce
0:00:00.062955 #For loop
0:00:00.069022 #Map
0:00:00.057446 #List comprehension
alphiii
  • BTW, that's an unusual way to measure execution speed. It's more common to use `time.time`. But it's even better to use the `timeit` module. You should try `sum` with a generator expression instead of a list comprehension: `sum(i**2 for i in numbers)`; also try `sum(i*i for i in numbers)`. – PM 2Ring Apr 28 '17 at 10:23
  • Nice question; I like that you really put all the relevant information into it, in such a concise way! – GhostCat Apr 28 '17 at 11:41

2 Answers


Python function calls have overhead that makes them relatively slow, so code that uses a simple expression will always be faster than code that wraps that expression in a function; it doesn't matter whether it's a normal def function or a lambda. For that reason, it's best to avoid map or reduce when you'd be passing them a Python function, if you can do the equivalent job with a plain expression in a for loop, a comprehension, or a generator expression.
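
For instance, here's a minimal sketch of how that per-element call cost shows up when you compare map with a lambda against a plain generator expression (the data size and loop count are just for illustration; actual timings depend on your machine and Python version):

from timeit import timeit

data = list(range(1000))

# Plain expression in a generator expression: no extra Python-level call per element.
t_genexp = timeit('sum(i * i for i in data)', globals=globals(), number=10000)

# Same work, but map() calls a Python lambda once per element.
t_map = timeit('sum(map(lambda i: i * i, data))', globals=globals(), number=10000)

print('genexp:', t_genexp, 'map+lambda:', t_map)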


There are a couple of minor optimizations that will speed up some of your functions. Don't make unnecessary assignments. E.g.,

def square_sum2a(numbers):
    a = 0
    for i in numbers:
        a += i ** 2
    return a

Also, i * i is quite a bit faster than i ** 2 because multiplication is faster than exponentiation.
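
A quick way to see this for yourself is a rough timeit comparison like the following (exact numbers will vary):

from timeit import timeit

print(timeit('i * i', setup='i = 7'))   # multiplication
print(timeit('i ** 2', setup='i = 7'))  # exponentiation; noticeably slower for small ints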

As I mentioned in the comments, it's more efficient to pass sum a generator expression than a list comprehension, especially if the loop is large; it probably won't make a difference with a small list of length 8, but it will be quite noticeable with large lists.

sum(i*i for i in numbers)

As Kelly Bundy mentions in the comments, the generator expression version isn't actually faster than the equivalent list comprehension. Generator expressions are more efficient than list comps in terms of RAM use, but they're not necessarily faster. And when the sequence length is small, the RAM usage differences are negligible, although there is also the time required to allocate & free the RAM used.

I just ran a few tests, with a larger data list. The list comp is still the winner (usually), but the speed differences are generally around 5-10%.
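
If you want to reproduce that comparison, a sketch along these lines works (the data size and repetition count are arbitrary):

from timeit import timeit

data = list(range(1000000))

t_listcomp = timeit('sum([i * i for i in data])', globals=globals(), number=10)
t_genexp = timeit('sum(i * i for i in data)', globals=globals(), number=10)

print('list comp:', t_listcomp, 'genexp:', t_genexp)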

BTW, you shouldn't use sum or next as variable names as that masks the built-in functions with the same names. It won't hurt anything here, but it's still not a good idea, and it makes your code look odd in an editor with more comprehensive syntax highlighting than the SO syntax highlighter.
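
Here's a tiny illustration of the shadowing problem:

sum = 0
for i in [1, 2, 3]:
    sum += i * i
# The name `sum` is now bound to the int 14, so the built-in is hidden:
# sum([1, 2, 3])  ->  TypeError: 'int' object is not callable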


Here's a new version of your code that uses the timeit module. It does 3 repetitions of 10,000 loops each and sorts the results. As explained in the timeit docs, the important figure to look at in the series of the repetitions is the minimum one.

In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in.

from timeit import Timer
from functools import reduce

def square_sum1(numbers):
    return reduce(lambda total, u: total + u**2, numbers, 0)

def square_sum1a(numbers):
    return reduce(lambda total, u: total + u*u, numbers, 0)

def square_sum2(numbers):
    a = 0
    for i in numbers:
        i = i**2
        a += i
    return a

def square_sum2a(numbers):
    a = 0
    for i in numbers:
        a += i * i
    return a

def square_sum3(numbers):
    sqr = lambda x: x**2
    return sum(map(sqr, numbers))

def square_sum3a(numbers):
    sqr = lambda x: x*x
    return sum(map(sqr, numbers))

def square_sum4(numbers):
    return(sum([i**2 for i in numbers]))

def square_sum4a(numbers):
    return(sum(i*i for i in numbers))

funcs = (
    square_sum1,
    square_sum1a,
    square_sum2,
    square_sum2a,
    square_sum3,
    square_sum3a,
    square_sum4,
    square_sum4a,
)

data = [1, 2, 5, 3, 1, 2, 5, 3]

def time_test(loops, reps):
    ''' Print timing stats for all the functions '''
    timings = []
    for func in funcs:
        fname = func.__name__
        setup = 'from __main__ import data, ' + fname
        cmd = fname + '(data)'
        t = Timer(cmd, setup)
        result = t.repeat(reps, loops)
        result.sort()
        timings.append((result, fname))

    timings.sort()
    for result, fname in timings:
        print('{0:14} {1}'.format(fname, result))

loops, reps = 10000, 3
time_test(loops, reps)

output

square_sum2a   [0.03815755599862314, 0.03817843700016965, 0.038571521999983815]
square_sum4a   [0.06384095800240175, 0.06462285799716483, 0.06579178199899616]
square_sum3a   [0.07395686000018031, 0.07405958899835241, 0.07463337299850537]
square_sum1a   [0.07867341000019223, 0.0788448769999377, 0.07908406700153137]
square_sum2    [0.08781023399933474, 0.08803317899946705, 0.08846573399932822]
square_sum4    [0.10260082300010254, 0.10360279499946046, 0.10415067900248687]
square_sum3    [0.12363515399920288, 0.12434166299863136, 0.1273790529994585]
square_sum1    [0.1276186039976892, 0.13786184099808452, 0.16315817699796753]

The results were obtained on an old single core 32 bit 2GHz machine running Python 3.6.0 on Linux.

PM 2Ring
  • Yes, what you wrote makes sense; also, `sum(i*i for i in numbers)` speeds things up a lot (about 300%) – alphiii Apr 28 '17 at 10:39
  • @alphiii As you can see from my updated version `timeit` makes the speed differences more obvious. A major reason for that is that `timeit` disables garbage collection while it's doing the timing tests, so the times don't get masked by the time taken to do that garbage collection. – PM 2Ring Apr 28 '17 at 10:56
  • I think you're mistaken about sum(genexp) being faster than sum(listcomp). Your own benchmark even indicates the opposite. Your 1-vs-1a and 3-vs-3a both show that using `i*i` was 0.049 seconds faster than using `i**2`. If you subtract that from the sum(listcomp) solution's time, you get 0.0536 seconds, which is **faster** than the 0.0638 seconds of the sum(genexp) solution. – Kelly Bundy Oct 07 '22 at 04:33
  • Thanks. What do you mean with `timeit` "ignoring" the time to allocate & free the RAM used? It sounds like you mean it's not included in the reported time, but I don't think that's true. Do you mean the disabled garbage collection? – Kelly Bundy Oct 07 '22 at 05:53
  • FWIW, https://docs.python.org/3/c-api/memory.html has details of how Python manages memory, but it's a fairly technical article, primarily written for C programmers. – PM 2Ring Oct 07 '22 at 06:09
  • Yeah, when using `timeit` I usually keep `gc` disabled, unless the code I'm measuring creates/deletes a lot of lists/dicts/... (so that the `gc` cost *is* largely caused by the code I'm measuring and should be included). In this case of sum(listcomp), only a single list is created and deleted (by recount, not by `gc`), so if I'm not mistaken, it costs practically no `gc` time (only registering/unregistering it for `gc` once) and "ignoring" *that* makes no difference. Allocation/freeing might take more time, but that *is* included. In either case, that sentence doesn't seem quite right. – Kelly Bundy Oct 07 '22 at 06:21
  • 1
    Ok looks good now :-). One last thought: I guess at some large size, sum(listcomp) must indeed become slower than sum(genexp), either already when the data (the list and its elements) becomes too large for the caches or at least when it becomes too large for memory and gets swapped to drive. But from my experience, usually benchmarking with a million or so small elements, listcomps tend to be a little faster than genexps. – Kelly Bundy Oct 07 '22 at 06:31
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/248622/discussion-between-pm-2ring-and-kelly-bundy). – PM 2Ring Oct 07 '22 at 06:34

This is almost independent of the underlying programming language: abstractions do not come for free.

Meaning: there is always a certain cost to calling a function. A stack frame needs to be set up; control flow needs to "jump". And when you think of lower levels, such as the CPU, the code for that function probably needs to be loaded, and so on.

In other words: when your primary requirement is hard-core number crunching, you have to balance ease of use against the cost of the corresponding abstractions.
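
As a rough illustration of that per-call cost in CPython (a sketch; the exact ratio varies by machine and version):

from timeit import timeit

# The same expression, inline vs. wrapped in a function call.
t_inline = timeit('x * x', setup='x = 7')
t_call = timeit('square(x)', setup='x = 7\nsquare = lambda x: x * x')

print(t_inline, t_call)  # the wrapped version is typically a few times slower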

Beyond that: if you are focused on speed, you should look beyond Python, or at least beyond "ordinary" Python. Instead, you could turn to modules such as numpy.
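
For example, with numpy the whole computation runs in compiled code instead of the Python interpreter (a sketch; assumes numpy is installed):

import numpy as np

numbers = np.arange(1000)
square_sum = int(np.dot(numbers, numbers))  # equivalent to (numbers ** 2).sum()
print(square_sum)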

GhostCat