
Say you have to carry out a computation using two or even three nested loops. Intuitively, one might think it's more efficient to do this with a single loop. I tried a simple Python example:

import timeit

def case1(n):
    c = 0
    for i in range(n):
        c += 1
    return c

def case2(n):
    c = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c += 1
    return c

print(case1(1000))
print(case2(10))

if __name__ == '__main__':

    print(timeit.timeit("case1(1000)", setup="from __main__ import case1", number=10000))

    print(timeit.timeit("case2(10)", setup="from __main__ import case2", number=10000))

This code runs:

$ python3 code.py 
1000
1000
0.8281264099932741
1.04944919400441

So a single loop does seem to be slightly more efficient. However, my actual problem is a bit different, as I need to use the values in an array (in the following example I use the function range for simplicity). That is, if I collapse everything into a single loop, I have to create an extended array from the values of another array whose size is between 2 and 10 elements.

import itertools
import timeit

def case1(n):

    b = [i * j * k for i, j, k in itertools.product(range(n), repeat=3)]
    c = 0
    for i in range(len(b)):
        c += b[i]
    return c

def case2(n):

    c = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c += i*j*k
    return c

print(case1(10))
print(case2(10))

if __name__ == '__main__':

    print(timeit.timeit("case1(10)", setup="from __main__ import case1", number=10000))

    print(timeit.timeit("case2(10)", setup="from __main__ import case2", number=10000))

On my computer this code runs in:

$ python3 code.py 
91125
91125
2.435348572995281
1.6435037050105166

So it seems the three nested loops are more efficient, because in case1 I spend some time creating the array b. I'm not sure I'm creating that array in the most efficient way, but leaving that aside: does it really pay off to collapse loops into a single one? I'm using Python here, but what about compiled languages like C++? Does the compiler do something to optimize the single loop, or, on the other hand, does it apply some optimization when you have multiple nested loops?

aaragon
  • In the second example, the first version is a handcrafted, questionable optimization that makes the code more complex and harder to optimize for both the compiler and the CPU. It also uses more memory. –  May 01 '15 at 10:05
  • Why not `c = sum(i * j * k for i, j, k in itertools.product(range(n), repeat=3))`? – jonrsharpe May 01 '15 at 10:14
  • @jonrsharpe I can't do that because the code I showed is just for showing the problem. In the real application I do some other stuff (linear algebra) inside the loop that uses the result of that array. – aaragon May 01 '15 at 10:21
  • @aaragon so you want us to try to micro-optimise an unseen algorithm? That's unlikely to be very productive. I would suggest you implement, test and profile to find the bottlenecks. – jonrsharpe May 01 '15 at 10:23
  • @jonrsharpe, the goal of my post is to try to understand what really happens when you deal with either 1 or more loops, not to end up having an optimized code. I wrote a Python example that showed me what I expected intuitively. Yet, I would like to know what happens in compiled code, and the question may be too obvious to address for some people. – aaragon May 01 '15 at 10:32
  • You can look at the bytecode, if you like: https://docs.python.org/2/library/dis.html – jonrsharpe May 01 '15 at 10:52
  • @jonrsharpe I took your first comment into consideration to build the array but it doesn't improve on the running time (as you see in the edited post). – aaragon May 01 '15 at 15:00
  • You do realize that `itertools.product()` is basically just nested for-loops under the hood, right? – Kevin May 01 '15 at 15:07
  • If you have to perform 1000 computations, it's no more or less efficient whether you use a single 1000-iteration loop, two nested loops of 10 and 100 iterations, or three nested loops of 10 iterations each. Indeed, various types of computations lend themselves more readily to any one of those scenarios (or even others). What you should focus on is how you can reduce the total number of computations needed, not so much on the way they're structured/organized (at least not initially - leave the cache-line considerations, etc. for later optimization). – twalberg May 01 '15 at 16:00

2 Answers


This line is why the single-loop function takes longer than it should:

b = [i * j * k for i, j, k in itertools.product(range(n), repeat=3)]

Just by changing the whole function to

def case1(n, b):
    c = 0
    for i in range(len(b)):
        c += b[i]
    return c

makes timeit return:

case1 : 0.965343249744
case2 : 2.28501694207
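
A minimal sketch of how those numbers might be reproduced, with b precomputed once outside the timed call so array construction is excluded from the measurement (absolute timings will of course vary by machine):

```python
import itertools
import timeit

def case1(n, b):
    # single loop over a precomputed array
    c = 0
    for i in range(len(b)):
        c += b[i]
    return c

def case2(n):
    # three nested loops, no intermediate array
    c = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c += i * j * k
    return c

n = 10
# build b once, outside the timed region
b = [i * j * k for i, j, k in itertools.product(range(n), repeat=3)]

print(timeit.timeit(lambda: case1(n, b), number=10000))
print(timeit.timeit(lambda: case2(n), number=10000))
```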
Endzior

Your case is simple enough that various optimizations would probably help a lot: numpy for more efficient arrays, PyPy for a better JIT optimizer, or various other things.

Looking at the bytecode via the dis module can help you understand what happens under the hood and make some micro-optimizations. In general, though, it doesn't matter much whether you use one loop or nested loops, as long as your memory access pattern is predictable for the CPU. If it isn't, performance may differ wildly.
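
For example, a minimal sketch of inspecting a loop's bytecode (the exact opcodes printed depend on your Python version):

```python
import dis

def loop(n):
    c = 0
    for i in range(n):
        c += 1
    return c

# prints the bytecode of the function; look for FOR_ITER,
# which drives the loop, and the add executed on every iteration
dis.dis(loop)
```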

Python has some bytecodes that are cheap and others that are more expensive; e.g. function calls are much more expensive than a simple addition, and the same goes for creating new objects. So the usual optimization is moving the loop to C, which is one of the benefits of itertools.
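
As a sketch of what "moving the loop to C" can look like for this problem, the three Python-level loops can be replaced by itertools.product (which generates the index triples in C) plus the built-in sum:

```python
import itertools

def case_product(n):
    # itertools.product yields the (i, j, k) triples from C code,
    # so only the multiplication runs at the Python level
    return sum(i * j * k for i, j, k in itertools.product(range(n), repeat=3))

print(case_product(10))  # same result as the nested loops: 91125
```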

Once you are at the C level, it usually comes down to: avoid syscalls and malloc() in tight loops, keep memory access patterns predictable, and make sure your algorithm is cache friendly.

So, your algorithms above will probably vary wildly in performance if you go to large values of N, due to the amount of memory allocation and cache access.

But the fastest way to solve the specific problem above is to find a closed form for the function; it seems wasteful to iterate when there must be a much simpler formula for the final value of c. As usual, first get the best algorithm before doing micro-optimizations.

e.g. Wolfram Alpha gives you a closed form that replaces the two inner loops; there is probably a closed form for all three as well, but Alpha didn't give me one:

def case3(n):
    c = 0
    for j in range(n):
        # closed form of the two inner loops over i and k:
        # sum(range(n)) == n*(n-1)//2, so sum(i*k) == (n*(n-1)//2)**2
        c += j * n**2 * (n - 1)**2 // 4
    return c
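
In fact the triple sum factorises completely: the three indices are independent, so the sum of i*j*k over all triples is (sum of range(n))**3, with sum(range(n)) == n*(n-1)//2. A minimal check:

```python
def case_closed(n):
    # the triple sum factorises: sum(i*j*k) == (sum(i))**3,
    # and sum(range(n)) == n*(n-1)//2
    s = n * (n - 1) // 2
    return s ** 3

print(case_closed(10))  # 91125, matching case1 and case2 above
```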
schlenk