Optimizing performance of list comprehension using all indexes in own functions

Question

I have a multi-dimensional matrix (6D) that I need to iterate over in order to make a new 6D matrix. Right now I use list comprehensions to keep the code as clean as possible, but it is really slow. I was hoping there was some built-in numpy function to help me out, but because of the custom functions used inside the comprehensions it is hard to find one.

I already tried np.fromiter, but this raises an error because I use a multidimensional list. World.allReachableCoords(x1, y1, len(Q1), len(Q1[0]), world) returns all surrounding coordinates ({(x1, y1), (x1 + 1, y1), (x1, y1 + 1) ...}) and world.amountOfPossibleActions just returns 5.

The algorithm starts with

Q1 = np.zeros((heightWorld, widthWorld, heightWorld, widthWorld, world.amountOfPossibleActions,
               world.amountOfPossibleActions))

and then iterates the process below several times.

Q1 = np.array([[[[[[sum(
        world.joinedTransition((x1, y1), sf1, (x2, y2), sf2, action1, action2) *
        (world.joinedU((x1, y1), sf1, (x2, y2), sf2, action1, action2, player) +
         world.joinedU((x1, y1), sf1, (x2, y2), sf2, action2, action1, otherPlayer) +
         gamma * np.amax(Q1[sf1[1]][sf1[0]][sf2[1]][sf2[0]]))
        for sf1 in World.allReachableCoords(x1, y1, len(Q1), len(Q1[0]), world)
        for sf2 in World.allReachableCoords(x2, y2, len(Q1), len(Q1[0]), world)
    )
        for action1 in range(world.amountOfPossibleActions)]
        for action2 in range(world.amountOfPossibleActions)]
        for x1 in range(widthWorld)] for y1 in range(heightWorld)]
        for x2 in range(widthWorld)] for y2 in range(heightWorld)])

where the joined transition is mostly a string of if-statements:

# Transition function: Returns 0 if the final state is out of bounds, impassable terrain or too far from the
# initial state. If the given action explains the relation between si and sf return 1, otherwise 0.
def standardTransition(self, si, sf, a):
    # Valid indices run from 0 to len - 1, so the upper bound must be strict.
    if not (0 <= sf[0] < len(self.grid[0]) and 0 <= sf[1] < len(self.grid)):
        return 0
    if not (0 <= si[0] < len(self.grid[0]) and 0 <= si[1] < len(self.grid)):
        return 0
    if self.grid[sf[1]][sf[0]] == self.i or self.grid[si[1]][si[0]] == self.i:
        return 0
    if abs(si[0] - sf[0]) > 1 or abs(si[1] - sf[1]) > 1:
        return 0
    return {
        0: 1 if sf[0] == si[0] and sf[1] == si[1] else 0,  # Stay
        1: 1 if sf[0] == si[0] and sf[1] == si[1] + 1 else 0,  # Down
        2: 1 if sf[0] == si[0] and sf[1] == si[1] - 1 else 0,  # Up
        3: 1 if sf[0] == si[0] - 1 and sf[1] == si[1] else 0,  # Left
        4: 1 if sf[0] == si[0] + 1 and sf[1] == si[1] else 0  # Right
    }[a]

def joinedTransition(self, si1, sf1, si2, sf2, a1, a2):
    if sf1 == sf2: return 0  # Ending in the same square is impossible.
    if si1 == sf2 and si2 == sf1: return 0  # Going through each other is impossible.
    # Fighting for the same square.
    if si1 == sf1 and performAction(si1, a1) == sf2:  # Player 1 loses the fight
        return (self.standardTransition(si1, sf2, a1) * self.standardTransition(si2, sf2, a2)
                * self.chanceForPlayer1ToWinDuel)
    if si2 == sf2 and performAction(si2, a2) == sf1:  # Player 2 loses the fight
        return (self.standardTransition(si1, sf1, a1) * self.standardTransition(si2, sf1, a2)
                * (1 - self.chanceForPlayer1ToWinDuel))
    return self.standardTransition(si1, sf1, a1) * self.standardTransition(si2, sf2, a2)

and allReachableCoords is, as described above:

def allReachableCoords(x1, y1, height, width, world):
    li = {(x2, y1) for x2 in range(x1 - 1, x1 + 2)}.union({(x1, y2) for y2 in range(y1 - 1, y1 + 2)})
    li = list(filter(lambda r: 0 <= r[0] < width and 0 <= r[1] < height, li))
    return list(filter(lambda r: world.grid[r[1]][r[0]] != world.i, li))

Are there any ways to improve performance? I suppose the solution is numpy, but other solutions are also welcome. I was also wondering if this is something that can be done more elegantly and efficiently in tensorflow.

To get better feedback, consider including example input, example output, and sample code (including runnable sample definitions of `f1` and `f2` and `World.allReachableCoords`). Based on your example, it seems likely that performance can be improved; however, it's hard to say anything useful without knowing the details. — hilberts_drinking_problem, Aug 23 '19 at 11:16
By the way, there is a moderate limit to cases when list comprehensions make code more "clean", and I fear that you may have crossed that line here. Just my opinion :) — hilberts_drinking_problem, Aug 23 '19 at 11:26
Yes, I am aware, but the 6 dimensions need to be traversed, so it is either this or for loops, as long as there is no more elegant way to do the inner operations. — Emiel Lanckriet, Aug 23 '19 at 12:05

score 2 · Accepted Answer · answered Sep 01 '19 at 23:07

Measure: cProfile / line_profiler

The first step in speeding up a program should always be to measure: where exactly is your time being spent? There will always be things that can be faster/neater, but if speed is your main concern, you want to tackle the slowest parts of your code first.

To start, you can always use the default profiler cProfile that comes with Python. For a slightly more detailed view per line of code, I would recommend looking at line_profiler. Although the setup is a bit more involved, it can give you better results if the time is mostly spent in operations rather than functions.
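
As a minimal sketch of what such a measurement could look like with cProfile: slow_update below is a hypothetical stand-in for one sweep of your Q1 update, not your actual code, so swap in your own function.

```python
import cProfile
import io
import pstats


def slow_update(n):
    # Hypothetical stand-in for one sweep of the Q1 update.
    total = 0
    for _ in range(n):
        total += sum(j * j for j in range(50))
    return total


profiler = cProfile.Profile()
profiler.enable()
result = slow_update(2000)
profiler.disable()

# Sort by cumulative time and keep the ten most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

The entries at the top of the report are the ones worth optimizing first.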

Timeit experiments

Even without profiling results for your code, there are still some things I noticed. After running a bunch of small experiments with Python's built-in timeit module, here are a few explicit suggestions to make your code faster, cleaner or both.

Numpy Indexing

An initial way to improve performance might just be to change your indexing. When you index an array in numpy, it seems to return an intermediate object. So each new set of square brackets is a new __getitem__ function call, with all associated overhead. This means your Q1[v][w][x][y] is (sort-of) translated as

Q1.__getitem__(v).__getitem__(w).__getitem__(x).__getitem__(y)

Numpy natively supports indexing by tuple, which you can use without explicitly making a tuple:

Q1[v][w][x][y]  # This is slow
Q1[(v,w,x,y)]   # This is faster
Q1[v,w,x,y]     # This does the same thing

By making use of the latter, you can already save about half of the time it takes to index the item you're looking for.

$ python -m timeit -s "import numpy as np; Q1 = np.empty((9,9,9,9,9,9)); sf=(3,4)" "q = Q1[sf[0]][sf[1]][sf[0]][sf[1]]"
1000000 loops, best of 3: 0.651 usec per loop

$ python -m timeit -s "import numpy as np; Q1 = np.empty((9,9,9,9,9,9)); sf=(3,4)" "q = Q1[sf[0],sf[1],sf[0],sf[1]]"
1000000 loops, best of 3: 0.298 usec per loop

In your code, that means replacing np.amax(Q1[sf1[1]][sf1[0]][sf2[1]][sf2[0]]) with np.amax(Q1[sf1[1], sf1[0], sf2[1], sf2[0]]).

Additionally, you can unpack your sf variables in the for loop instead (for sf1_0, sf1_1 in ...), shaving off another bit of time:

$ python -m timeit -s "import numpy as np; Q1 = np.empty((9,9,9,9,9,9)); sf_0, sf_1=(3,4)" "q = Q1[sf_0,sf_1,sf_0,sf_1]"
1000000 loops, best of 3: 0.216 usec per loop

giving np.amax(Q1[sf1_1, sf1_0, sf2_1, sf2_0]), which I think is a bit cleaner too :)
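
If you want to convince yourself that the rewrite does not change the result, a quick check along these lines should do. The array shape and the reachable list are made-up stand-ins for your Q1 and for the output of World.allReachableCoords.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small stand-in for Q1: (height, width, height, width, actions, actions).
Q1 = rng.random((4, 5, 4, 5, 5, 5))

# Hypothetical reachable coordinates, stored as (x, y) pairs.
reachable = [(1, 2), (2, 2), (1, 3)]

# Original style: chained indexing on the sf tuples.
chained = [np.amax(Q1[sf1[1]][sf1[0]][sf2[1]][sf2[0]])
           for sf1 in reachable
           for sf2 in reachable]

# Suggested style: unpack in the for clause, index with one tuple.
tupled = [np.amax(Q1[sf1_1, sf1_0, sf2_1, sf2_0])
          for sf1_0, sf1_1 in reachable
          for sf2_0, sf2_1 in reachable]

same = np.array_equal(chained, tupled)
print(same)
```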

Iterating: itertools.product

You're currently manually/explicitly looping over ranges, but you only actually have a computation in the innermost loop. This means that in the five loops outside of it, all you're doing is creating range objects and exhausting them. Performance-wise, this is not a large bottleneck, but it's not very clean. Luckily, the built-in itertools library has the product tool for exactly these kinds of tasks:

# 6 nested loops
$ python -m timeit -n 100 -s "a=0" "for u in range(10):" \
                                   "    for v in range(10):" \
                                   "        for w in range(10):" \
                                   "            for x in range(10):" \
                                   "                for y in range(10):" \
                                   "                    for z in range(10):" \
                                   "                        a+=1"
100 loops, best of 3: 57.3 msec per loop

# itertools.product
$ python -m timeit -n 100 -s "from itertools import product; a=0" \
                             "for u,v,w,x,y,z in product(range(10),range(10),range(10),range(10),range(10),range(10)):" \
                             "    a+=1"
100 loops, best of 3: 55.6 msec per loop

In this example it saves 5(!) levels of nesting, while even being marginally faster. It can be a bit faster still if you build the product() just once up front and re-use it, since you're iterating over the same index ranges every time. Just make sure to explicitly take a list() of it: product() returns an iterator that is exhausted after a single pass, so a second loop over the same object would silently do nothing.

# Explicitly storing the resulting `product()` up front
$ python -m timeit -n 100 -s "from itertools import product; a=0; it=list(product(range(10),range(10),range(10),range(10),range(10),range(10)))" \
                             "for u,v,w,x,y,z in it:" \
                             "    a+=1"
100 loops, best of 3: 52.8 msec per loop
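
The difference between the bare product() object and a stored list can be seen in a small sketch like the one below. The dimensions are arbitrary placeholders, not the sizes of your world.

```python
from itertools import product

# Placeholder sizes, chosen only to keep the example small.
height, width, n_actions = 2, 3, 5

# A bare product() is a one-shot iterator.
lazy = product(range(height), range(width), range(n_actions))
first_pass = sum(1 for _ in lazy)
second_pass = sum(1 for _ in lazy)  # nothing left to iterate over

# Materialising it once gives a list that can be re-used on every sweep.
indices = list(product(range(height), range(width), range(n_actions)))
counts = [sum(1 for _ in indices) for _ in range(2)]

print(first_pass, second_pass, counts)
```

The list costs memory proportional to the number of index combinations, which for a 6D index space can be sizeable, so it is worth checking that it fits before committing to it.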

Caching

In your inner loop you also call a bunch of methods of your World object. If the results of those methods do not depend on Q1 at all, you're definitely recalculating the same thing a couple of times. You could then trade computation time for memory: pre-calculate all values once and store them in another numpy array. An array look-up is almost guaranteed to be (much) faster than a function call with computations.
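
A rough sketch of that idea, under the assumption that the function being cached depends only on the indices: transition below is a simplified, hypothetical stand-in for world.standardTransition (no impassable terrain), and HEIGHT, WIDTH, N_ACTIONS and MOVES are made-up names for illustration.

```python
from itertools import product

import numpy as np

HEIGHT, WIDTH, N_ACTIONS = 3, 4, 5
# (dx, dy) per action, in the question's order: stay, down, up, left, right.
MOVES = {0: (0, 0), 1: (0, 1), 2: (0, -1), 3: (-1, 0), 4: (1, 0)}


def transition(si, sf, a):
    # Hypothetical simplified stand-in for world.standardTransition.
    dx, dy = MOVES[a]
    in_bounds = 0 <= sf[0] < WIDTH and 0 <= sf[1] < HEIGHT
    return 1.0 if in_bounds and sf == (si[0] + dx, si[1] + dy) else 0.0


# Fill the table once, indexed as [y_i, x_i, y_f, x_f, action].
T = np.zeros((HEIGHT, WIDTH, HEIGHT, WIDTH, N_ACTIONS))
for yi, xi, yf, xf, a in product(range(HEIGHT), range(WIDTH),
                                 range(HEIGHT), range(WIDTH),
                                 range(N_ACTIONS)):
    T[yi, xi, yf, xf, a] = transition((xi, yi), (xf, yf), a)

# Inside the repeated update, a lookup then replaces the function call:
# from (x=2, y=1) to (x=3, y=1) with action 4 (right).
lookup = T[1, 2, 1, 3, 4]
print(lookup)
```

The table is only valid as long as the cached function does not read Q1 or anything else that changes between iterations.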

To decide where to do this first, you should refer to the results of your profiling efforts ;)