1

I made a random walk generator with python and numpy. Given an adjency matrix, I want to sample random paths from each node. To do so, I currently made this class, which takes as input the neighbors_nodes of each node of the graph with their corresponding probabilities:

import numpy as np

class RandomChoice(object):
    def __init__(self, neighbors_choices, neighbors_prob, depth=50):
        C = len(neighbors_choices)
        self.depth = depth
        self.neighbors_choices = neighbors_choices
        self.neighbors_prob = neighbors_prob
        self.index = np.zeros(C, np.uint32)
        self.choices = list()
        for i in range(C):
            self.choices.append(np.random.choice(self.neighbors_choices[i], size=self.depth, p=self.neighbors_prob[i]))

    def __getitem__(self, arg):
        if self.index[arg] == self.depth:
            self.choices[arg] = np.random.choice(self.neighbors_choices[arg], size=self.depth, p=self.neighbors_prob[arg])
            self.index[arg] = 0
        val = self.choices[arg][self.index[arg]]
        self.index[arg] += 1
        return val

And I use it like this:

# Example with a 3x3 matrix
#  1 2 1
#  3 0 1
#  0 1 0

number_of_walks_per_node = 5
number_of_nodes = 3
length_of_walks = 10

choices = [
            [0, 1, 2],
            [0, 2],
            [1]
          ]
probs =   [
            [0.25, 0.5, 0.25],
            [0.75, 0.25],
            [1]
          ]
randomChoice = RandomChoice(choices, probs, depth=50)
for i in range(number_of_walks_per_node):
    for starting_node in range(number_of_nodes):
        walker_positions = [starting_node]
        for j in range(length_of_walks):
            walker_positions.append( randomChoice[walker_positions[j]])
        print(walker_positions)

The idea here is to take profit of the vector-efficiency of numpy.random.choices against some space in RAM. But this function is still the bottleneck of this code. I think numpy.random.choices spends time checking that the probabilities sum to 1 and that each one of them is greater than 0. Do you have any idea how I could speed up this code ?

Robin
  • 1,531
  • 1
  • 15
  • 35
  • `I think numpy.random.choices spends time checking that the probabilities sum to 1 and that each one of them is greater than 0` doesn't it just pick a random element from a list? The list need not contain a pdf so I don't think it does any of that checking. – Dan Mar 28 '18 at 12:39
  • I don't know if this is up to date, but check here: https://github.com/numpy/numpy/issues/4188 – Robin Mar 28 '18 at 12:46
  • @debzsub it's not possible to run your code. Please could you upload a sample that follows these guidelines https://stackoverflow.com/help/mcve so it's easier to help you. – Jinglesting Mar 28 '18 at 12:47
  • Right sorry, I edit it – Robin Mar 28 '18 at 12:52
  • 2
    only looking at what you have got (ignoring the np.choice for now), I could recommend you use a queue https://docs.python.org/2/library/queue.html or dequeue https://docs.python.org/2/library/collections.html#collections.deque (you'll need to select the appropriate one) rather than a list if you are doing many appends, and then popping the results, since they are considerably faster if you're only using them to store data serially, and you don't care about indexing into them. – Jinglesting Mar 28 '18 at 12:52
  • Good point @Jinglesting ! – Robin Mar 28 '18 at 13:00
  • Here is an example of an optimized 2d ising model https://stackoverflow.com/a/45403017/4045774 . If you build up your code with pure python classes, lists and so on you could expect a performance being at least two orders of magnitude slower than it could be. – max9111 Mar 29 '18 at 08:26

1 Answers1

1

I updated your code to get it to run (see below). Here is the profiling I get:

profiling for 150000 steps on a small graph

Looks like the function overhead on getitem here is proportionally very large, in comparison to everything else, though I expect this is in part because of the toy graph.

If this is the case generally, could refactor your code in a way that you are not calling getitem so frequently. Instead, you could move the code from getitem into the set of nested loops in your script like so (but it will be ugly)...

for i in range(number_of_walks_per_node):
    for starting_node in range(number_of_nodes):
        walkers_positions = [start_node]
        for j in range(length_of_walks):
            if randomChoice.index[walkers_positions[j]] == randomChoice.depth:
                randomChoice.choices[walkers_positions[j]] = np.random.choice(randomChoice.neighbors_choices[walkers_positions[j]], size=randomChoice.depth,
                                                     p=randomChoice.neighbors_prob[walkers_positions[j]])
                randomChoice.index[walkers_positions[j]] = 0
            val = randomChoice.choices[walkers_positions[j]][randomChoice.index[walkers_positions[j]]]
            randomChoice.index[walkers_positions[j]] += 1
            walkers_positions.append( val )

Working version of your code for reference...

import numpy as np

class RandomChoice(object):
    def __init__(self, neighbors_choices, neighbors_prob, depth=50):
        C = len(neighbors_choices)
        self.depth = depth
        self.neighbors_choices = neighbors_choices
        self.neighbors_prob = neighbors_prob
        self.index = np.zeros(C, np.uint32)
        self.choices = list()
        for i in range(C):
            self.choices.append(np.random.choice(self.neighbors_choices[i], size=self.depth, p=self.neighbors_prob[i]))

    def __getitem__(self, arg):
        if self.index[arg] == self.depth:
            self.choices[arg] = np.random.choice(self.neighbors_choices[arg], size=self.depth, p=self.neighbors_prob[arg])
            self.index[arg] = 0
        val = self.choices[arg][self.index[arg]]
        self.index[arg] += 1
        return val

# Example with a 3x3 matrix
#  1 3 1
#  3 0 1
#  0 1 0

choices = [
            [0, 1, 2],
            [0, 2],
            [1]
          ]
probs =   [
            [0.25, 0.5, 0.25],
            [0.75, 0.25],
            [1]
          ]

number_of_walks_per_node = 10
number_of_nodes = 3
start_node = 0
length_of_walks = 5000
randomChoice = RandomChoice(choices, probs, depth=50)
for i in range(number_of_walks_per_node):
    for starting_node in range(number_of_nodes):
        walkers_positions = [start_node]
        for j in range(length_of_walks):
            walkers_positions.append( randomChoice[walkers_positions[j]] )
Jinglesting
  • 509
  • 5
  • 16
  • (I edited the code and get the same results). You're right, the getitem is the actual bottleneck. Could you elaborate on "Instead, you could move the loop inside of a method on the class." ? – Robin Mar 28 '18 at 13:21
  • But still, the getitem is here because I can't find any way of speeding random.choice (I would not use a class that generates buffers of choices if it was fast) – Robin Mar 28 '18 at 13:22
  • Sorry, my mistake, if you want to speed this up, you need to move the get item code out of the class, and into the loop. WRT np.choice, I doubt you will find may ways to speed it up. It's written in C under the hood, and will likely be well optimised by the numpy team. I'll update the answer. – Jinglesting Mar 28 '18 at 13:33
  • Well, I just tested using deque and it helps a lot. I think the problem here is all the checks that parform np.random.choice (would be great to check once, and then call multiple times). Maybe I should check the code and implemt mine without the checks... – Robin Mar 28 '18 at 13:37