1

I wanted to implement numpy.random.choice (except the replace argument of it) to see how it works.

This is what I came up with:

from random import uniform
from math import fsum

def select(array, total_count, probability):
    probability_accumulative = []
    last_element = 0
    for i in range(len(probability)):
        probability_accumulative.append(last_element + probability[i])
        last_element = probability_accumulative[i]

    result = []

    if(len(array) != len(probability)):
        raise ValueError("array and probability must have the same size.")
    elif(fsum(probability) != 1.0):
        raise ValueError("probabilities do not sum to 1.")
    else:
        for i in range(total_count):
            rand = uniform(0, 1)
            for j in range(len(probability_accumulative)):
                if(rand < probability_accumulative[j]):
                    result.append(array[j])
                    break

    return result

It seemed to work pretty well, so I decided to write another script to check how much slower my implementation was, compared to numpy.random.choice.

This is the test script I wrote for it:

from random_selection import select
from collections import Counter
from numpy.random import choice
from time import time

def test(array, total_count, probability, method):
    methods = {
        "numpy.random.choice": choice(array, total_count, p=probability),
        "random_selection.select": select(array, total_count, probability)
    }

    if(method in methods):
        probability_dict = {}
        rand_items = methods[method]
        items_counter = Counter(rand_items)

        for item, count in items_counter.most_common():
            probability_dict[item] = f"{100 * count / total_count:.1f}%"
        return probability_dict
    else:
        raise ValueError(f"Method {method} has not been defined.")


def main():
    total_count = 1000000
    array = ['a', 'b', 'c', 'd']
    probability = [0.7, 0.1, 0.1, 0.1]

    print(f"array: {array}")
    print(f"probability: {probability}")
    print(f"size: {total_count}")

    print()

    print('random_selection.select: ')
    start_time = time()
    result = test(array, total_count, probability, 'random_selection.select')
    end_time = time()
    print(result)
    print(f"{(end_time - start_time):.4f} s")

    print()

    print('numpy.random.choice: ')
    start_time = time()
    result = test(array, total_count, probability, 'numpy.random.choice')
    end_time = time()
    print(result)
    print(f"{(end_time - start_time):.4f} s")


if __name__ == "__main__":
    main()

I was quite surprised to see that my implementation was faster!

This is the result for an array size of a million:

array: ['a', 'b', 'c', 'd']
probability: [0.7, 0.1, 0.1, 0.1]
size: 1000000

random_selection.select:
{'a': '70.0%', 'c': '10.0%', 'd': '10.0%', 'b': '10.0%'}
2.5119 s

numpy.random.choice:
{'a': '70.0%', 'b': '10.0%', 'd': '10.0%', 'c': '10.0%'}
3.1098 s

And if I increase the size to 10 million, the difference becomes more noticeable:

array: ['a', 'b', 'c', 'd']
probability: [0.7, 0.1, 0.1, 0.1]
size: 10000000

random_selection.select:
{'a': '70.0%', 'b': '10.0%', 'd': '10.0%', 'c': '10.0%'}
25.6174 s

numpy.random.choice:
{'a': '70.0%', 'b': '10.0%', 'c': '10.0%', 'd': '10.0%'}
31.8087 s

Why is that?

Amir Shabani
  • 3,857
  • 6
  • 30
  • 67
  • well, for starters, `numpy` is meant to work with *arrays* not `list` objects. – juanpa.arrivillaga Nov 22 '19 at 19:22
  • @juanpa.arrivillaga Aren't [arrays represented as lists](https://stackoverflow.com/a/1514557/6282576) in Python? I believe I understand the difference between an array and a list; a `list` can have mixed-typed elements, but an `array` can't, right? But we don't have arrays built-in in Python. One must import the [`array`](https://docs.python.org/2/library/array.html) module. I don't understand how does that make this implementation faster. – Amir Shabani Nov 22 '19 at 19:31
  • 1
    In the context of `numpy`, saying "an array" means an instance of `numpy.ndarray`. You can also use the term "array" more casually, and in that sense, a Python list is a lot like an array. But they're not numpy arrays. – Blckknght Nov 22 '19 at 19:34
  • 1
    No, *arrays are not represented as lists*. When people talk about arrays in Python, they are usually talking about `numpy.ndarray` objects, or arrays from the `array` built-in module. They are both different than `list` object. But I think someone found a problem in how you are profiling it anyway. – juanpa.arrivillaga Nov 22 '19 at 19:34

3 Answers3

7

Your testing code doesn't do what you expect it to do. The test function always calls both of your two random selection functions. Your timing only detects the difference in performance of your analysis code on the results that correspond to the requested function.

The issue is with these lines:

methods = {
    "numpy.random.choice": choice(array, total_count, p=probability),
    "random_selection.select": select(array, total_count, probability)
}

These unconditionally call the choice and select functions, and put the returned values into the dictionary. That's almost certainly not what you expect. You might want to put a lambda function into the dictionary, that calls the desired function with the appropriate arguments when it is called.

Blckknght
  • 100,903
  • 11
  • 120
  • 169
0

This is not necessarily surprising. Python's built in random library is more lightweight than the numpy.random library and so we would expect a simple implementationm based on the random library to be slightly faster. There is a more in depth explanation here.

Connor
  • 1
  • 4
0

As everyone commented, the conversion from list to np.ndarray that occurs when you call any numpy function on a list must be the time-consuming process. If you try this with np.ndarray objects directly, don't hope to beat a Cython-based library like numpy.