
I'm trying to execute the following

from numpy import *
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]])  # large nparray
for item in x:
    set(item)

and it takes very long compared to:

x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]])  # large nparray
for item in x:
    item.tolist()

Why does it take so much longer to convert a NumPy array to a set than to a list? Both conversions are basically O(n), aren't they?

Felix Ha
  • converting to a `set` involves some hashing & computing "where" to insert the new element in the set. Converting to `list` is probably just a straightforward copy. – Jean-François Fabre May 28 '17 at 07:09
  • What's the shape of your `x`? dtype? I see 3 and 4 element sublists. But yes, `set` is a Python object similar to a dictionary (keys but no values). `tolist` is a compiled numpy method. – hpaulj May 28 '17 at 07:43
  • Let n be the length of an item in x. In the worst case there could be n items in every sublist. Is there any option to do it faster? – Felix Ha May 28 '17 at 07:56
  • What's the purpose of doing `set` on each sublist in `a` and throwing away the result? – hpaulj May 28 '17 at 08:52
  • In the main application I have a 2D data set. Next I perform a range query with a kd-tree, which returns a np array as above. Each sublist holds the indices of the neighbours of a point (therefore there could be N in each sublist). But I need the indices of the neighbours of a point as a set for the next step in my application. – Felix Ha May 28 '17 at 09:34
  • What is the next step of your application? In all likelihood, there is a solution to that that does not require a set. – Eelco Hoogendoorn May 28 '17 at 15:24
  • The next step is the intersection between every point and the best point. The best point is the point with the most neighbours. If I take a set, this only takes (worst case) O(n) for every point. I have already tried numpy.intersect1d and a list, but the performance of set was much better than the other two (apart from this first step). – Felix Ha May 28 '17 at 16:55

2 Answers


TL;DR: The set() function creates a set using Python's iteration protocol. But iterating (on the Python level) over NumPy arrays is so slow that using tolist() to convert the array to a Python list before doing the iteration is (much) faster.

To understand why iterating over NumPy arrays is so slow it's important to know how Python objects, Python lists, and NumPy arrays are stored in memory.

A Python object needs some bookkeeping properties (like the reference count, a link to its class, ...) and the value it represents. For example the integer ten = 10 could look like this:

[Diagram: the name ten pointing to a Python int instance that holds the bookkeeping data plus the value 10.]

The blue circle is the "name" you use in the Python interpreter for the variable ten and the lower object (instance) is what actually represents the integer (since the bookkeeping properties aren't important here I ignored them in the images).
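If you want to see that bookkeeping overhead yourself, sys.getsizeof reports the full size of such a box (a quick added sketch; the exact number depends on your Python version and platform):

import sys

ten = 10
# The reported size covers the bookkeeping (reference count, type pointer, ...)
# plus the value itself - e.g. 28 bytes on a 64-bit CPython 3.
print(sys.getsizeof(ten))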

A Python list is just a collection of Python objects, for example mylist = [1, 2, 3] would be saved like this:

[Diagram: mylist pointing to a list instance whose three slots reference three separate Python int objects holding 1, 2 and 3.]

This time the list references the Python integers 1, 2 and 3 and the name mylist just references the list instance.

But an array myarray = np.array([1, 2, 3]) doesn't store Python objects as elements:

[Diagram: myarray pointing to a NumPy array instance that stores the values 1, 2 and 3 directly in its data buffer.]

The values 1, 2 and 3 are stored directly in the NumPy array instance.
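As a rough illustration (an added sketch; the numbers vary with platform and dtype), you can compare the footprint of the raw array data with the size of the Python boxes a list needs:

import sys
import numpy as np

myarray = np.array([1, 2, 3])
print(myarray.dtype)     # e.g. int64: one fixed machine type shared by all elements
print(myarray.nbytes)    # e.g. 24: three 8-byte raw values stored back to back

mylist = [1, 2, 3]
# each element is a full Python int object ("box"), e.g. 28 bytes apiece
print(sum(sys.getsizeof(item) for item in mylist))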


With this information I can explain why iterating over an array is so much slower than iterating over a list:

Each time you access the next element in a list, the list just returns a stored object. That's very fast because the element already exists as a Python object (the list just needs to increment its reference count by one).

On the other hand when you want an element of an array it needs to create a new Python "box" for the value with all the bookkeeping stuff before it is returned. When you iterate over the array it needs to create one Python box for each element in your array:

[Diagram: iterating over the array creates a new Python int "box" for every element it returns.]

Creating these boxes is slow, and it's the main reason why iterating over NumPy arrays is much slower than iterating over Python collections (lists/tuples/sets/dictionaries), which store the values together with their boxes:

import numpy as np
arr = np.arange(100000)
lst = list(range(100000))

def iterateover(obj):
    for item in obj:
        pass

%timeit iterateover(arr)
# 20.2 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit iterateover(lst)
# 3.96 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
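A small extra sketch (not part of the timings above) that makes the boxing visible: the list hands back the very object it stores, while the array builds a fresh NumPy scalar on every access:

import numpy as np

arr = np.arange(100000)
lst = list(range(100000))

print(lst[500] is lst[500])   # True  - the list returns the stored object both times
print(arr[500] is arr[500])   # False - each access creates a new "box" (a NumPy scalar)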

The set "constructor" just does an iteration over the object.

One thing I can't answer definitively is why the tolist method is so much faster. In the end each value in the resulting Python list needs to live in a "Python box", so there's not much work that tolist could avoid. But one thing I know for sure is that list(array) is slower than array.tolist():

arr = np.arange(100000)

%timeit list(arr)
# 20 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit arr.tolist()
# 10.3 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Each of these has O(n) runtime complexity but the constant factors are very different.

In your case you compared set() to tolist() - which isn't a particularly good comparison. It would make more sense to compare set(arr) to list(arr), or set(arr.tolist()) to arr.tolist():

arr = np.random.randint(0, 1000, (10000, 3))

def tosets(arr):
    for line in arr:
        set(line)

def tolists(arr):
    for line in arr:
        list(line)

def tolists_method(arr):
    for line in arr:
        line.tolist()

def tosets_intermediatelist(arr):
    for line in arr:
        set(line.tolist())

%timeit tosets(arr)
# 72.2 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists(arr)
# 80.5 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit tolists_method(arr)
# 16.3 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit tosets_intermediatelist(arr)
# 38.5 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So if you want sets you are better off using set(arr.tolist()). For bigger arrays it could make sense to use np.unique but because your rows only contain 3 items that will likely be slower (for thousands of elements it could be much faster!).
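For completeness, a sketch of the np.unique variant mentioned above, written in the same style as the functions before (only worth trying when the rows are much wider than 3 elements):

import numpy as np

def touniques(arr):
    # np.unique returns the sorted distinct values of each row as an array;
    # it stays inside NumPy, so no Python boxes are created per element.
    for line in arr:
        np.unique(line)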


In the comments you asked about numba, and yes, it's true that numba could speed this up. Numba supports typed sets (only numeric types), but that doesn't mean it will always be faster.

I'm not sure how numba (re-)implements sets but because they are typed it's likely they also avoid the "Python boxes" and store the values directly inside the set:

[Diagram: a numba set storing the numeric values directly in its slots, without Python boxes.]

Sets are more complicated than lists because they involve hashes and empty slots (Python uses open addressing for sets, so I assume numba will too).
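A rough way to see those extra slots on the CPython side (an added sketch; exact sizes differ between Python versions and platforms):

import sys

values = list(range(1000))
print(sys.getsizeof(values))       # e.g. ~8 KB: 1000 pointers plus the list header
print(sys.getsizeof(set(values)))  # e.g. ~32 KB: the hash table reserves many empty slots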

Like the NumPy array, the numba set saves the values directly. So when you convert a NumPy array to a numba set (or vice versa) it won't need to use "Python boxes" at all, and when you create the sets in a numba nopython function it will be much faster even than the set(arr.tolist()) operation:

import numba as nb
@nb.njit
def tosets_numba(arr):
    for lineno in range(arr.shape[0]):
        set(arr[lineno])

tosets_numba(arr)  # warmup
%timeit tosets_numba(arr)
# 6.55 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That's roughly five times faster than the set(arr.tolist()) approach. But it's important to highlight that I did not return the sets from the function. When you return a set from a nopython numba function to Python, numba creates a Python set - including "creating the boxes" for all values in the set (that's something numba hides from you).

Just FYI: the same boxing/unboxing happens if you pass lists to numba nopython functions or return lists from them. So what's an O(1) operation in Python becomes an O(n) operation with numba! That's why it's generally better to pass NumPy arrays to numba nopython functions (which is O(1)).

[Diagram: passing a list into or out of a numba nopython function requires unboxing/boxing every element.]

I assume that if you return these sets from the function (not really possible right now, because numba doesn't support lists of sets) it would be slower (because it creates a numba set and later converts it to a Python set), or only marginally faster (if the conversion numba set -> Python set is really, really fast).

Personally I would use numba for sets only if I don't need to return them from the function, can do all the operations on the set inside the function, and only if all those operations are supported in nopython mode. In any other case I wouldn't use numba here.
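To make that concrete for your use case (the intersections with the "best" point you described in the comments), here is a hedged sketch - the function name and layout are my own, and it assumes the typed-set support described above - that keeps every set inside the nopython function, so nothing ever needs to be boxed:

import numba as nb
import numpy as np

@nb.njit
def intersection_sizes(arr, best_row):
    # Build the set of the "best" row once; inside nopython mode the values
    # are stored unboxed, just like in the NumPy array itself.
    best = set(best_row)
    out = np.empty(arr.shape[0], dtype=np.int64)
    for i in range(arr.shape[0]):
        count = 0
        for value in arr[i]:
            if value in best:   # hashed membership test, no Python objects involved
                count += 1
        out[i] = count          # only plain integers leave the function, via the array
    return out

arr = np.random.randint(0, 1000, (10000, 3))
sizes = intersection_sizes(arr, arr[0])  # sizes[i] = how many values row i shares with row 0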


Just a note: from numpy import * should be avoided. You hide several Python built-in functions when you do that (sum, min, max, ...) and it puts a lot of stuff into your globals. Better to use import numpy as np. The np. in front of function calls makes the code clearer and isn't much extra to type.

MSeifert
  • if I use numba, would it be faster to iterate over the loop? – Felix Ha May 28 '17 at 20:09
  • @FelixHa Yes, numba *could* be faster - if you don't need the sets outside your function. I updated the answer :) – MSeifert May 28 '17 at 22:13
  • thanks, it really speeds up the result!!! Do you have an idea how to implement it in Cython? – Felix Ha May 29 '17 at 06:01
  • @FelixHa Yes, with Cython you can probably speed up the python solutions by a factor of 2 (it reduces the loop overhead and the function call overhead). But without a dedicated data structure (maybe there's some c++ data structure but I haven't checked) for the problem it won't make much sense. If you use IPython you can simply use `%load_ext cython` and then compile a block with `%%cython`. I checked the 4 functions I had in the answer and all were 1.5-2 times faster compared to pure python. But that's a lot slower than numba. – MSeifert May 29 '17 at 09:25
  • so does having a numpy array of dtype `object` mean it stores references like python lists do? Would there then be any benefit to doing `array.astype(object)` instead of `array.tolist()`? I've run some empirical tests that would support this but don't know enough about memory management to rule out coincidences. – Tadhg McDonald-Jensen Jun 02 '17 at 15:53
  • @TadhgMcDonald-Jensen Well, it's more complicated with `object` arrays. For a 1D array it's storing them like a list would. But for ND-arrays it would be different: with nested lists the outer lists refer to the inner lists and the inner-most-lists would refer to the actual objects - an object array would be one array containing references to all elements (as distinct objects) but the multidimensionality is "emulated" using strides. I think for the `array` -> `set` conversion the `tolist` approach would be faster (at least that's what my computer says). – MSeifert Jun 02 '17 at 17:05
  • My shot in the dark: the reason for `arr.tolist()` being twice as fast as `list(arr)` is that `list(arr)` probably uses an iterator interface for initialization, so it doesn't know the size and needs to grow the list dynamically, while `tolist()` can directly allocate as much memory as needed. But I'm not 100% sure... – ead Jan 16 '18 at 11:53
  • @ead Initially I suspected that too but `list` estimates the final size using `__length_hint__` (since Python 3.4) and that gives the correct size for NumPy arrays. I've used Python 3.5 and 3.6 for the timings so it's unlikely to be the cause. – MSeifert Jan 16 '18 at 18:59
  • This, like many of your answers, is a very thorough answer. From time to time I go through your answers and learn a lot from them. I know you don't need this but consider this a small gesture. :) – ayhan Jan 21 '18 at 21:15

Here is a way to speed things up: avoid the loop and use a multiprocessing pool.map trick

from multiprocessing.dummy import Pool as ThreadPool
import multiprocessing

pool = ThreadPool(multiprocessing.cpu_count())  # one worker thread per CPU
y = pool.map(set, x)  # apply set() to every row of x (the array from the question)
pool.close()
pool.join()
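If you want to check whether the pool actually helps for your data, an added sketch (the array here is just a stand-in for the one in the question) that times it against the plain loop:

from multiprocessing.dummy import Pool as ThreadPool
import multiprocessing
import timeit
import numpy as np

x = np.random.randint(0, 1000, (10000, 3))  # stand-in for the array from the question
pool = ThreadPool(multiprocessing.cpu_count())

print(timeit.timeit(lambda: [set(item) for item in x], number=10))  # plain loop
print(timeit.timeit(lambda: pool.map(set, x), number=10))           # thread pool

pool.close()
pool.join()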
zar3bski
  • Interesting, do you have any data (timings) to support the statement that it actually "speeds things up"? – MSeifert Jan 19 '18 at 22:45
  • http://chriskiehl.com/article/parallelism-in-one-line/ just try it, it is amazing! It changed my whole approach of coding – zar3bski Jan 21 '18 at 17:40
  • Since I do not think that your task takes long for computational reasons, you could even try a higher int value in ThreadPool() (e.g. for handshakes with distant servers, I use 30) – zar3bski Jan 21 '18 at 17:44
  • But if I try your code it's 20-500 times **slower** than the original code from the question (depending on the kind of input) so I was wondering what data you used or how you did your timings to justify the "speed things up" statement. – MSeifert Jan 21 '18 at 17:57