I want to calculate pi for a GaussianMixture model: the number of labels in a specific class divided by the total number of labels.
tr_y is a pandas DataFrame:
| index | labels |
|---|---|
| 0 | 6 |
| 1 | 5 |
| 2 | 6 |
| 3 | 5 |
| 4 | 6 |
1000 rows × 1 column.
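For reference, a minimal sketch of the quantity I mean, computed directly in pandas (the toy frame below is a hypothetical stand-in for the real tr_y):

```python
import pandas as pd

# hypothetical toy frame in the same shape as tr_y
tr_y = pd.DataFrame({"labels": [6, 5, 6, 5, 6]})

# pi for class 5: count of label 5 / total number of labels
pi_5 = (tr_y["labels"] == 5).mean()
print(pi_5)  # 0.4 for this toy frame
```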
Then I compare two approaches:
- The first approach uses a list:
```python
%%timeit
y_list = tr_y.values.flatten().tolist()
```
>>> 12.3 µs ± 193 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

```python
%%timeit
sum([1 if y == 5 else 0 for y in y_list]) / len(y_list)
```
>>> 54.9 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
- The second approach uses a NumPy array:
```python
%%timeit
arr = tr_y.to_numpy()
```
>>> 4.55 µs ± 92.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

```python
%%timeit
sum([1 for i in arr if i == 5]) / arr.__len__()
```
>>> 883 µs ± 48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Update: converting the NumPy array to a list with tolist() first is much faster than both previous approaches:

```python
arr = tr_y.to_numpy().tolist()
```

```python
%%timeit
sum([1 for i in arr if i == 5]) / arr.__len__()
```
>>> 43.1 µs ± 410 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
source of the last approach
So, for this item-by-item counting, using a list is faster than using a NumPy array.
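As an aside, a plain list.count (which loops in C) should beat the list comprehension as well; I have not timed it on the same data, so this is just a sketch:

```python
# count() iterates in C, with no Python-level bytecode per element
pi_5 = y_list.count(5) / len(y_list)
```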
I have searched for this and found that NumPy has to box each element it returns in a Python object (a numpy.int64 in this case), which takes time when iterating item by item. Further evidence appears during iteration: checking id() on the elements shows the ids alternating between two separate addresses, which means Python's memory allocator and garbage collector work overtime to create new objects and then free them.
A list doesn't have this allocator/garbage-collector overhead: its objects already exist as Python objects (and will still exist after iteration), so neither plays any role when iterating over a list.
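A small sketch of that id() behaviour (the exact addresses depend on the CPython allocator; the values are chosen outside the small-integer cache so the list half is meaningful):

```python
import numpy as np

arr = np.array([500, 600, 500, 600])

# each iteration boxes the element into a fresh NumPy scalar;
# the ids typically alternate between two reused addresses
print([id(x) for x in arr])

# tolist() creates ordinary Python ints that live in the list,
# so iteration just hands out references and no allocation happens
lst = arr.tolist()
print([id(x) for x in lst])
```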
My search concludes that if we need to work with multidimensional matrices or do vectorized operations, we should use NumPy arrays, which are then faster and use less memory. Is that true?
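For comparison, the fully vectorized version of the same calculation would look like this (a sketch; I have not timed it alongside the loops above):

```python
import numpy as np

arr = tr_y.to_numpy()  # tr_y as defined above
# the comparison and the mean both run in C over the raw buffer,
# with no per-element boxing into numpy.int64 objects
pi_5 = (arr == 5).mean()
```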
Another thing I want to measure is the memory consumption of both the NumPy array and the list. I find that sys.getsizeof is not reliable: it reports only the size of the container itself (the header plus the array of pointers), not the objects the container references. Is there any reliable method to measure memory consumption?
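What I have tried so far, as a sketch: ndarray.nbytes gives the raw data buffer size (and, as far as I can tell, getsizeof on an array that owns its data includes that buffer too), a manual sum over getsizeof approximates the list's deep size (it double-counts cached small ints), and pympler is a third-party package that walks the object graph:

```python
import sys
import numpy as np
from pympler import asizeof  # third-party: pip install pympler

arr = np.arange(1000)
lst = arr.tolist()

# ndarray: nbytes is the raw data buffer; getsizeof on an owning
# array also accounts for that buffer plus the header
print(arr.nbytes, sys.getsizeof(arr))

# list: pointer array plus each referenced int object
print(sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst))

# pympler walks references recursively for a deep size estimate
print(asizeof.asizeof(lst))
```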
Another observation: when I convert the NumPy array to a list, I get a row-contiguous sequence that can be streamed through the L1 cache, rather than a column vector whose strided access causes many L1 cache misses (source).
So what would happen if we used a vector in Fortran (column-major) order?
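My understanding (an assumption, not tested here): a 1-D vector is both C- and Fortran-contiguous at once, so the order flag should change nothing for it; the order only matters for how 2-D and higher arrays are traversed. A sketch:

```python
import numpy as np

a_c = np.arange(1_000_000).reshape(1000, 1000)  # row-major (C order)
a_f = np.asfortranarray(a_c)                    # column-major (Fortran order)

# traversal that matches the memory layout stays cache-friendly:
a_c.sum(axis=1)  # rows are contiguous in C order
a_f.sum(axis=0)  # columns are contiguous in Fortran order

# a 1-D vector satisfies both contiguity flags simultaneously
v = np.arange(1000)
print(v.flags['C_CONTIGUOUS'], v.flags['F_CONTIGUOUS'])  # True True
```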