I want to calculate pi for a GaussianMixture model: the number of labels in a specific class divided by the total number of labels.
tr_y is a pandas DataFrame:
| index | labels |
|---|---|
| 0 | 6 |
| 1 | 5 |
| 2 | 6 |
| 3 | 5 |
| 4 | 6 |
1000 rows × 1 column.
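For reference, a minimal sketch of the quantity I mean, computed directly in pandas (the toy frame below is a hypothetical stand-in for the real tr_y):

```python
import pandas as pd

# hypothetical toy frame in the same shape as tr_y
tr_y = pd.DataFrame({"labels": [6, 5, 6, 5, 6]})

# pi for class 5: count of label 5 / total number of labels
pi_5 = (tr_y["labels"] == 5).mean()
print(pi_5)  # 0.4 for this toy frame
```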
Then I compare two approaches:
- The first approach uses a list:
```python
%%timeit
y_list = tr_y.values.flatten().tolist()
```
>>> 12.3 µs ± 193 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

```python
%%timeit
sum([1 if y == 5 else 0 for y in y_list]) / len(y_list)
```
>>> 54.9 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
- The second approach uses a NumPy array:
```python
%%timeit
arr = tr_y.to_numpy()
```
>>> 4.55 µs ± 92.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

```python
%%timeit
sum([1 for i in arr if i == 5]) / arr.__len__()
```
>>> 883 µs ± 48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Update: converting the NumPy array to a list with tolist() first is much faster than both previous approaches:

```python
arr = tr_y.to_numpy().tolist()
```

```python
%%timeit
sum([1 for i in arr if i == 5]) / arr.__len__()
```
>>> 43.1 µs ± 410 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
source of the last approach
So, for this item-by-item counting, using a list is faster than using a NumPy array.
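As an aside, a plain list.count (which loops in C) should beat the list comprehension as well; I have not timed it on the same data, so this is just a sketch:

```python
# count() iterates in C, with no Python-level bytecode per element
pi_5 = y_list.count(5) / len(y_list)
```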
I have searched for this and found that NumPy has to box each element it returns in a Python object (a numpy.int64 in this case), which takes time when iterating item by item. Further evidence appears during iteration: checking id() on the elements shows the ids alternating between two separate addresses, which means Python's memory allocator and garbage collector work overtime to create new objects and then free them.
A list doesn't have this allocator/garbage-collector overhead: its objects already exist as Python objects (and will still exist after iteration), so neither plays any role when iterating over a list.
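A small sketch of that id() behaviour (the exact addresses depend on the CPython allocator; the values are chosen outside the small-integer cache so the list half is meaningful):

```python
import numpy as np

arr = np.array([500, 600, 500, 600])

# each iteration boxes the element into a fresh NumPy scalar;
# the ids typically alternate between two reused addresses
print([id(x) for x in arr])

# tolist() creates ordinary Python ints that live in the list,
# so iteration just hands out references and no allocation happens
lst = arr.tolist()
print([id(x) for x in lst])
```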
My search concludes that if we need to work with multidimensional matrices or do vectorized operations, we should use NumPy arrays, which are then faster and use less memory. Is that true?
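For comparison, the fully vectorized version of the same calculation would look like this (a sketch; I have not timed it alongside the loops above):

```python
import numpy as np

arr = tr_y.to_numpy()  # tr_y as defined above
# the comparison and the mean both run in C over the raw buffer,
# with no per-element boxing into numpy.int64 objects
pi_5 = (arr == 5).mean()
```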
Another thing I want to measure is the memory consumption of both the NumPy array and the list. I find that sys.getsizeof is not reliable: it reports only the size of the container itself (the header plus the array of pointers), not the objects the container references. Is there any reliable method to measure memory consumption?
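What I have tried so far, as a sketch: ndarray.nbytes gives the raw data buffer size (and, as far as I can tell, getsizeof on an array that owns its data includes that buffer too), a manual sum over getsizeof approximates the list's deep size (it double-counts cached small ints), and pympler is a third-party package that walks the object graph:

```python
import sys
import numpy as np
from pympler import asizeof  # third-party: pip install pympler

arr = np.arange(1000)
lst = arr.tolist()

# ndarray: nbytes is the raw data buffer; getsizeof on an owning
# array also accounts for that buffer plus the header
print(arr.nbytes, sys.getsizeof(arr))

# list: pointer array plus each referenced int object
print(sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst))

# pympler walks references recursively for a deep size estimate
print(asizeof.asizeof(lst))
```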
Another observation: when I convert the NumPy array to a list, I get a row-contiguous sequence that can be streamed through the L1 cache, rather than a column vector whose strided access causes many L1 cache misses (source).
So what would happen if we used a vector in Fortran (column-major) order?
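My understanding (an assumption, not tested here): a 1-D vector is both C- and Fortran-contiguous at once, so the order flag should change nothing for it; the order only matters for how 2-D and higher arrays are traversed. A sketch:

```python
import numpy as np

a_c = np.arange(1_000_000).reshape(1000, 1000)  # row-major (C order)
a_f = np.asfortranarray(a_c)                    # column-major (Fortran order)

# traversal that matches the memory layout stays cache-friendly:
a_c.sum(axis=1)  # rows are contiguous in C order
a_f.sum(axis=0)  # columns are contiguous in Fortran order

# a 1-D vector satisfies both contiguity flags simultaneously
v = np.arange(1000)
print(v.flags['C_CONTIGUOUS'], v.flags['F_CONTIGUOUS'])  # True True
```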