My goal is a fast implementation of the mean difference (Gini) for qualitative data. Since some of the arrays may have millions of values, I am looking for the fastest implementation. While programming, I noticed that a for loop is much faster than an implementation that uses only numpy functions, which contradicts everything I have learned about Python and numpy so far.
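If I am reading the linked Wikipedia article correctly, the quantity I want to compute is (with x_i the count of category i, k the number of categories, and n the total number of observations):

MNDif = 1 - \frac{1}{n(k-1)} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} |x_i - x_j|

Here is the function, with the three versions I tried: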
import numpy as np

def gini(array: np.ndarray) -> float:
    """Calculate the Gini index (MNDif) from an array of value counts,
    e.g. the result of pd.Series.value_counts().
    https://en.wikipedia.org/wiki/Qualitative_variation#MNDif"""
    if len(array) == 1:
        # a single category has no variation
        gini_index = 0.0
    elif len(array) > 1:
        n = np.sum(array)  # total number of observations
        k = len(array)     # number of categories
        #summation = np.sum(np.triu(np.abs(array[:, None] - array)))  # version 1
        #summation = np.sum(np.abs(array[:, None] - array) / 2)       # version 2
        # version 3
        summation = 0
        for i in range(k - 1):
            summation += np.sum(np.abs(array[i] - array[i + 1:]))
        gini_index = 1 - (1 / (n * (k - 1))) * summation
    else:
        gini_index = np.nan
    return gini_index
a = np.ones(10000)
%timeit gini(a)
Result for version 1: 2.01 s ± 61.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Result for version 2: 1.96 s ± 45.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Result for version 3: 166 ms ± 5.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Why is the version with the for loop by far the fastest?
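My best guess so far (please correct me if this is wrong) is that it is about the size of the temporaries: versions 1 and 2 materialize the full k x k difference matrix in memory at once, while the loop in version 3 only ever holds one row of it per iteration. A quick sanity check of the sizes involved (the variable name diff is just for illustration):

import numpy as np

a = np.ones(10000)
# versions 1 and 2 build the complete k x k matrix of pairwise differences
diff = np.abs(a[:, None] - a)   # shape (10000, 10000), float64
print(diff.nbytes / 1e9)        # -> 0.8, i.e. ~0.8 GB for a single temporary
# version 3 only ever allocates one row slice per iteration,
# at most k float64 values (~80 kB)

Is that the right explanation, or is something else going on?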