
My goal is to find a fast implementation of the mean difference (Gini) for qualitative data. Since some of the arrays may contain millions of values, I am looking for the fastest solution. While programming it, I noticed that a plain for loop is much faster than an implementation using only numpy functions. This contradicts everything I have learned about Python and numpy so far:

import numpy as np

def gini(array: np.ndarray) -> float:
    """Calculate the Gini index of an array for value_counts of a pd.Series Values.
    https://en.wikipedia.org/wiki/Qualitative_variation#MNDif"""

    if len(array) == 1:
        gini_index = 0.0

    elif len(array) > 1:
        n = np.sum(array)
        k = len(array)
        #summation = np.sum(np.triu(np.abs(array[:, None] - array))) # version 1
        #summation = np.sum(np.abs(array[:, None] - array) / 2) # version 2

        # version3
        summation = 0
        for i in range(k-1):
            summation += np.sum(np.abs(array[i]-array[i+1:]))

        gini_index = 1 - (1/(n*(k-1))) * summation

    else:
        gini_index = np.nan

    return gini_index

a = np.ones(10000)
%timeit gini(a)

Result for version1: 2.01 s ± 61.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Result for version2: 1.96 s ± 45.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Result for version3: 166 ms ± 5.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Why is the version with the for loop by far the fastest?
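My guess is that the broadcasted versions are memory-bound: `array[:, None] - array` materialises a full k × k matrix, and `np.abs` / `np.triu` each allocate another one the same size, while the loop only ever holds a length-k temporary. A rough back-of-the-envelope check of the temporary sizes:

```python
import numpy as np

k = 10_000
a = np.ones(k)

# versions 1 and 2 build a full k x k float64 matrix of differences,
# and np.abs / np.triu each allocate another matrix of the same size
per_temp_bytes = a.itemsize * k * k
print(per_temp_bytes / 1e9, "GB per temporary")   # 0.8 GB per temporary

# version 3 only ever allocates a slice of at most k elements per iteration
print(a.itemsize * k / 1e6, "MB per temporary")   # 0.08 MB per temporary
```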

  • `(array * np.arange(len(array)) - np.add.accumulate(array)).sum()` – Veedrac Apr 17 '20 at 22:05
  • @Veedrac: Thank you very much, this helped a lot, but your code was just close: `np.sum(array * np.arange(len(array))[::-1]) - np.sum(np.add.accumulate(array[::-1])[:-1])` 92.3 µs ± 2.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) – MaGarb Apr 18 '20 at 08:41
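For reference, the idea behind the comments can be packaged into a complete function. This is a sketch, not an official answer: it assumes the cumulative-sum trick is only valid on a *sorted* array (for i < j in a descending-sorted array, every |a_i − a_j| telescopes into a weighted sum minus cumulative sums), so it sorts first, turning the O(k²) pairwise sum into O(k log k):

```python
import numpy as np

def gini_fast(array: np.ndarray) -> float:
    """Gini index via sorting and cumulative sums instead of O(k^2) pairwise differences."""
    array = np.asarray(array, dtype=float)
    k = len(array)
    if k == 0:
        return np.nan
    if k == 1:
        return 0.0
    n = array.sum()
    # sort descending; pd.Series.value_counts already returns counts in this order
    s = np.sort(array)[::-1]
    # sum of |s[i] - s[j]| over all i < j, computed with one weighted sum
    # and one cumulative sum rather than an explicit double loop
    summation = np.sum(s * np.arange(k)[::-1]) - np.sum(np.add.accumulate(s[::-1])[:-1])
    return 1 - summation / (n * (k - 1))
```

On an already-sorted input this matches the loop version's result while avoiding both the k × k temporaries and the Python-level loop.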
