
Per https://stackoverflow.com/a/48981834/1840471, this is an implementation of the weighted Gini coefficient in Python:

import numpy as np
def gini(x, weights=None):
    if weights is None:
        weights = np.ones_like(x)
    # Weight each pair of observations by the product of their weights, then
    # compute the weighted mean absolute difference over all pairs.
    count = np.multiply.outer(weights, weights)
    mad = np.abs(np.subtract.outer(x, x) * count).sum() / count.sum()
    rmad = mad / np.average(x, weights=weights)
    # Gini equals half the relative mean absolute difference.
    return 0.5 * rmad
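
If I'm reading it correctly, in symbols this is half the relative mean absolute difference, with both the pairwise differences and the mean weighted:

    G = \frac{1}{2 \bar{x}_w} \cdot
        \frac{\sum_i \sum_j w_i w_j \, |x_i - x_j|}{\sum_i \sum_j w_i w_j},
    \qquad
    \bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}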

This is clean and works well for medium-sized arrays, but, as warned in the answer it builds on (https://stackoverflow.com/a/39513799/1840471), it's O(n²) in both time and memory because of the n-by-n outer products. On my computer that means it breaks somewhere above ~20k rows:

n = 20000  # Works, 30000 fails.
gini(np.random.rand(n), np.random.rand(n))

Can this be adjusted to work for larger datasets? Mine is ~150k rows.

janw
Max Ghenis
  • This may be of use: https://www.jstatsoft.org/article/view/v015i01/0 – Alex Feb 27 '18 at 01:15
  • For a related statistic, i.e. the weighted coefficient of variation, see [this answer](https://stackoverflow.com/a/53748541/832230). In its calculation, it uses the standard deviation, not the absolute deviation. – Asclepius Dec 12 '18 at 17:54

2 Answers


Here is a version which is much faster than the one you provided above, and which also uses a simplified formula for the unweighted case to get even faster results there.

import numpy as np

def gini(x, w=None):
    # The rest of the code requires numpy arrays.
    x = np.asarray(x)
    if w is not None:
        w = np.asarray(w)
        sorted_indices = np.argsort(x)
        sorted_x = x[sorted_indices]
        sorted_w = w[sorted_indices]
        # Force float dtype to avoid overflows
        cumw = np.cumsum(sorted_w, dtype=float)
        cumxw = np.cumsum(sorted_x * sorted_w, dtype=float)
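        # Gini via the trapezoidal Lorenz-curve formula (see the note below the code).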
        return (np.sum(cumxw[1:] * cumw[:-1] - cumxw[:-1] * cumw[1:]) / 
                (cumxw[-1] * cumw[-1]))
    else:
        sorted_x = np.sort(x)
        n = len(x)
        cumx = np.cumsum(sorted_x, dtype=float)
        # With all weights equal to 1, the formula above simplifies to:
        return (n + 1 - 2 * np.sum(cumx) / cumx[-1]) / n
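
For reference, the weighted branch computes the trapezoid rule applied to the (weighted) Lorenz curve. Writing P_i for the cumulative weight and X_i for the cumulative weighted value after the i-th sorted element (cumw and cumxw in the code), with P_0 = X_0 = 0:

    G = 1 - \sum_{i=1}^{n} \frac{(P_i - P_{i-1})(X_i + X_{i-1})}{P_n X_n}
      = \frac{\sum_{i=2}^{n} \left( X_i P_{i-1} - X_{i-1} P_i \right)}{X_n P_n}

which is exactly the sum in the return statement above.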

Here is some test code to check we get (mostly) the same results:

>>> x = np.random.rand(1000000)
>>> w = np.random.rand(1000000)
>>> gini_max_ghenis(x, w)
0.33376310938610521
>>> gini(x, w)
0.33376310938610382

But the speed is very different:

%timeit gini(x, w)
203 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit gini_max_ghenis(x, w)
55.6 s ± 3.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

If you remove the pandas ops from the function, it is already much faster:

%timeit gini_max_ghenis_no_pandas_ops(x, w)
1.62 s ± 75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
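
Concretely, by "removing the pandas ops" I mean something along these lines (a sketch reconstructed for illustration; the exact function I timed may differ in details). It is the same algorithm as in Max Ghenis' answer below, but with plain NumPy arrays and integer indexing instead of Series and .iloc:

import numpy as np

def gini_max_ghenis_no_pandas_ops(x, w=None):
    x = np.asarray(x)
    if w is None:
        w = np.ones_like(x)
    w = np.asarray(w)
    n = x.size
    wxsum = np.sum(w * x)
    wsum = np.sum(w)
    order = np.argsort(x)
    sx = x[order] * w[order]
    sw = w[order]
    pxi = np.cumsum(sx) / wxsum
    pci = np.cumsum(sw) / wsum
    g = 0.0
    # Plain Python loop over ndarrays instead of Series.iloc lookups.
    for i in range(1, n):
        g += pxi[i] * pci[i - 1] - pci[i] * pxi[i - 1]
    return g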

If you want to squeeze out the last drop of performance you could use Numba or Cython, but that would only gain a few percent because most of the time is spent in sorting:

%timeit ind = np.argsort(x); sx = x[ind]; sw = w[ind]
180 ms ± 4.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
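
For example, a minimal Numba sketch of the weighted branch (illustrative and untested here; it assumes numba is installed and float64 inputs, since np.cumsum keeps the input dtype) could look like this:

import numpy as np
from numba import njit

@njit(cache=True)
def gini_numba(x, w):
    # Same formula as the weighted branch above, compiled by Numba.
    order = np.argsort(x)
    sorted_x = x[order]
    sorted_w = w[order]
    cumw = np.cumsum(sorted_w)
    cumxw = np.cumsum(sorted_x * sorted_w)
    total = 0.0
    for i in range(1, len(cumw)):
        total += cumxw[i] * cumw[i - 1] - cumxw[i - 1] * cumw[i]
    return total / (cumxw[-1] * cumw[-1])

Most of the time would still go into np.argsort, so do not expect a large win.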

Edit: gini_max_ghenis is the code from Max Ghenis' answer below.

  • What are `gini_slow` and `gini_slow2`? – Mr. T Mar 30 '18 at 10:08
  • gini_slow is the version posted by Max Ghenis above. gini_slow2 is that function with all pandas-based operations removed (i.e. it does not create Series and thus uses normal indexing instead of .iloc). – Gaëtan de Menten Mar 30 '18 at 14:41
  • @A-B-B I just took a known algorithm (which incidentally matches the code given by Max Ghenis) and produced a vectorized numpy version of it. The data I use it on never has negatives, so I never did the research. From a quick glance at the page linked in Max's answer I would say it likely does, but I don't have the time or inclination to actually check that. Please comment back if you do! – Gaëtan de Menten Dec 10 '18 at 08:44
  • Are you aware whether there are fundamentally different approaches to Gini calculation? Your function matches the output of R's `reldist::gini()` but is very different from `DescTools::Gini()`. I'm wondering how this could happen for peer-reviewed code. E.g. [1,1,1,1,1000] is 0.796 for you, but 0.995 for DescTools. Perhaps the methods only diverge in the small-_n_ space. – geotheory Jul 10 '20 at 00:35
  • @geotheory I assume this is because DescTools's version is unbiased by default (multiplied by n/(n-1)): 0.796 * 5/4 = 0.995 – Gaëtan de Menten Jul 10 '20 at 20:56
  • That will be it :) – geotheory Jul 11 '20 at 00:56
  • Is there an implementation that does not require the input to be sorted or to do an all pair comparison of all values? Ideally a streaming algorithm or something that could be parallelized and would work with very large inputs that do not necessarily fit in memory. – Marsellus Wallace Dec 18 '20 at 22:37
  • If you use the "nested sum" definition of Gini, you can easily avoid sorting, but that would not satisfy your "all pair comparison" criterion, wouldn't help with big data, and would change the complexity to O(N²) instead of O(N log N). I don't know of any algorithm for that. It does not mean there is not one, I just don't know of any (and I don't see anything obvious), sorry. – Gaëtan de Menten Dec 19 '20 at 10:34

Adapting the StatsGini R function from here:

import numpy as np
import pandas as pd

def gini(x, w=None):
    # Array indexing requires reset indexes.
    x = pd.Series(x).reset_index(drop=True)
    if w is None:
        w = np.ones_like(x)
    w = pd.Series(w).reset_index(drop=True)
    n = x.size
    wxsum = sum(w * x)
    wsum = sum(w)
    sxw = np.argsort(x)
    sx = x[sxw] * w[sxw]
    sw = w[sxw]
    pxi = np.cumsum(sx) / wxsum
    pci = np.cumsum(sw) / wsum
    g = 0.0
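    # The elementwise .iloc loop below is the slow part for large inputs;
    # see the other answer for a fully vectorized version.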
    for i in np.arange(1, n):
        g = g + pxi.iloc[i] * pci.iloc[i - 1] - pci.iloc[i] * pxi.iloc[i - 1]
    return g

This works for large vectors, at least up to 10M rows:

n = int(1e7)  # np.random.rand needs an integer count.
gini(np.random.rand(n), np.random.rand(n))  # Takes ~15s.

It also produces the same result as the function provided in the question, e.g. returning 0.2553 for this example:

gini(np.array([3, 1, 6, 2, 1]), np.array([4, 2, 2, 10, 1]))

Max Ghenis