
I'm calculating the Gini coefficient (similar to: Python - Gini coefficient calculation using Numpy), but I get an odd result: for a uniform distribution sampled from np.random.rand(), the Gini coefficient is 0.3, but I would have expected it to be close to 0 (perfect equality). What is going wrong here?

import numpy as np
import matplotlib.pyplot as plt

def G(v):
    bins = np.linspace(0., 100., 11)
    total = float(np.sum(v))
    yvals = []
    for b in bins:
        bin_vals = v[v <= np.percentile(v, b)]
        bin_fraction = (np.sum(bin_vals) / total) * 100.0
        yvals.append(bin_fraction)
    # perfect equality area
    pe_area = np.trapz(bins, x=bins)
    # lorenz area
    lorenz_area = np.trapz(yvals, x=bins)
    gini_val = (pe_area - lorenz_area) / float(pe_area)
    return bins, yvals, gini_val

v = np.random.rand(500)
bins, result, gini_val = G(v)
plt.figure()
plt.subplot(2, 1, 1)
plt.plot(bins, result, label="observed")
plt.plot(bins, bins, '--', label="perfect eq.")
plt.xlabel("fraction of population")
plt.ylabel("fraction of wealth")
plt.title("GINI: %.4f" %(gini_val))
plt.legend()
plt.subplot(2, 1, 2)
plt.hist(v, bins=20)

For the given set of numbers, the code above calculates the fraction of the distribution's total that falls at or below each percentile bin.

The result (plot omitted): the observed Lorenz curve bows well below the perfect-equality line, with the histogram of v shown underneath.

A uniform distribution should be near "perfect equality", so the bend in the Lorenz curve seems wrong.

mvd

8 Answers

36

This is to be expected. A random sample from a uniform distribution does not result in uniform values (i.e. values that are all relatively close to each other). With a little calculus, it can be shown that the expected value (in the statistical sense) of the Gini coefficient of a sample from the uniform distribution on [0, 1] is 1/3, so getting values around 1/3 for a given sample is reasonable.

You'll get a lower Gini coefficient with a sample such as v = 10 + np.random.rand(500). Those values are all close to 10.5; the relative variation is lower than in the sample v = np.random.rand(500). In fact, the expected value of the Gini coefficient for the sample base + np.random.rand(n) is 1/(6*base + 3).
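That expectation is easy to sanity-check numerically (a quick sketch of mine, using the same half-relative-mean-absolute-difference definition implemented below):

```python
import numpy as np

def gini(x):
    # Half the relative mean absolute difference (O(n**2); fine for n = 500)
    mad = np.abs(np.subtract.outer(x, x)).mean()
    return 0.5 * mad / np.mean(x)

rng = np.random.default_rng(0)
for base in (0, 1, 10):
    # Average the sample Gini over 100 draws and compare to 1/(6*base + 3)
    est = np.mean([gini(base + rng.random(500)) for _ in range(100)])
    print(base, est, 1 / (6 * base + 3))
```

The averaged estimates land close to 1/3, 1/9 and 1/63 respectively.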

Here's a simple implementation of the Gini coefficient. It uses the fact that the Gini coefficient is half the relative mean absolute difference.

def gini(x):
    # (Warning: This is a concise implementation, but it is O(n**2)
    # in time and memory, where n = len(x).  *Don't* pass in huge
    # samples!)

    # Mean absolute difference
    mad = np.abs(np.subtract.outer(x, x)).mean()
    # Relative mean absolute difference
    rmad = mad/np.mean(x)
    # Gini coefficient
    g = 0.5 * rmad
    return g

(For some more efficient implementations, see More efficient weighted Gini coefficient in Python)

Here's the Gini coefficient for several samples of the form v = base + np.random.rand(500):

In [80]: v = np.random.rand(500)

In [81]: gini(v)
Out[81]: 0.32760618249832563

In [82]: v = 1 + np.random.rand(500)

In [83]: gini(v)
Out[83]: 0.11121487509454202

In [84]: v = 10 + np.random.rand(500)

In [85]: gini(v)
Out[85]: 0.01567937753659053

In [86]: v = 100 + np.random.rand(500)

In [87]: gini(v)
Out[87]: 0.0016594595244509495
Warren Weckesser
  • why do we get different values for ``gini(np.random.rand(500))``? is there an error in my implementation or is it within noise of different calculation methods (i use trapz fitting)? – mvd Sep 15 '16 at 14:35
  • You are computing the Gini coefficient of a random *sample*. The value will be different for different samples. – Warren Weckesser Sep 15 '16 at 14:40
  • *"is there an error in my implementation..."* Try the data shown here: http://peterrosenmai.com/lorenz-curve-graphing-tool-and-gini-coefficient-calculator Do you get 0.7202 for the Gini coefficient? – Warren Weckesser Sep 15 '16 at 14:50
  • FYI here's an O(n) implementation of the Gini coefficient, which also takes weights: https://stackoverflow.com/a/48999797/1840471 – Max Ghenis Feb 27 '18 at 02:04
  • Nice. I see `sxw = np.argsort(x)` in there, which means the function is at best O(n*log(n)), but that's still better than O(n**2)! – Warren Weckesser Feb 27 '18 at 02:27
  • I've been comparing a lot of different Python implementations out there, and so far this is looking to be the best one that I can find. It uses the most consistent and referenced version of the Gini calculation, the answer matches simulations that I've done using other techniques, and it's very easy to follow and has great intuition that other solutions don't offer. Well done!!! A++ Thanks for writing this! – yeamusic21 Sep 30 '20 at 21:39
  • He/she is getting the right answer. The Gini coefficient of the uniform distribution is not 0 "perfect equality", but `(b-a) / (3*(b+a))`. In your case, `b = 1` and `a = 0`, so `Gini = 1/3`. The only distributions with perfect equality are the Kronecker and the Dirac deltas. Equality means "all the same", not "all equally probable". This answer should be **downvoted**. – Pablo MPA Feb 03 '22 at 12:07
  • @PabloMPA, everything you say *agrees* with this answer, so why should it be downvoted? The formula that I gave for the expected Gini coefficient, `1/(6*base + 3)`, is for samples generated by the expression `base + np.random.rand(n)`. In that case, `a = base` and `b = base + 1`, so `(b - a)/(3*(b+a)) = 1/(3*(2*base + 1)) = 1/(6*base + 3)`. – Warren Weckesser Feb 03 '22 at 13:05
  • You are right, @WarrenWeckesser, my fault. – Pablo MPA Feb 03 '22 at 14:15
18

A slightly faster implementation (using numpy vectorization and only computing each difference once):

def gini_coefficient(x):
    """Compute Gini coefficient of array of values"""
    diffsum = 0
    for i, xi in enumerate(x[:-1], 1):
        diffsum += np.sum(np.abs(xi - x[i:]))
    return diffsum / (len(x)**2 * np.mean(x))

Note: x must be a numpy array.

Ulf Aslak
  • I've found this to work as well, but not quite as intuitive as the other solution. – yeamusic21 Sep 30 '20 at 21:43
  • @yeamusic21 thanks for validating. Sure, the speed costs some readability. – Ulf Aslak Oct 01 '20 at 10:44
  • This seems more resilient to clusters than the original function of the post. Applying both functions to an input of [50,50,50,50,50,50,1,1,1,1,1,1] gives a Gini coefficient of 0.48 with this function but only 0.07 with the original function, suggesting equality. They also deal with single outliers very differently – Ricardo Guerreiro Sep 16 '21 at 13:15
  • @RicardoGuerreiro Thanks for pointing this out. Note that 0.48 is also the *correct* coefficient for the input you provided. So there must be some bug in the original answer. – Ulf Aslak Sep 16 '21 at 20:31
  • Does x need to be sorted here? I think not, but always worth checking :) – drevicko Sep 22 '21 at 04:32
  • @drevicko Nope, no need to sort. – Ulf Aslak Sep 23 '21 at 11:20
  • 1) do people ever normalize by x.shape[0] / (x.shape[0] - 1) so that the calculated coefficient maps into [0,1]? 2) could you help me understand what this code is doing, or the functional form that this calculation is mapping onto? sorry, still having trouble understanding it. – shaha Nov 12 '21 at 16:35
  • @shaha Dividing by mean(x) takes care of putting the output value within [0,1]. The functional form is \sum_i\sum_j\abs{x_i-x_j} / (2n^2 \bar{x}). In plain old English it's: "the mean absolute difference of all value pairs, normalized by the value average (to put it between 0-1)". – Ulf Aslak Nov 12 '21 at 21:07
  • I'm observing some really strange behaviour when I feed in a list of more than a 1000 items. Below that it works perfectly fine. Above 1'000 integers (appears to work fine with floating point numbers) it sometimes gives a negative number and it appears to shrink slowly. At 1050 items I observed about 5% negatives, by 1100 it was 50-50 and by 1120 it's 98%. By this point it's also 10% smaller than I expect. Any ideas why this could be? (My current work-around is feeding in floats, which works fine even at 10'000 items) – Josh Jan 02 '22 at 15:10
  • @Josh strange... I cannot reproduce that. Which version of Python are you on? Would like to see a code example that triggers this inconsistency, if you can link to it. – Ulf Aslak Jan 04 '22 at 10:05
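On the negative values @Josh reports for large integer inputs: one plausible explanation (an assumption on my part, not verified against that exact setup) is integer overflow — on platforms where NumPy's default integer type is 32-bit (notably Windows), the running diffsum can silently wrap once the accumulated pairwise differences exceed 2**31, producing negative and slowly shrinking results. Casting to float up front, as Josh's workaround effectively does, sidesteps it:

```python
import numpy as np

def gini_coefficient_safe(x):
    """Same computation as gini_coefficient above, but accumulated in
    float64 so the running sum cannot wrap around on platforms where
    NumPy's default integer type is 32-bit."""
    x = np.asarray(x, dtype=np.float64)   # cast once, up front
    diffsum = 0.0
    for i, xi in enumerate(x[:-1], 1):
        diffsum += np.sum(np.abs(xi - x[i:]))
    return diffsum / (len(x)**2 * np.mean(x))
```

For x = 1, 2, ..., n the Gini coefficient is (n-1)/(3n), which this version reproduces even for a few thousand integer inputs.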
6

The Gini coefficient is twice the area between the Lorenz curve and the line of perfect equality; it is usually calculated for analyzing the distribution of income in a population. https://github.com/oliviaguest/gini provides a simple Python implementation.

bhartii
  • It's a gentle implementation using a library with PRs and proven code. I'm removing my downvote here. – Flavio Oct 29 '19 at 08:25
2

A quick note on the original methodology:

When calculating Gini coefficients directly from areas under curves with np.trapz or another integration method, the first value of the Lorenz curve needs to be 0 so that the area between the origin and the second value is accounted for. The following changes to G(v) fix this:

yvals = [0]
for b in bins[1:]:

I also discussed this issue in this answer, where including the origin in those calculations makes the result match the other methods discussed here (which do not need the 0 to be prepended).

In short, when calculating Gini coefficients directly using integration, start from the origin. If using the other methods discussed here, then it's not needed.
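For completeness, here is the question's G(v) with that fix applied (a sketch of mine, following the question's variable names; the trapezoid rule is written out explicitly because np.trapz was renamed np.trapezoid in NumPy 2.0):

```python
import numpy as np

def trapezoid_area(y, x):
    # Explicit trapezoidal rule, equivalent to np.trapz / np.trapezoid
    y = np.asarray(y, dtype=float)
    return np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0

def G(v):
    bins = np.linspace(0., 100., 11)
    total = float(np.sum(v))
    yvals = [0]                          # Lorenz curve starts at the origin
    for b in bins[1:]:
        bin_vals = v[v <= np.percentile(v, b)]
        yvals.append(np.sum(bin_vals) / total * 100.0)
    pe_area = trapezoid_area(bins, bins)       # area under perfect equality
    lorenz_area = trapezoid_area(yvals, bins)  # area under Lorenz curve
    return bins, yvals, (pe_area - lorenz_area) / pe_area

# With the origin included, a uniform sample lands near the expected 1/3:
_, _, g = G(np.random.default_rng(0).random(5000))
```

The 10-bin trapezoidal approximation introduces a small downward bias, but the value is now close to 1/3 rather than drifting low.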

1

Note that the Gini index is available in skbio.diversity.alpha as gini_index. It may give slightly different results from the examples mentioned above.

Leo
1

You are getting the right answer. The Gini Coefficient of the uniform distribution is not 0 "perfect equality", but (b-a) / (3*(b+a)). In your case, b = 1, and a = 0, so Gini = 1/3.

The only distributions with perfect equality are the Kronecker and the Dirac deltas. Remember that equality means "all the same", not "all equally probable".

Pablo MPA
1

There is an issue with the previous implementations: they never give a Gini index of 1 for perfectly sparse data.

example:

def gini_coefficient(x):
    """Compute Gini coefficient of array of values"""
    diffsum = 0
    for i, xi in enumerate(x[:-1], 1):
        diffsum += np.sum(np.abs(xi - x[i:]))
    return diffsum / (len(x)**2 * np.mean(x))

gini_coefficient(np.array([0, 0, 1]))

gives the answer 0.666666. That happens because of the implied "integration scheme" it uses.
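A lighter-weight fix (my suggestion, not part of this answer: this is the standard small-sample bias correction) is to scale by n/(n-1), which maps the maximally sparse case back to exactly 1:

```python
import numpy as np

def gini_corrected(x):
    """Pairwise-difference Gini with the n/(n-1) small-sample correction,
    so a single nonzero value among zeros yields exactly 1."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    diffsum = sum(np.sum(np.abs(xi - x[i:])) for i, xi in enumerate(x[:-1], 1))
    # Equivalent to the uncorrected value times n/(n-1)
    return diffsum / (n * (n - 1) * np.mean(x))
```

For [0, 0, 1] this returns 1.0 instead of 0.666..., while a constant array still gives 0.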

Here is another variant that bypasses the issue, although it is computationally heavier:

import numpy as np
from scipy.interpolate import interp1d

def gini(v, n_new=1000):
    """Compute Gini coefficient of array of values"""
    v_abs = np.sort(np.abs(v))
    cumsum_v = np.cumsum(v_abs)
    n = len(v_abs)
    # empirical Lorenz curve, anchored at the origin
    vals = np.concatenate([[0], cumsum_v / cumsum_v[-1]])
    x = np.linspace(0, 1, n + 1)
    # step-function interpolation onto a finer grid
    f = interp1d(x=x, y=vals, kind='previous')
    xnew = np.linspace(0, 1, n_new + 1)
    vals_new = f(xnew)
    return 1 - 2 * np.trapz(y=vals_new, x=xnew)

gini(np.array([0, 0, 1]))

This gives 0.999, which is closer to the expected value of 1.

0

Here's an implementation that is better for small integer values. It saves all the floating point calculations for the end and is thus more accurate. Not intended for large inputs.

def gini_coefficient(x):
    x = sorted(x)
    n = len(x)
    s = sum(x)
    d = n * s
    G = sum(xi * (n - i) for i, xi in enumerate(x))
    return (d + s - 2 * G) / d
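As a quick sanity check (my own example, not from the answer), this exact-arithmetic version agrees with the mean-absolute-difference formula used earlier, and with fractions.Fraction inputs the arithmetic stays exact end to end:

```python
import numpy as np
from fractions import Fraction

def gini_coefficient(x):
    # Definition repeated from above so this example is self-contained
    x = sorted(x)
    n, s = len(x), sum(x)
    d = n * s
    G = sum(xi * (n - i) for i, xi in enumerate(x))
    return (d + s - 2 * G) / d

x = [3, 1, 7, 2, 5]
# Cross-check against half the relative mean absolute difference:
mad_gini = np.abs(np.subtract.outer(x, x)).mean() / (2 * np.mean(x))
print(gini_coefficient(x), mad_gini)          # both are 1/3 here

print(gini_coefficient([Fraction(k) for k in (1, 2, 3, 4)]))  # 1/4
```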