I have 5000 observations that are clustered into 5 clusters. Each cluster has 1000 true members. However, after running my clustering algorithm, the result looks like this:
Cluster #, true members, correctly clustered members
0, 1000, 435
1, 1000, 234
2, 1000, 167
3, 1000, 654
4, 1000, 0
In other words, cluster 0 should have 1000 members, but only 435 of those were correctly assigned to that cluster by my algorithm. The remaining observations (the difference between 5000 and the counts listed above) were placed in the wrong cluster.
I would like to calculate the Gini coefficient, and have found the following code:
def gini_ind(Number, Total):
    # Number = correctly clustered members, Total = true members of the cluster
    return 1 - ((Number / Total)**2 + ((Total - Number) / Total)**2)
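Written out, the function above evaluates the expression below. As far as I can tell, this is the two-class form of what is usually called the Gini impurity. The symbol p is my own shorthand for Number/Total and does not appear in the code I found.

```latex
G(p) = 1 - \left(p^2 + (1-p)^2\right) = 2p(1-p), \qquad p = \frac{\text{Number}}{\text{Total}}
% Example for cluster 0, where p = 435/1000 = 0.435:
G(0.435) = 2 \cdot 0.435 \cdot 0.565 = 0.49155
```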
It seems to work fine on the tests I have tried. However, no dataset that I have found looks like mine.
So my question is how do I calculate the Gini coefficient?
If I do the following, I get these Gini coefficients for each cluster:
gini_ind(435,1000) -> 0.49155
gini_ind(234,1000) -> 0.3584
gini_ind(167,1000) -> 0.2782
gini_ind(654,1000) -> 0.4525
gini_ind(0,1000) -> 0
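For reference, here is a minimal sketch of how the per-cluster values above can be reproduced. It assumes the counts from my table. The names `clusters` and `per_cluster` are only illustrative, and the function is the same formula as above with lowercase parameter names.

```python
def gini_ind(number, total):
    """Same formula as above: 1 - p**2 - (1 - p)**2 with p = number / total."""
    p = number / total
    return 1 - (p ** 2 + (1 - p) ** 2)


# (cluster id, true members, correctly clustered members), copied from the table
clusters = [
    (0, 1000, 435),
    (1, 1000, 234),
    (2, 1000, 167),
    (3, 1000, 654),
    (4, 1000, 0),
]

# One value per cluster, keyed by cluster id
per_cluster = {cid: gini_ind(correct, true) for cid, true, correct in clusters}

for cid, value in per_cluster.items():
    print(cid, round(value, 5))
```

The value is 0 when nothing (or everything) is correctly assigned and peaks at 0.5 when exactly half of the members are, which is why cluster 4 gives 0.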
Is that the correct Gini coefficient for each of the clusters?
And to get the average Gini coefficient, do I simply take the mean of the per-cluster values: (0.49155+0.3584+0.2782+0.4525+0)/5 ?
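The averaging I have in mind can be sketched as follows. This is only an illustration of the plain (unweighted) mean, assuming every cluster has the same true size of 1000. The names `correct_counts`, `true_size` and `average_gini` are made up for this example.

```python
from statistics import mean


def gini_ind(number, total):
    """Same formula as above: 1 - p**2 - (1 - p)**2 with p = number / total."""
    p = number / total
    return 1 - (p ** 2 + (1 - p) ** 2)


# Correctly clustered members for clusters 0..4, copied from the table
correct_counts = [435, 234, 167, 654, 0]
true_size = 1000  # assumed identical for every cluster

values = [gini_ind(c, true_size) for c in correct_counts]

# Unweighted mean; with equal cluster sizes a size-weighted mean would be the same
average_gini = mean(values)
print(round(average_gini, 4))
```

If the clusters had different true sizes, I assume the per-cluster values would have to be weighted by size instead of averaged directly.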