I came across an interesting bar chart and found the underlying data (both are linked in the code comments below). I am attempting to recreate how the data has been grouped, both by range bins (for which I have used pd.cut) and by country.

Here is the code I have attempted so far, but I get errors; the erroneous lines are marked with comments:

import pandas as pd

## csv file in zip http://ec.europa.eu/eurostat/cache/GISCO/geodatafiles/GEOSTAT-grid-POP-1K-2011-V2-0-1.zip

url="C:/Users/Simon/Downloads/GEOSTAT-grid-POP-1K-2011-V2-0-1/Version 2_0_1/GEOSTAT_grid_POP_1K_2011_V2_0_1.csv"
whole=pd.read_csv(url, low_memory=False)

populationDensity=whole[['TOT_P','CNTR_CODE']]


## trying to replicate graph here http://www.centreforcities.org/wp-content/uploads/2018/04/18-04-16-Square-kilometre-units-of-land-by-population.png
## which aggregates the records by brackets


# https://stackoverflow.com/questions/25010215/pandas-groupby-how-to-compute-counts-in-ranges#answer-25010952
ranges = [0,10000,15000,20000,25000,30000,35000,40000,45000,1000000]
bins=pd.cut(populationDensity['TOT_P'],ranges)



#print(bins)

## the following fails with error :
## AttributeError: Cannot access callable attribute 'groupby' of 'DataFrameGroupBy' objects, try using the 'apply' method
#print (populationDensity.groupby(['CNTR_CODE']).groupby(bins).count())

## the following fails with error :
## TypeError: 'Series' objects are mutable, thus they cannot be hashed
print (populationDensity.groupby(['CNTR_CODE'],pd.cut(populationDensity['TOT_P'],ranges)).count())

#relevant https://stackoverflow.com/questions/21441259/pandas-groupby-range-of-values#answer-21441621

I've only just started using pandas. I will try again tomorrow; in the meantime, if anyone knows ...

S Meaden

1 Answer

Change:

print (populationDensity.groupby(['CNTR_CODE'],pd.cut(populationDensity['TOT_P'],ranges)).count())

to

print (populationDensity.groupby(['CNTR_CODE', pd.cut(populationDensity['TOT_P'],ranges)]).count())
                                            ^                                           ^

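because all grouping keys must go into the single by parameter of groupby, which accepts a list of multiple column names, a mix of column names and Series, or multiple Series. In the original call the Series was passed as a separate second argument rather than as part of by. From the documentation: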

by : mapping, function, label, or list of labels

Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted a (single) key.
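To see the corrected call in isolation, here is a minimal sketch on a small made-up frame. The sample values, the shortened ranges list, and the names population_density, counts and table are assumptions for illustration only; they are not taken from the GEOSTAT file. The unstack step is also an addition, included just to get one row per country and one column per bracket.

```python
import pandas as pd

# Hypothetical stand-in for the GEOSTAT grid: one row per square kilometre.
population_density = pd.DataFrame({
    "TOT_P": [5, 12000, 300, 17000, 16000, 9000, 11000],
    "CNTR_CODE": ["UK", "UK", "UK", "UK", "ES", "ES", "ES"],
})

# Shortened bracket edges, for illustration only.
ranges = [0, 10000, 15000, 20000]

# Bin each cell by population; the result is a Series aligned with the frame.
bins = pd.cut(population_density["TOT_P"], ranges)

# Both keys go into ONE list: a column label and the binned Series.
# observed=False keeps every bracket in the result, even empty ones.
counts = (
    population_density
    .groupby(["CNTR_CODE", bins], observed=False)["TOT_P"]
    .count()
)

# Reshape so each country is a row and each bracket is a column.
table = counts.unstack(fill_value=0)
print(table)
```

The same pattern should carry over to the real data by swapping in the full ranges list and the frame read from the CSV.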

jezrael