
I have a pandas dataframe that has monthly counts at various hierarchical levels. It is in long format, and I want to convert to wide format, with columns for each level of aggregation.

It is of the following format:

date | country | state | county | population 
01-01| cc1     | s1    | c1     | 5
01-01| cc1     | s1    | c2     | 4
01-01| cc1     | s2    | c1     | 10
01-01| cc1     | s2    | c2     | 11
02-01| cc1     | s1    | c1     | 6
02-01| cc1     | s1    | c2     | 5
02-01| cc1     | s2    | c1     | 11
02-01| cc1     | s2    | c2     | 12
.
.

Now I want to transform this into the following format:

date  | country_pop | s1_pop | s2_pop | .. | s1_c1_pop | s1_c2_pop | s2_c1_pop | s2_c2_pop | ..
01-01 | 30          | 9      | 21     | .. | 5         | 4         | 10        | 11        | ..
02-01 | 34          | 11     | 23     | .. | 6         | 5         | 11        | 12        | ..
.
.

The total number of states is 4: s1...s4.

The counties in each state are labelled c1...c10 (some states might have fewer, and I want those columns to be zeros).

I want to get a time series at each level of aggregation, ordered by date. How do I get this?
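
For reference, a minimal reproducible frame matching the sample above (an illustrative reconstruction, first two months only) could be built like this:

import pandas as pd

# Sample data as shown in the table above (illustrative reconstruction)
df = pd.DataFrame({
    'date':       ['01-01'] * 4 + ['02-01'] * 4,
    'country':    ['cc1'] * 8,
    'state':      ['s1', 's1', 's2', 's2'] * 2,
    'county':     ['c1', 'c2', 'c1', 'c2'] * 2,
    'population': [5, 4, 10, 11, 6, 5, 11, 12],
})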

Dumbo
  • looks like a `pivot_table`/`groupby` problem and then merge. – Quang Hoang Sep 05 '19 at 14:14
  • You mean: make a pivot table at each level of aggregation with date, count_for_that_level, then merge all of these individual pivot tables by date? That seems clunky; is there a cleaner way to do this? – Dumbo Sep 05 '19 at 14:55

1 Answer


Let's do it this way: use sum with the level parameter and pd.concat all the dataframes together.

import pandas as pd

#Aggregate to lowest level of detail
df_agg = df.groupby(['country', 'date', 'state', 'county'])[['population']].sum()

#Reshape dataframe and flatten multiindex column header
df_county = df_agg.unstack([-1, -2])
df_county.columns = [f'{s}_{c}_{p}' for p, c, s in df_county.columns]

#Sum to next level of detail and reshape
df_state = df_agg.sum(level=[0, 1, 2]).unstack()
df_state.columns = [f'{s}_{p}' for p, s in df_state.columns]

#Sum to country level 
df_country = df_agg.sum(level=[0, 1])

#pd.concat horizontally with axis=1
df_out = pd.concat([df_country, df_state, df_county], axis=1).reset_index()

Output:

  country   date  population  s1_population  s2_population  s1_c1_population  \
0     cc1  01-01          30              9             21                 5   
1     cc1  02-01          34             11             23                 6   

   s1_c2_population  s2_c1_population  s2_c2_population  
0                 4                10                11  
1                 5                11                12  
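
Two small caveats on the code above: the question asks for zero columns when a state has fewer counties, which unstack's fill_value=0 covers for county labels that appear elsewhere in the data (labels absent from the data entirely would still need a reindex); and newer pandas versions (1.3+) deprecate sum(level=...) in favour of an explicit groupby(level=...). A rough sketch of the adjusted calls:

# Sketch only: fill missing state/county combinations with zeros and use
# groupby(level=...) instead of the deprecated sum(level=...)
df_county = df_agg.unstack([-1, -2], fill_value=0)
df_county.columns = [f'{s}_{c}_{p}' for p, c, s in df_county.columns]

df_state = df_agg.groupby(level=[0, 1, 2]).sum().unstack(fill_value=0)
df_state.columns = [f'{s}_{p}' for p, s in df_state.columns]

df_country = df_agg.groupby(level=[0, 1]).sum()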
Scott Boston
  • What is the f in f'{s}_{c}_{p}' in the third line of code? – Dumbo Sep 05 '19 at 16:29
  • This is using f-string formatting to re-arrange and flatten the multiindex column headers. – Scott Boston Sep 05 '19 at 16:30
  • In the df_county dataframe you will have a three-level column header after the unstack... So, for p, c, s, which stand for the population, county, and state levels... I use f-string formatting to change the position of those levels and flatten them to a single level. – Scott Boston Sep 05 '19 at 16:32
  • This is a Python 3.6+ feature. Sorry, if you are using Python 2, you will need to write it differently. – Scott Boston Sep 05 '19 at 16:32
  • What is the regular syntax? I'd appreciate it. – Dumbo Sep 05 '19 at 17:20
  • Check out this post using `.format` or `map`: https://stackoverflow.com/a/43859132/6361531 – Scott Boston Sep 05 '19 at 18:33
  • What if I wanted to aggregate at one more level? Say, world, where all the country numbers, with multiple countries, would add up. Could you modify the current answer and add one more code chunk (based on the current code)? I am trying to follow how levels and unstack work, and you seem to grok this; I am trying to learn from the example problem. – Dumbo Sep 06 '19 at 01:52
  • Then you are back to doing multiple `groupby` and concatenating the results together. – Scott Boston Sep 06 '19 at 17:53
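
For the .format question above, a sketch of the same column flattening without f-strings (works on Python 2.7 and 3.x, using the answer's variable names) would be:

# Equivalent of the f-string flattening using str.format
df_county.columns = ['{}_{}_{}'.format(s, c, p) for p, c, s in df_county.columns]
df_state.columns = ['{}_{}'.format(s, p) for p, s in df_state.columns]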
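
And to make the last comment concrete, here is a rough sketch of the "multiple groupby and concatenate" approach for an extra world level (the levels dict and the column names are illustrative, assuming the frame contains several countries):

# Sketch only: aggregate each level separately and concatenate on the shared
# 'date' index; with several countries the column names carry a country prefix.
levels = {
    'world':   ['date'],
    'country': ['date', 'country'],
    'state':   ['date', 'country', 'state'],
    'county':  ['date', 'country', 'state', 'county'],
}

pieces = []
for name, keys in levels.items():
    agg = df.groupby(keys)['population'].sum()
    if len(keys) > 1:
        # Pivot the non-date keys into columns and flatten the header
        agg = agg.unstack(keys[1:], fill_value=0)
        agg.columns = ['_'.join(col) + '_pop' if isinstance(col, tuple)
                       else '{}_pop'.format(col) for col in agg.columns]
    else:
        agg = agg.rename('world_pop')
    pieces.append(agg)

df_out = pd.concat(pieces, axis=1).reset_index()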