1

I am working with a large Pandas dataframe that has multiple index levels in both rows and columns, something like:
df
       Metric | Population         Homes           
         Year | 2018  2019  2020   2018  2019  2020
Town   Sector |
----   ------ | ----  ----  ----   ----  ----  ----
  A       1   |  100   110   120     50    52    52
          2   |  200   205   210     80    80    80
          3   |  300   300   300    100   100   100
  B       1   |   50    60    70     20    22    24
          2   |  100   100   100     40    40    40
          3   |  150   140   130     50    47    44

I need to perform calculations for groups of columns, eg. find ratio between population and homes.
Step by step that would be:

# 1. Calculation
R = df["Population"] / df["Homes"]
R
         Year | 2018  2019  2020
Town   Sector | 
----   ------ | ----  ----  ----
  A       1   |  2.0   2.1   2.3
          2   |  2.5   2.6   2.6
          3   |  3.0   3.0   3.0
  B       1   |  2.5   2.7   2.9
          2   |  2.5   2.5   2.5
          3   |  3.0   3.0   3.0

# 2. Re-build multiindex for columns (there are various levels, showing only one here)
R = pd.concat([R],keys=["Ratio"],axis=1)
R
              | Ratio
         Year | 2018  2019  2020
Town   Sector | 
----   ------ | ----  ----  ----
  A       1   |  2.0   2.1   2.3
          2   |  2.5   2.6   2.6
          3   |  3.0   3.0   3.0
  B       1   |  2.5   2.7   2.9
          2   |  2.5   2.5   2.5
          3   |  3.0   3.0   3.0

# 3. Concat previous calculation to the main dataframe
df = pd.concat([df,R],axis=1,sort=True) # I need the sort=True to avoid a performance Warning
df
       Metric | Population         Homes              Ratio
         Year | 2018  2019  2020   2018  2019  2020   2018  2019  2020
Town   Sector |                                                       
----   ------ | ----  ----  ----   ----  ----  ----   ----  ----  ----
  A       1   |  100   110   120     50    52    52    2.0   2.1   2.3
          2   |  200   205   210     80    80    80    2.5   2.6   2.6
          3   |  300   300   300    100   100   100    3.0   3.0   3.0
  B       1   |   50    60    70     20    22    24    2.5   2.7   2.9
          2   |  100   100   100     40    40    40    2.5   2.5   2.5
          3   |  150   140   130     50    47    44    3.0   3.0   3.0

I can write the above expresions in a single line, but as I mentioned I have various index levels and it becomes complicated... Is there a way to do something simpler?
I would have guessed:

df["Ratio"] = df["Population"]/df["Homes"]

But it throws a "ValueError: Item must have length equal to number of levels."
Thanks!

ruben
  • 21
  • 4
  • 1
    kindly provide the source dataframe as a dictionary : ``df.to_dict('records')``. Easier to recreate since your source data is a MultiIndex. also add an example for the different index levels, as it is not obvious from the data shared – sammywemmy Jan 24 '22 at 02:47

2 Answers2

1

Let's do some dataframe reshaping like this:

import pandas as pd
from io import StringIO

# Create your input dataframe
csvdata = StringIO("""100   110   120     50    52    52
  200   205   210     80    80    80
  300   300   300    100   100   100
  50    60    70     20    22    24
  100   100   100     40    40    40
  150   140   130     50    47    44""")

df = pd.read_csv(csvdata, sep='\s\s+', header=None, engine='python')
df.index = pd.MultiIndex.from_product([[*'AB'],[1,2,3]])
df.columns = pd.MultiIndex.from_product(['Population Homes'.split(' '), 
                                         [2018, 2019, 2020]])

df_out=df.stack().eval('Ratio = Population / Homes').unstack().round(1)
print(df_out)

Output:

    Homes           Population           Ratio          
     2018 2019 2020       2018 2019 2020  2018 2019 2020
A 1    50   52   52        100  110  120   2.0  2.1  2.3
  2    80   80   80        200  205  210   2.5  2.6  2.6
  3   100  100  100        300  300  300   3.0  3.0  3.0
B 1    20   22   24         50   60   70   2.5  2.7  2.9
  2    40   40   40        100  100  100   2.5  2.5  2.5
  3    50   47   44        150  140  130   3.0  3.0  3.0

Using stack, eval and unstack.

Scott Boston
  • 147,308
  • 15
  • 139
  • 187
0

I've later found this thread, and it's led me in the right direction (or at least what I had in my mind...):

pandas dataframe select columns in multiindex

In order to operate with all columns under the same index level, I had in mind something like this (doesn't work):

df["Ratio"] = df["Population"]/df["Homes"]

From the above thread and pandas doc on slicers (https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#using-slicers) I got to the following expression, which does work:

df.loc[:,("Ratio",slice(None)] = df.loc[:,("Population",slice(None)] / df.loc[:,("Homes",slice(None)]

Changes needed were:

  • use .loc
  • within the .loc[...], first need to get all rows with the colon ":"
  • then use brackets to indicate the multiindex levels
  • and use slide(None) to get all values of the index at the last level (for the example above...)
ruben
  • 21
  • 4