0

I am quite new to pandas. Basically, I have 10 different type of data for different firms in 10 dfs. Eg Total Assets, AUM, etc.
For each type of data, there could have high or low importance: H, or L.
For each type of data, there could have 3 categories: Cat1, Cat2, Cat3.

For H importance, I need to analyse the data by the 3 categories. Same for L importance.

I am thinking of adding a mulit-index for each column of data after merging the 10 dfs. Is that possible?

Current State


**df_1**

      |Total Assets|
Firm 1| 100        |
Firm 2| 200        |
Firm 3| 300        |

**df_2**

      |AUMS    |
Firm 1| 300    |
Firm 2| 3400   |
Firm 3| 800    |
Firm 4| 800    |

and so on until df_10. Also the firms for all the df could differ.


Desired Output

**Merged_df**

Importance| L         | H    |
Category | Cat1       | Cat2 |
         |Total Assets| AUMs |
Firm 1   | 100        | 300  |
Firm 2   | 200        | 3400 |
Firm 3   | 300        | 800  |
Firm 4   | NaN        | 800  |


Next, I need to do a Groupby "Importance" and "Category". Any other solution besides Multi-indexing is welcome. Thank you!

Peter
  • 353
  • 3
  • 10

1 Answers1

1

We can concat on axis=1 with MultiIndex keys:

dfs = [df1, df2]
merged_df = pd.concat(
    dfs, axis=1,
    keys=pd.MultiIndex.from_arrays([
        ['L', 'H'],       # Top Level Keys
        ['Cat1', 'Cat2']  # Second Level Keys
    ], names=['Importance', 'Category'])
)

merged_df:

Importance            L     H
Category           Cat1  Cat2
           Total Assets  AUMS
Firm 1            100.0   300
Firm 2            200.0  3400
Firm 3            300.0   800
Firm 4              NaN   800

CategoricalDtype can be used to establish ordering:

dfs = [df1, df2]
# Specify Categorical Types
# These lists should contain _only_ the unique categories
# in the desired order
importance_type = pd.CategoricalDtype(categories=['H', 'L'], ordered=True)
category_type = pd.CategoricalDtype(categories=['Cat1', 'Cat2'], ordered=True)


# Keys should contain the _complete_ list of _all_ columns
merged_df = pd.concat(
    dfs, axis=1,
    keys=pd.MultiIndex.from_arrays([
        pd.Series(['L', 'H'],            # Top Level Keys
                  dtype=importance_type),
        pd.Series(['Cat1', 'Cat2'],      # Second Level Keys
                  dtype=category_type)
    ], names=['Importance', 'Category'])
)

Then sort_index can be used and it will work as expected. H before L, etc.

# Sorting Now Works As Expected
merged_df = merged_df.sort_index(level=[0, 1], axis=1)

merged_df:

Importance     H            L
Category    Cat2         Cat1
            AUMS Total Assets
Firm 1       300        100.0
Firm 2      3400        200.0
Firm 3       800        300.0
Firm 4       800          NaN

DataFrames:

import pandas as pd

df1 = pd.DataFrame({
    'Total Assets': {'Firm 1': 100, 'Firm 2': 200, 'Firm 3': 300}
})

df2 = pd.DataFrame({
    'AUMS': {'Firm 1': 300, 'Firm 2': 3400, 'Firm 3': 800, 'Firm 4': 800}
})
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
  • Thanks for the reply! 1) Is there a way to create the multi-index on an individual df like df_1? 2) How do I present the merged_df with re-ordering of the columns? I want all the Importance : H first, and then the CAT1, CAT2, CAT3. Finally Importance : L. Thanks! – Peter Jul 26 '21 at 03:09
  • So for (2) I've added a code block about specifying a categoricaldtype for custom ordering. Making `H` come before `L` for example. For (1) there's a bunch of examples in [MultiIndex / advanced indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#creating-a-multiindex-hierarchical-index-object). But something like -> `df1.columns = pd.MultiIndex.from_arrays([['L'], ['Cat1'], df1.columns])`. – Henry Ecker Jul 26 '21 at 03:19
  • Thanks for the inputs! I am using pandas 0.20.1 And I get error no attribute CategoricalDtype. So how can I re-order the columns using my ordered categories? Also I am passing in the the top level & 2nd level keys as a list. Eg impt=['L', 'H'] – Peter Jul 26 '21 at 06:54
  • There are lots of options in [Custom sorting in pandas dataframe](https://stackoverflow.com/q/13838405/15497888) many work for 0.20. You’d likely have to use Categorical instead. Map may also work. – Henry Ecker Jul 26 '21 at 12:40