2

I'm a novice in python&pandas, could you please advise me if it is possible to ungroup and unpivot such dataframe?

The groups in source data are in the only column marked by prefix spaces and look like

import pandas as pd
import numpy
df = pd.DataFrame([
    ['Costs', numpy.nan, numpy.nan, numpy.nan],
    ['  Vehicles', numpy.nan, numpy.nan, numpy.nan],
    ['    Cars', numpy.nan, numpy.nan, numpy.nan],
    ['      BMW', 1000, 1100, 1010],
    ['      Toyota', 1203, 1302, 1322],
    ['    Cars - Total', 2203, 2402, 2332],
    ['    Trucks', numpy.nan, numpy.nan, numpy.nan],
    ['      Volvo', 5000, 5001, 5010],
    ['      MAN', 5500, 5055, 5066],
    ['    Trucks - Total', 10500, 10056, 10076],
    ['  Vehicles - Total', 12703, 12458, 12408],
    ['  Crew', numpy.nan, numpy.nan, numpy.nan],
    ['    Gomez Addams', 10000, 10000, 10000],
    ['    Morticia Addams', 10000, 10000, 10000],
    ['  Crew - Total', 20000, 20000, 20000],
    ['Costs - Total', 32703, 32458, 32408],
    ],    
    columns=['Level', 'Q1_2019', 'Q2_2019', 'Q3_2019'])

I need to convert it to a table like

Level, Sublevel1, Sublevel2, Sublevel3, Sublevel4, Date, Value
"Costs", "Vehicles", "Cars", "BMW", "Q1_2019", 1000
"Costs", "Crew", "Gomez Addams", , "Q1_2019", 10000

For now I've creates extra "Sublevel" columns, backfilled them by regex, then fill the Sublevels gaps row-by-row, then apply melt(). Can it be done more pythonish?

evesq
  • 21
  • 2
  • could you do a print(df) or a df.to_dict(). It would make it easier for others to contribute. https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – sammywemmy Jan 27 '20 at 12:43
  • Since the amount of space are important in your question, I would advice you to provide an example dataframe constructed by `pd.DataFrame` so that there cannot be mistakes when people try to copy and paste the dataset you have now. – Erfan Jan 27 '20 at 12:44
  • `pd.read_csv(StringIO(d),sep='|',skiprows=1).iloc[:,1:-1].dropna(how='all')` using StringIO reads the dataframe correctly with the spaces. `d` is the table held in a multiline string – Umar.H Jan 27 '20 at 12:48
  • 1
    Not working for me @Datanovice, still it should be best practice to include your dataset in a manner people can easily reproduce your problem. For example `pd.DataFrame` or `df.to_dict` – Erfan Jan 27 '20 at 12:52
  • @Erfan did you copy the table into your editor as well? i did that first and it in a multi line string `"""table"""` just tested again on a fresh virtual env and it worked fine – Umar.H Jan 27 '20 at 13:00

1 Answers1

0

Probably a neater way to do this, but the idea is to filter out the groups by using Total then use a back and fowards fill.

we then drop anything in group with less than 1 occurrence and melt by the levels 1-2

df['sub_level'] = df['Level'].str.count('\s+')

df.loc[df["Level"].str.contains("Total"), "group"] = (
    df["Level"].str.strip().str.replace("- Total", "")
)

df['group'] = df['group'].bfill().ffill()

df = df[df.groupby('group')['group'].transform('count') > 1].dropna(how='any')

final_df = pd.melt(
    df.loc[df["sub_level"].isin([1, 2])].drop("sub_level", axis=1), id_vars=["Level",'group']
)

final_df.columns = ['Level','Type','Date','Value']

print(final_df)

                  Level     Type     Date    Value
0                   BMW    Cars   Q1_2019   1000.0
1                Toyota    Cars   Q1_2019   1203.0
2                 Volvo  Trucks   Q1_2019   5000.0
3                   MAN  Trucks   Q1_2019   5500.0
4          Gomez Addams    Crew   Q1_2019  10000.0
5       Morticia Addams    Crew   Q1_2019  10000.0
6                   BMW    Cars   Q2_2019   1100.0
7                Toyota    Cars   Q2_2019   1302.0
8                 Volvo  Trucks   Q2_2019   5001.0
9                   MAN  Trucks   Q2_2019   5055.0
10         Gomez Addams    Crew   Q2_2019  10000.0
11      Morticia Addams    Crew   Q2_2019  10000.0
12                  BMW    Cars   Q3_2019   1010.0
13               Toyota    Cars   Q3_2019   1322.0
14                Volvo  Trucks   Q3_2019   5010.0
15                  MAN  Trucks   Q3_2019   5066.0
16         Gomez Addams    Crew   Q3_2019  10000.0
17      Morticia Addams    Crew   Q3_2019  10000.0
Umar.H
  • 22,559
  • 7
  • 39
  • 74