0

An existing question, Creating a Pandas Dataframe with Multi Column Index, deals with a very "regular" DataFrame where all columns and rows are products and all data is present.

My situation is, alas, different. I have this kind of data:

[{"street": "Euclid", "house":42, "area":123, (1,"bedrooms"):1, (1,"bathrooms"):4},
 {"street": "Euclid", "house":19, "area":234, (2,"bedrooms"):3, (2,"bathrooms"):3},
 {"street": "Riemann", "house":42, "area":345, (1,"bedrooms"):5,
  (1,"bathrooms"):2, (2,"bedrooms"):12, (2, "bathrooms"):17},
 {"street": "Riemann", "house":19, "area":456, (1,"bedrooms"):7, (1,"bathrooms"):1}]

and I want this sort of DataFrame with both rows and columns having multi-level indexes:

              area          1                  2
street house        bedrooms bathrooms bedrooms bathrooms
Euclid  42    123     1         4
Euclid  19    234                         3         3
Riemann 42    345     5         2        12        17
Riemann 19    456     7         1

So, the row index should be

MultiIndex([("Euclid",42),("Euclid",19),("Riemann",42),("Riemann",19)],
           names=["street","house"])

and the columns index should be

MultiIndex([("area",None),(1,"bedrooms"),(1,"bathrooms"),(2,"bedrooms"),(2,"bathrooms")],
           names=["floor","entity"])

and I see no way to generate these indexes from the list of dictionaries I have.

rpanai
  • 12,515
  • 2
  • 42
  • 64
sds
  • 58,617
  • 29
  • 161
  • 278

1 Answers1

1

i feel there should be something better than this; hopefully someone on SO puts out sth much better:

Create a function to process each entry in the dictionary:

def process(entry):
    #read in data and get the keys to be the column names
    m = pd.DataFrame.from_dict(entry,orient='index').T
    #set index
    m = m.set_index(['street','house'])
    #create multi-index columns
    col1 = [ent[0] if isinstance(ent,tuple) else ent for ent in m.columns ]
    col2 = [ent[-1] if isinstance(ent,tuple) else None for ent in m.columns ]
    #assign multi-index column to m
    m.columns=[col1,col2]
    return m

Apply function above to data(i wrapped the dictionary into the data variable):

res = [process(entry) for entry in data]

concatenate to get final output

pd.concat(res)

                area               1                  2
                NaN    bedrooms bathrooms   bedrooms    bathrooms
street  house                   
Euclid    42    123     1        4           NaN         NaN
          19    234     NaN      NaN         3           3
Riemann   42    345     5        2           12          17
          19    456     7        1           NaN         NaN
sammywemmy
  • 27,093
  • 4
  • 17
  • 31