3

I would like to store the summary of a local set of DataFrames into a "meta DataFrame" using pd.MultiIndex.

Basically, row-axis has two levels, and column-axis also. In the class managing the set of DataFrames, I define as a class variable this "Meta DataFrame".

import pandas as pd

row_axis = pd.MultiIndex(levels=[[],[]], codes=[[],[]], names=['Data', 'Period'])
column_axis = pd.MultiIndex(levels=[[],[]], codes=[[],[]], names=['Data', 'Extrema'])
MD = pd.DataFrame(index=row_axis, columns=column_axis)

It seems to work.

MD.index
>>> MultiIndex([], names=['Data', 'Period'])

MD.columns
>>> MultiIndex([], names=['Data', 'Extrema'])

Now, each time I process an individual DataFrame id, I want to update this "Meta DataFrame" accordingly. id has a DateTimeIndex with period '5m'.

id.index[0]
>>> Timestamp('2020-01-01 08:00:00')

id.index[-1]
>>> Timestamp('2020-01-02 08:00:00')

I want to keep in MD its first and last index values for instance.

MD.loc[[('id', '5m')],[('Timestamp', 'First')]] = id.index[0]
MD.loc[[('id', '5m')],[('Timestamp', 'Last')]] = id.index[-1]

This doesn't work, I get following error message:

TypeError: unhashable type: 'list'

In the end, the result I would like is to have in MD following type of info (I am having other id DataFrames with different periods) :

           Timestamp
           First                   Last
id    5m   2020-01-01 08:00:00     2020-01-02 08:00:00
     10m   2020-01-05 08:00:00     2020-01-06 18:00:00

Ultimately, I will also keep min and max of some columns in id. For instance if id has a column 'Temperature'.

           Timestamp                                     Temperature
           First                Last                     Min    Max
id    5m   2020-01-01 08:00:00  2020-01-02 08:00:00      -2.5   10
     10m   2020-01-05 08:00:00  2020-01-06 18:00:00      4      15

These values will be recorded when I record id.

I am aware initializing a DataFrame cell per cell is not time efficient, but it will not be done that often.

Besides, I don't see how I can manage this organization of information in a Dict, which is why I am considering doing it with a multi-level DataFrame. I will then dump it in a csv file to store these "meta data".

Please, what is the right way to initialize each of these values in MD?

I thank you for your help! Bests,

pierre_j
  • 895
  • 2
  • 11
  • 26

1 Answers1

2

Instead of filling an empty DataFrame you can store the data in a dict of dicts. A MultiIndex uses tuples as the index values so we make the keys of each dictionary tuples.

The outer Dictionary uses the column MultiIndex tuples as keys and the values are another dictionary with the row MultiIndex tuples as keys and the value that goes in a cell as the value.

d = {('Score', 'Min'):       {('id1', '5m'): 72, ('id1', '10m'): -18},
     ('Timestamp', 'First'): {('id1', '5m'): 1, ('id1', '10m'): 2},
     ('Timestamp', 'Last'):  {('id1', '5m'): 10, ('id1', '10m'): 20}}
     #        |                     |                            |
     #  Column MultiIndex       Row Multi                    Cell Value
     #       Label                Label     

pd.DataFrame(d)

        Score Timestamp     
          Min     First Last
id1 5m     72         1   10
    10m   -18         2   20

Creating that dict will depend upon how you get the values. You can extend a dict with update

ALollz
  • 57,915
  • 7
  • 66
  • 89
  • Thanks for your support ALollz, however, I am not sure what to do with what you propose. I get value one by one. So even if I can do what you propose for the first value, then when I have a second value to manage and that I have already turned my dict into a dataframe, how do I update the DataFrame then? I am actually back on my initial question: how to add and/or modifiy a cell in a multi-index DataFrame. – pierre_j Apr 30 '20 at 16:07
  • @pierre_j it's difficult without knowing how you're getting these individual values. But the dict should be no different from the MultiIndex. For instance starting with `d={}` you could add a value like `d[('Score', 'Min')] = {('id', '5m'): 72}`. Then if you needed to add another row you'd do something like `d[('Score', 'min')].update({('id', '10m'): 5})` Ideally you're not typing this out every time and instead you'd have some programatic way to reference the row and Column MultiIndex labels so that the dict could be built in a very small number of lines. – ALollz Apr 30 '20 at 16:22
  • On the other hand, if these are coming from different DataFrames, there's probably a *much* easier way to get all of the information you need and then you can `concat` using `keys` to get everything labeled. Again this all requires a bit of thoughtful organization beforehand to keep things manageable and self-naming. – ALollz Apr 30 '20 at 16:24
  • The update of this multi index DataFrame will occur whenever I record the individual DataFrame in a file. If I record an individual DataFrame in a file, it is because I have either created it, or extended it. At this time, the values I want to keep in the multi-index DataFrame may have changed (the last DateTimeIndex value for instance, or a min or max of a column) So at this time, either I need to add new values in multi-index DataFrame, or to modify existing ones. – pierre_j Apr 30 '20 at 16:28
  • I don't have all the values at once. The mulit-index DataFrame is way to store "living" data – pierre_j Apr 30 '20 at 16:29