I am sorry, I am aware the title is somewhat fuzzy.

Context

I am using a DataFrame to keep track of files, because pandas offers filtering tools that a dict does not: loc, pd.IndexSlice, .index, .columns, pd.MultiIndex... This may not look like the best choice to expert developers (which I am not), but these functions have been so handy that I have come to use a DataFrame for this. And, cherry on the cake, the __repr__ of a MultiIndex DataFrame is just perfect when I want to know what is inside my file list.

Quick introduction to Summary class, inheriting from DataFrame

Because my DataFrame, that I call 'Summary', has some specific functions, I would like to make it a class, inheriting from pandas DataFrame class. It also has 'fixed' MultiIndexes, for both rows and columns.

Finally, because my Summary class is defined outside the Store class which is actually managing file organization, Summary class needs a function from Store to be able to retrieve file organization.

Questions

The trouble with pd.DataFrame is that (AFAIK) you cannot append rows without creating a new DataFrame. Summary has a refresh function that rebuilds it from the folder content, so a refresh effectively 'resets' the Summary object. To manage the refresh, I first came up with code that does not work, and then with a second version that does.

import pandas as pd
import numpy as np

# Dummy function
def summa(a,b):
    return a+b

# Does not work
class DatF1(pd.DataFrame):

    def __init__(self,meth,data=None):
        cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'],['First', 'Last']])
        rmidx = pd.MultiIndex(levels=[[],[]], codes=[[],[]],
                              names=['Component','Interval'])
        super().__init__(data=data, index=rmidx, columns=cmidx, dtype=np.datetime64)
        self.meth=meth

    def refresh(self):
        values = [[pd.Timestamp('2020/02/10 8:00'),pd.Timestamp('2020/02/10 8:00')],
                  [pd.Timestamp('2020/02/11 8:00'),pd.Timestamp('2020/02/12 8:00')]]
        rmidx = pd.MultiIndex.from_arrays([['Comp1','Comp1'],['1h','1W']],names=['Component','Interval'])
        self = pd.DataFrame(values, index=rmidx, columns=self.columns)

ex1 = DatF1(summa)

In [10]: ex1.meth(3,4)
Out[10]: 7

ex1.refresh()
In [11]: ex1
Out[11]: Empty DatF1
         Columns: [(Index, First), (Index, Last)]
         Index: []

After refresh(), ex1 is still empty. refresh has not worked correctly.

# Works
class DatF2(pd.DataFrame):

    def __init__(self,meth,data=None):
        cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'],['First', 'Last']])
        rmidx = pd.MultiIndex(levels=[[],[]], codes=[[],[]],
                              names=['Component','Interval'])
        super().__init__(data=data, index=rmidx, columns=cmidx, dtype=np.datetime64)
        self.meth=meth

    def refresh(self):
        values = [[pd.Timestamp('2020/02/10 8:00'),pd.Timestamp('2020/02/10 8:00')],
                  [pd.Timestamp('2020/02/11 8:00'),pd.Timestamp('2020/02/12 8:00')]]
        rmidx = pd.MultiIndex.from_arrays([['Comp1','Comp1'],['1h','1W']],names=['Component','Interval'])
        super().__init__(values, index=rmidx, columns=self.columns)

ex2 = DatF2(summa)

In [10]: ex2.meth(3,4)
Out[10]: 7

ex2.refresh()
In [11]: ex2
Out[11]:                                  Index                    
                                          First                Last
         Component Interval                                        
         Comp1     1h       2020-02-10 08:00:00 2020-02-10 08:00:00
                   1W       2020-02-11 08:00:00 2020-02-12 08:00:00

This code works!

I have 2 questions:

  • Why is the 1st code not working? (I am sorry, this may be obvious, but I have no idea why it does not work.)

  • Is calling super().__init__ in my refresh method acceptable coding practice? (Or, rephrased: is it acceptable to call super().__init__ anywhere other than in the __init__ of my subclass?)

Thanks a lot for your help and advice. The world of class inheritance is for me quite new, and the fact that DataFrame content cannot be directly modified, so to say, seems to me to make it a step more difficult to handle.

Have a good day, Bests,

Error message when adding a new row

import pandas as pd
import numpy as np

# Dummy function
def new_rows():
    return [['Comp1','Comp1'],['1h','1W']]

# Does not work
class DatF1(pd.DataFrame):

    def __init__(self,meth,data=None):
        cmidx = pd.MultiIndex.from_arrays([['Index', 'Index'],['First', 'Last']])
        rmidx = pd.MultiIndex(levels=[[],[]], codes=[[],[]],
                          names=['Component','Interval'])
        super().__init__(data=data, index=rmidx, columns=cmidx, dtype=np.datetime64)
        self.meth=meth

    def refresh(self):
        values = [[pd.Timestamp('2020/02/10 8:00'),pd.Timestamp('2020/02/10 8:00')],
                  [pd.Timestamp('2020/02/11 8:00'),pd.Timestamp('2020/02/12 8:00')]]
        rmidx = self.meth()
        self[rmidx] = values

ex1 = DatF1(new_rows)
ex1.refresh()

KeyError: "None of [MultiIndex([('Comp1', 'Comp1'),\n            (   '1h',    '1W')],\n           names=['Component', 'Interval'])] are in the [index]"
pierre_j

1 Answer

Answers to your questions

Why is the 1st code not working?

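The line self = pd.DataFrame(...) in refresh() does not modify your object. Inside a method, self is an ordinary local name: assigning to it only makes that name point to a new DataFrame, which is discarded when refresh() returns. The object that ex1 refers to is never touched, which is why you get no error and ex1 is still empty.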
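
To illustrate why an assignment to self inside a method has no effect on the caller's object, here is a small sketch in plain Python. The Box class and its method names are hypothetical and only stand in for the DataFrame subclass.

```python
class Box:
    """Hypothetical stand-in for the DataFrame subclass."""

    def __init__(self, items):
        self.items = list(items)

    def rebind(self):
        # `self` is just a local name here: this line points it at a new
        # object and leaves the instance held by the caller unchanged.
        self = Box([1, 2, 3])

    def mutate(self):
        # Changing the state of the existing object is visible to the caller.
        self.items[:] = [1, 2, 3]


box = Box([])
box.rebind()
after_rebind = list(box.items)   # the caller's object was not modified
box.mutate()
after_mutate = list(box.items)   # the caller's object now holds new values
```

rebind() mirrors the refresh() of DatF1, while mutate() changes the object in place.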

Is calling super().__init__ in my refresh method acceptable coding practice?

Maybe a legitimate use case exists for calling super().__init__ outside the __init__() method, but your case is not one of them. The parent class has already been initialized in your __init__(). Why do it again?
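
One possible way to avoid re-running the parent initializer is composition: keep the DataFrame as an attribute and replace it on refresh. This is only a sketch under assumptions. The Summary wrapper, the df attribute, and scan_files (a stand-in for the Store function that reads the folder) are hypothetical names, and the values are the ones from the question.

```python
import pandas as pd

COLUMNS = pd.MultiIndex.from_arrays([["Index", "Index"], ["First", "Last"]])
ROW_NAMES = ["Component", "Interval"]


def scan_files():
    """Hypothetical stand-in for the Store function that lists files."""
    keys = [("Comp1", "1h"), ("Comp1", "1W")]
    values = [
        [pd.Timestamp("2020-02-10 08:00"), pd.Timestamp("2020-02-10 08:00")],
        [pd.Timestamp("2020-02-11 08:00"), pd.Timestamp("2020-02-12 08:00")],
    ]
    return keys, values


class Summary:
    """Wraps a DataFrame instead of inheriting from it."""

    def __init__(self, meth):
        self.meth = meth
        empty = pd.MultiIndex.from_arrays([[], []], names=ROW_NAMES)
        self.df = pd.DataFrame(index=empty, columns=COLUMNS)

    def refresh(self):
        # Build a fresh frame and swap it in; the wrapper object itself
        # stays the same, so every reference to it sees the new content.
        keys, values = self.meth()
        index = pd.MultiIndex.from_tuples(keys, names=ROW_NAMES)
        self.df = pd.DataFrame(values, index=index, columns=COLUMNS)

    def __repr__(self):
        # Reuse the DataFrame representation for quick inspection.
        return repr(self.df)


summary = Summary(scan_files)
summary.refresh()
```

All the usual filtering remains available through summary.df, for example summary.df.loc[...].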


A better solution

The solution to your problem is unexpectedly simple, because you can append a row to a DataFrame by assigning to a new label with .loc:

df.loc['new_row'] = [value_1, value_2, ...]

Or, in your case with a MultiIndex (see this SO post), using a (Component, Interval) tuple as the row label:

  df.loc[('Comp1', '1h'), :] = [pd.Timestamp('2020/02/10 8:00'), pd.Timestamp('2020/02/10 8:00')]
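
As a minimal sketch of adding a row by label, the snippet below starts from a frame that already holds one row and adds a second one with .loc. The index layout and values are taken from the question. Starting from a non-empty frame is an assumption made here to keep the example simple.

```python
import pandas as pd

cmidx = pd.MultiIndex.from_arrays([["Index", "Index"], ["First", "Last"]])
rmidx = pd.MultiIndex.from_tuples(
    [("Comp1", "1h")], names=["Component", "Interval"]
)
df = pd.DataFrame(
    [[pd.Timestamp("2020-02-10 08:00"), pd.Timestamp("2020-02-10 08:00")]],
    index=rmidx,
    columns=cmidx,
)

# The row label is a full (Component, Interval) tuple; because it is not
# in the index yet, .loc enlarges the frame with a new row.
df.loc[("Comp1", "1W"), :] = [
    pd.Timestamp("2020-02-11 08:00"),
    pd.Timestamp("2020-02-12 08:00"),
]
```

Note that the row key goes inside .loc[...]. Plain df[...] selects columns, which is consistent with the KeyError shown in the question's addendum.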

Best practice

You should not inherit from pd.DataFrame. If you want to extend pandas use the documented API.
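
One part of that documented API is the accessor mechanism, pd.api.extensions.register_dataframe_accessor. The sketch below shows how custom methods could be attached to a plain DataFrame this way. The accessor name summary and the components method are hypothetical, chosen only for illustration.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("summary")
class SummaryAccessor:
    """Hypothetical accessor; pandas passes in the DataFrame it is used on."""

    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def components(self):
        # Distinct values of the "Component" row level, in order of appearance.
        return list(self._obj.index.get_level_values("Component").unique())


rmidx = pd.MultiIndex.from_tuples(
    [("Comp1", "1h"), ("Comp1", "1W")], names=["Component", "Interval"]
)
cmidx = pd.MultiIndex.from_arrays([["Index", "Index"], ["First", "Last"]])
df = pd.DataFrame(
    [
        [pd.Timestamp("2020-02-10 08:00"), pd.Timestamp("2020-02-10 08:00")],
        [pd.Timestamp("2020-02-11 08:00"), pd.Timestamp("2020-02-12 08:00")],
    ],
    index=rmidx,
    columns=cmidx,
)

names = df.summary.components()
```

The accessor does not fix the row and column layout by itself; the frame still has to be built with the desired MultiIndexes, for instance in a small factory function.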

above_c_level
  • Hello, thanks for the reply. I actually tried to append new rows, and I have pasted the code above as an addendum, together with the error message; I cannot make it work. Do you have any idea what is wrong? Regarding best practices, thanks a lot, I did not know about them. I see that I can add methods this way, such as the 'refresh' method, but I could not find in this documentation where I can format the MultiIndex of my columns and rows. Do you have any additional pointers? Thanks a lot for your help! Really appreciated! – pierre_j Jun 21 '20 at 15:00
  • Coming back to the "best practice": I have read the link you provide several times. Nowhere do I see a way to extend a DataFrame with methods that actually modify it, I am sorry. So it does seem I am forced to use inheritance. Am I wrong? – pierre_j Jun 21 '20 at 15:19
  • I think you don't need any inheritance or extending of your dataframe at all. The "standard" DataFrame can be changed without the need of a refresh() method. I have updated my answer, because I did not account for the MultiIndex. The syntax is slightly different. – above_c_level Jun 21 '20 at 15:35