2

I am ignoring the warnings and trying to subclass a pandas DataFrame. My reasons for doing so are as follows:

  • I want to retain all the existing methods of DataFrame.
  • I want to set a few additional attributes at class instantiation, which will later be used to define additional methods that I can call on the subclass.

Here's a snippet:

class SubFrame(pd.DataFrame):

    def __init__(self, *args, **kwargs):
        freq = kwargs.pop('freq', None)
        ddof = kwargs.pop('ddof', None)
        super(SubFrame, self).__init__(*args, **kwargs)
        self.freq = freq
        self.ddof = ddof
        self.index.freq = pd.tseries.frequencies.to_offset(self.freq)

    @property
    def _constructor(self):
        return SubFrame

Here's a use example. Say I have the DataFrame

print(df)
               col0     col1     col2
2014-07-31  0.28393  1.84587 -1.37899
2014-08-31  5.71914  2.19755  3.97959
2014-09-30 -3.16015 -7.47063 -1.40869
2014-10-31  5.08850  1.14998  2.43273
2014-11-30  1.89474 -1.08953  2.67830

where the index has no frequency

print(df.index)
DatetimeIndex(['2014-07-31', '2014-08-31', '2014-09-30', '2014-10-31',
               '2014-11-30'],
              dtype='datetime64[ns]', freq=None)

Using SubFrame allows me to specify that frequency in one step:

sf = SubFrame(df, freq='M')
print(sf.index)
DatetimeIndex(['2014-07-31', '2014-08-31', '2014-09-30', '2014-10-31',
               '2014-11-30'],
              dtype='datetime64[ns]', freq='M')

The issue is, this modifies df:

print(df.index.freq)
<MonthEnd>

What's going on here, and how can I avoid this?

Moreover, I profess to using copied code that I don't understand all that well. What is happening within __init__ above? Is it necessary to use args/kwargs with pop here? (Why can't I just specify params as usual?)

Brad Solomon
  • 38,521
  • 31
  • 149
  • 235
  • The typical use-case for subclassing is to modify/extend the base class functionality in some way. You're not really doing that here. You're simply setting up your DataFrame in a specific manner. Rather than subclass, you might want to simply create a factory type function/object that simply returns a DataFrame constructed in the way that you want. Subclassing doesn't seem to make a lot of sense for this particular use-case. – clockwatcher Jul 21 '17 at 20:48
  • "create a factory type function/object that simply returns a DataFrame constructed in the way that you want." Can you elaborate on that please? Right now what I have is a standard `Class(object)` where the dataframe is an attribute. And yes, I am extending the functionality by defining more than a handful of other methods, not shown here. – Brad Solomon Jul 21 '17 at 20:51
  • 1
    Look at piRSquared's second suggestion with the pipe. Notice he's not subclassing. He's creating a function that simply returns a DataFrame in the way you want it set up. There's no need to subclass. You're not changing behavior. One of the uses of a factory class is to create the object set up in the way you want it. If you want to base a new DataFrame on a copy of the first, you would create a function that takes the existing DataFrame as a parameter, copies it, adds your frequency, and then returns the copy. No subclass. – clockwatcher Jul 21 '17 at 21:02

1 Answers1

3

I'll add to the warnings. Not that I want to discourage you, I actually applaud your efforts.

However, this won't the last of your questions as to what is going on.

That said, once you run:

super(SubFrame, self).__init__(*args, **kwargs)

self is a bone-fide dataframe. You created it by passing another dataframe to the constructor.

Try this as an experiment

d1 = pd.DataFrame(1, list('AB'), list('XY'))
d2 = pd.DataFrame(d1)

d2.index.name = 'IDX'

d1

     X  Y
IDX      
A    1  1
B    1  1

So the observed behavior is consistent, in that when you construct one dataframe by passing another dataframe to the constructor, you end up pointing to the same objects.

To answer your question, subclassing isn't what is allowing the mutating of the original object... its the way pandas constructs a dataframe from a passed dataframe.

Avoid this by instantiating with a copy

d2 = pd.DataFrame(d1.copy())

What's going on in the __init__

You want to pass on all the args and kwargs to pd.DataFrame.__init__ with the exception of the specific kwargs that are intended for your subclass. In this case, freq and ddof. pop is a convenient way to grab the values and delete the key from kwargs before passing it on to pd.DataFrame.__init__


How I'd implement pipe

def add_freq(df, freq):
    df = df.copy()
    df.index.freq = pd.tseries.frequencies.to_offset(freq)
    return df

df = pd.DataFrame(dict(A=[1, 2]), pd.to_datetime(['2017-03-31', '2017-04-30']))

df.pipe(add_freq, 'M')
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • I follow you except for "Avoid this by instantiating with a copy." Where would `copy()` go within `super(SubFrame, self) ...` and why? I would've suspected `self.copy` based on how `super` works but that's throwing error – Brad Solomon Jul 21 '17 at 20:39
  • I'm not sure I'd make `SubFrame.__init__` force a copy (read that as I'm truly not sure. I could swayed either way). Because you'd be changing something that pandas does. Instead, I'd change how you created your variable `sf`: `sf = SubFrame(df, freq='M')` to this instead `sf = SubFrame(df.copy(), freq='M')` – piRSquared Jul 21 '17 at 20:42
  • Okay--and there are 2 alternatives to subclassing suggested [here](http://pandas.pydata.org/pandas-docs/stable/internals.html#subclassing-pandas-data-structures)--would you say that either of those apply to what I'm trying to do in this specific case? – Brad Solomon Jul 21 '17 at 20:43
  • I'm not certain what composition is as I've not studied many CS concepts. But I believe composition is what I do. I create a new object class in which the dataframe is an attribute. I manage everything else that way. I believe that `pipe` will do the trick for you... I'll update my post with an example. – piRSquared Jul 21 '17 at 20:48
  • Ahh okay. Then composition is what I have now (before I tried subclassing). Will probably end up using `pipe` here, appreciate the edit – Brad Solomon Jul 21 '17 at 21:04