
I wanted to add a unique id to my DataFrames, and I essentially succeeded using the approach I found here: Python Class Decorator. I know from https://github.com/pydata/pandas/issues/2485 that adding custom metadata is not yet explicitly supported, but decorators seemed like a workaround.

My decorated DataFrames return new and similarly decorated DataFrames when I use methods such as copy and groupby.agg. How can I have "all" pandas functions like pd.DataFrame() or pd.read_csv return my decorated DataFrames instead of original, undecorated DataFrames without decorating each pandas function individually? I.e., how can I have my decorated DataFrames replace the stock DataFrames?

Here's my code. First, I have an enhanced pandas module, wrapPandas.py.

import pandas  # needed below; `from pandas import *` does not bind the name `pandas`
from pandas import *
import numpy as np

def addId(cls):

    class withId(cls):

        def __init__(self, *args, **kwargs):
            super(withId, self).__init__(*args, **kwargs)
            self._myId = np.random.randint(0, 99999)

    return withId

pandas.core.frame.DataFrame = addId(pandas.core.frame.DataFrame)

Running the following snippet of code shows my DataFrame returning decorated DataFrames when I use methods such as .copy() and .groupby().agg(). I will then follow this up by showing that pandas functions such as pd.DataFrame don't return my decorated DataFrames (sadly though not surprisingly).

EDIT: added import statement per Jonathan Eunice's response.

import wrapPandas as pd

d = {
    'strCol': ['A', 'B', 'A', 'C', 'B', 'B', 'A', 'C', 'A'], 
    'intCol': [6,3,8,6,7,3,9,2,6], 
}

#create "decorated" DataFrame
dfFoo = pd.core.frame.DataFrame.from_records(d)
print("dfFoo._myId = {}".format(dfFoo._myId))

#new DataFrame with new ._myId
dfBat = dfFoo.copy()
print("dfBat._myId = {}".format(dfBat._myId))

#new binding for old DataFrame, keeps old ._myId
dfRat = dfFoo
print("dfRat._myId = {}".format(dfRat._myId))

#new DataFrame with new ._myId
dfBird = dfFoo.groupby('strCol').agg({'intCol': 'sum'})
print("dfBird._myId = {}".format(dfBird._myId))

#all of these new DataFrames have the same type, "withId"
print("type(dfFoo) = {}".format(type(dfFoo)))

And this yields the following results.

dfFoo._myId = 66622
dfBat._myId = 22527
dfRat._myId = 66622
dfBird._myId = 97593
type(dfFoo) = <class 'wrapPandas.withId'>

And the sad part. dfBoo._myId raises, of course, an AttributeError.

#create "stock" DataFrame
dfBoo = pd.DataFrame(d)
print(type(dfBoo))

#doesn't have a ._myId (I wish it did, though)
print(dfBoo._myId)
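(One contributing reason pd.DataFrame stays stock: `from pandas import *` copies the bindings that exist at import time, so rebinding `pandas.core.frame.DataFrame` afterwards never updates those copies. A minimal sketch with a hypothetical stand-in module, no pandas required:)

```python
import types

# Stand-in module emulating pandas at star-import time.
mod = types.ModuleType("mod")

class Stock:
    pass

mod.Thing = Stock

# Emulate `from mod import *`: the name is copied into this namespace.
Thing = mod.Thing

class Patched(Stock):
    pass

mod.Thing = Patched  # patch the source module afterwards

print(Thing is Patched)      # False: the star-imported copy still points at Stock
print(mod.Thing is Patched)  # True: only the module attribute was rebound
```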
MalcolmSBritton
1 Answer


Modify your monkey patch to:

pd.DataFrame = pandas.core.frame.DataFrame = addId(pandas.core.frame.DataFrame)

I.e., "latch on" to, or "monkey patch," two different names. This need to double-assign may seem odd, given that pandas.core.frame.DataFrame is pd.DataFrame. But you are not actually modifying the DataFrame class; you are injecting a proxy class. References that go through the proxy pick up the new behavior, while references bound directly to the original class do not. Fix that by pointing every name you might use at the proxy. Here's how it looks diagrammatically:

[diagram: pd.DataFrame and pandas.core.frame.DataFrame both pointing at the injected withId proxy, which wraps the original DataFrame class]
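The name divergence can be shown without pandas at all. A minimal sketch: two names start out bound to the same class, and rebinding only one leaves the other on the original.

```python
class Original:
    pass

name_a = Original          # stands in for pandas.core.frame.DataFrame
name_b = Original          # stands in for pd.DataFrame

class Proxy(Original):     # the injected subclass, like withId
    tagged = True

name_a = Proxy             # patch only one name...
print(name_a is name_b)    # False: the names have diverged

name_b = Proxy             # ...double-assign, as above
print(name_a is name_b)    # True: both names now reach the proxy
```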

I assume you also have an import pandas as pd somewhere in your file that's not shown, else your definition of dfBoo would fail with NameError: name 'pd' is not defined.

Monkey patching is dangerous for reasons like this. You're injecting things, and it's hard to know whether you've "caught all the references" or "patched everything you need to." I can't promise there aren't other calls in the code that reference these structures at a lower level, beyond the reach of this name rejiggering. But for the code displayed, it works!

Update You later asked how to make this work for pd.read_csv. Well, that's yet another of the places you might need to monkey patch. In this case, amend the patch code above to:

pd.DataFrame = pandas.io.parsers.DataFrame = pandas.core.frame.DataFrame = addId(pandas.core.frame.DataFrame)

Patching the name DataFrame inside pandas.io.parsers does the trick for read_csv. The same caveat applies: there could be (i.e., probably are) more uses you'd need to track down for full coverage.
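If you end up chasing aliases across several submodules, a small helper can rebind the name wherever it currently appears. A sketch using hypothetical stand-in modules (`patch_everywhere` is illustrative, not a pandas API):

```python
import types

def patch_everywhere(modules, name, replacement):
    # Rebind `name` in every module that defines it, so all aliases
    # end up pointing at the replacement (proxy) class.
    for mod in modules:
        if hasattr(mod, name):
            setattr(mod, name, replacement)

# Stand-ins for pandas and pandas.io.parsers:
fake_pandas = types.ModuleType("fake_pandas")
fake_parsers = types.ModuleType("fake_pandas.io.parsers")

class Original:
    pass

class Proxy(Original):
    pass

fake_pandas.DataFrame = fake_parsers.DataFrame = Original
patch_everywhere([fake_pandas, fake_parsers], "DataFrame", Proxy)

print(fake_pandas.DataFrame is Proxy)   # True
print(fake_parsers.DataFrame is Proxy)  # True
```

In real code you'd pass the actual submodules (pandas, pandas.core.frame, pandas.io.parsers, ...) rather than stand-ins, and you'd still have no guarantee you've found them all.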

Jonathan Eunice
  • That was very helpful. If I understand this correctly, in my code above, I was only monkey patching `pandas.core.frame.DataFrame`. Your recommendation is that I also monkey patch `pd.DataFrame`? When I implement that by changing the last line of wrapPandas.py to `DataFrame = pandas.core.frame.DataFrame = addId(pandas.core.frame.DataFrame)`, the call to `pd.DataFrame` works. I then tried pd.read_csv, and that didn't work, which is what you were warning me about, no? – MalcolmSBritton Apr 06 '15 at 17:37
  • Exactly. You aren't really changing the behavior of the core `DataFrame` class. Instead, you're inserting a proxy between that class and its users--including its internal users. But to be successful, that style of proxy insertion has to patch *all* the names that could possibly be instantiated for `DataFrame`. It's improbable that you'd catch them all, at least on your first time out. – Jonathan Eunice Apr 06 '15 at 18:08
  • `pd.read_csv` turns out to not even be a standard function. It's produced by a "parser function factory" called `_make_parser_function`. [See the code here.](https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py#L476-477) Given enough time, you might be able to track through the code, find where it's really building its df, and patch that. You've accomplished a lot for your short code, but going the rest of the way is more uphill. Wouldn't `id(df)` suffice? – Jonathan Eunice Apr 06 '15 at 18:20
  • I just updated the answer to indicate how to also patch `pandas.io.parsers.DataFrame` and get `read_csv` returning your proxy. You'll probably have to replicate that process for other `pandas` submodules to make your proxy universal. – Jonathan Eunice Apr 06 '15 at 18:30
  • Yes! id(df) seems to work great! Almost glad I didn't find it sooner as this was very educational. – MalcolmSBritton Apr 06 '15 at 19:04