I wanted to add a unique id to my DataFrames, and I essentially succeeded by using the approach from the question "Python Class Decorator". I know from https://github.com/pydata/pandas/issues/2485 that adding custom metadata is not yet explicitly supported, but decorators seemed like a workaround.
My decorated DataFrames return new and similarly decorated DataFrames when I use methods such as copy and groupby.agg. How can I have "all" pandas functions like pd.DataFrame() or pd.read_csv return my decorated DataFrames instead of original, undecorated DataFrames without decorating each pandas function individually? I.e., how can I have my decorated DataFrames replace the stock DataFrames?
Here's my code. First, I have an enhanced pandas module, wrapPandas.py.
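from pandas import *
import pandas  # bind the module name explicitly; `import *` may not export it
import numpy as np


def addId(cls):
    """Class decorator: return a subclass of cls whose instances get a random ._myId."""
    class withId(cls):
        def __init__(self, *args, **kargs):
            super(withId, self).__init__(*args, **kargs)
            self._myId = np.random.randint(0, 99999)
    return withId


# replace the DataFrame class inside pandas.core.frame with the decorated subclass
pandas.core.frame.DataFrame = addId(pandas.core.frame.DataFrame)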
Running the following snippet shows that my decorated DataFrame returns decorated DataFrames from methods such as .copy() and .groupby().agg(). After that, I show that pandas functions such as pd.DataFrame don't return decorated DataFrames (sadly, though not surprisingly).
EDIT: added import statement per Jonathan Eunice's response.
import wrapPandas as pd

d = {
    'strCol': ['A', 'B', 'A', 'C', 'B', 'B', 'A', 'C', 'A'],
    'intCol': [6, 3, 8, 6, 7, 3, 9, 2, 6],
}

# create "decorated" DataFrame
dfFoo = pd.core.frame.DataFrame.from_records(d)
print("dfFoo._myId = {}".format(dfFoo._myId))

# new DataFrame with new ._myId
dfBat = dfFoo.copy()
print("dfBat._myId = {}".format(dfBat._myId))

# new binding for old DataFrame, keeps old ._myId
dfRat = dfFoo
print("dfRat._myId = {}".format(dfRat._myId))

# new DataFrame with new ._myId
dfBird = dfFoo.groupby('strCol').agg({'intCol': 'sum'})
print("dfBird._myId = {}".format(dfBird._myId))

# all of these new DataFrames have the same type, "withId"
print("type(dfFoo) = {}".format(type(dfFoo)))
And this yields the following results.
dfFoo._myId = 66622
dfBat._myId = 22527
dfRat._myId = 66622
dfBird._myId = 97593
type(dfFoo) = <class 'wrapPandas.withId'>
And the sad part: dfBoo._myId raises, of course, an AttributeError.
# create "stock" DataFrame
dfBoo = pd.DataFrame(d)
print(type(dfBoo))

# doesn't have a ._myId (I wish it did, though)
print(dfBoo._myId)
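
Here is a minimal sketch of what I suspect is going on, assuming the cause is ordinary Python name binding rather than anything pandas-specific. It uses no pandas at all: `frame`, `Frame`, `StockAlias`, and `add_id` are hypothetical stand-ins for pandas.core.frame, DataFrame, the name that `from pandas import *` binds, and my addId decorator. It also uses id(self) instead of a random number just to keep the sketch dependency-free.

```python
import types

# Hypothetical stand-in for the module pandas.core.frame (not the real thing).
frame = types.ModuleType("frame")


class Frame:
    """Stand-in for the stock DataFrame class."""


frame.Frame = Frame

# Rough equivalent of `from frame import *`: the name is bound to whatever
# object frame.Frame refers to right now.
StockAlias = frame.Frame


def add_id(cls):
    """Class decorator in the style of addId: subclass cls and attach an id."""
    class WithId(cls):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self._my_id = id(self)
    return WithId


# Rough equivalent of `pandas.core.frame.DataFrame = addId(...)`:
# this rebinds the attribute on the module only.
frame.Frame = add_id(frame.Frame)

patched = frame.Frame()   # looked up through the module, so it is decorated
stock = StockAlias()      # the earlier binding still points at the stock class

print(hasattr(patched, "_my_id"))
print(hasattr(stock, "_my_id"))
```

If this sketch reflects the real situation, the attribute on the module is rebound, but any name that was copied out of it earlier keeps pointing at the original class, which would match what I see with pd.DataFrame.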