
I'm trying to implement a df.name attribute for my DataFrames. I have a lot of reasons to do this, and to store other metadata in my class that inherits from pd.DataFrame, but I won't get into that here; I'm using 'name' as an example of metadata. If this doesn't work for something this simple, I'll use a different approach entirely.

Research: DataFrame.name won't survive pickling. The only built-in option is the experimental df.attrs['name'], which I don't trust, and I don't like how it is accessed. I'll consider that option if this doesn't work. See "Get the name of a pandas DataFrame".

There's a discussion about adding the df.name attribute on Pandas GitHub: https://github.com/pandas-dev/pandas/issues/447#issuecomment-10949838 But it's gridlocked based on disagreement about use cases and implementation difficulties. No movement in 11 years...

How to pickle yourself: I'm trying to override the df.to_pickle() and pd.read_pickle() methods based on the example in "How to pickle yourself?":

import pickle

import pandas as pd


class NamedDataFrame(pd.DataFrame):
    '''
    a dataframe with a name
    '''
    def __init__(self, data=None, index=None, columns=None, dtype=None, copy=False, name: str = None):
        super().__init__(data, index, columns, dtype, copy)
        self.name = name
    
    #override the pickling methods to include the name
    def to_pickle(self, path, compression='infer', protocol=4):
        print("pickling myself")
        with open(path, 'wb') as f:
            pickle.dump(self, f, protocol)

    @classmethod
    def read_pickle(cls, path):
        with open(path, 'rb') as f:
            return pickle.load(f)

But it doesn't work:

>>> ndf = NamedDataFrame(data=mydf,name='mytestname')
>>> ndf.name
mytestname

>>> ndf.to_pickle(mypath)
pickling myself

>>> pndf = NamedDataFrame.read_pickle(mypath)
>>> pndf
(shows dataframe output to confirm reading from pickle worked)

>>> pndf.name
AttributeError: 'NamedDataFrame' object has no attribute 'name'

What gives? It seems like I'm missing something huge here on how pickling works, and I'd like to understand what I'm missing, and hopefully find a solution to this problem.

turbonate
  • I also tried using dill in case the issue was with the interpreter session: `import dill as pickle`. That didn't work either. – turbonate Feb 26 '23 at 10:53
  • I tried your code and can't reproduce the issue. In my experiment the attribute was set. I used `pandas 1.0.1`. – mosc9575 Feb 26 '23 at 10:56

1 Answer

There is no need to override any methods. The recommended approach when subclassing a pandas DataFrame is to list every user-defined (custom) attribute in the class-level _metadata list. The names in this list are persisted through pickling and unpickling.
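To illustrate why an attribute can vanish even when pickle.dump(self, f) is called directly, here is a minimal standard-library sketch. It is a simplified analogy, not pandas' actual code: the classes Whitelisted and Named and the _persisted list are hypothetical stand-ins. The assumption is that pandas objects, like this toy class, define their own __getstate__ that saves only their internal data plus the names registered in _metadata.

```python
import pickle


class Whitelisted:
    """Toy class that decides for itself which attributes get pickled."""

    # Hypothetical stand-in for pandas' `_metadata` list.
    _persisted = ["values"]

    def __init__(self, values, name=None):
        self.values = values
        self.name = name

    def __getstate__(self):
        # Only attributes named in `_persisted` are written to the pickle.
        return {key: getattr(self, key, None) for key in self._persisted}

    def __setstate__(self, state):
        # Unpickling restores exactly what __getstate__ saved;
        # __init__ is not called again.
        self.__dict__.update(state)


class Named(Whitelisted):
    # Registering "name" is the analogue of `_metadata = ['name']`.
    _persisted = Whitelisted._persisted + ["name"]


lost = pickle.loads(pickle.dumps(Whitelisted([1, 2, 3], name="foo")))
kept = pickle.loads(pickle.dumps(Named([1, 2, 3], name="foo")))

print(hasattr(lost, "name"))  # False
print(kept.name)              # foo
```

In this sketch the unregistered attribute is dropped silently at dump time, which matches the AttributeError seen after reading the pickle back.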

Solution

import pandas as pd


class NamedDataFrame(pd.DataFrame):
    _metadata = ['name'] # !important
    
    def __init__(self, *args, name: str, **kwargs):
        super().__init__(*args, **kwargs)
        self.name = name

Worked out example

df = NamedDataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']}, name='foo')
print(df)

#    a  b
# 0  1  x
# 1  2  y
# 2  3  z

# write the dataframe
df.to_pickle('foo_data.pkl')

# read the dataframe
df = pd.read_pickle('foo_data.pkl')
print(df)

#    a  b
# 0  1  x
# 1  2  y
# 2  3  z

print(df.name)
# foo
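
As a possible extension beyond pickling, the sketch below also defines the _constructor property described in the pandas documentation on subclassing, so that results of operations such as column selection stay NamedDataFrame instances and carry the name along. Making name optional (default None) is an assumption of this sketch, needed because pandas calls the constructor internally without a name; exact behaviour may vary between pandas versions.

```python
import os
import tempfile

import pandas as pd


class NamedDataFrame(pd.DataFrame):
    # Names listed here are pickled and propagated to derived frames.
    _metadata = ["name"]

    def __init__(self, *args, name=None, **kwargs):
        # `name` is optional here (an assumption of this sketch) because
        # pandas builds intermediate frames without passing it.
        super().__init__(*args, **kwargs)
        self.name = name

    @property
    def _constructor(self):
        # Tell pandas to build results of operations as NamedDataFrame.
        return NamedDataFrame


ndf = NamedDataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]}, name="foo")

# Column selection returns the subclass and keeps the metadata.
subset = ndf[["a"]]
print(type(subset).__name__, subset.name)

# The derived frame still round-trips through pickle with its name.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo_data.pkl")
    subset.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.name)
```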
Shubham Sharma