
I am building a Table class to make it easy to retrieve data from a database, manipulate it arbitrarily in memory, then save it back. Ideally, these tables work both interactively in the Python interpreter and in normal code. "Work" means I can use all standard pandas DataFrame features, as well as all custom features from the Table class.

Generally, the tables contain data I use for academic research or personal interest. So the user base is currently just me, but for portability I'm trying to write as generically as possible.

I have seen several threads (example 1, example 2) discussing whether to subclass DataFrame, or use composition. After trying to walk through pandas's subclassing guide I decided to go for composition because pandas itself says this is easier.

The problem is, I want to be able to call any DataFrame function, property, or attribute on a Table, but to do so I have to keep track of every attribute I code into the Table class. See below; the points of interest are metadata and __getattr__, everything else is meant to be illustrative.

class Table(object):
    metadata = ['db', 'data', 'name', 'clean', 'refresh', 'save']

    def __getattr__(self, name):
        if name not in Table.metadata:
            return getattr(self.data, name)  # self.data is the DataFrame
        raise AttributeError(name)

    def __init__(self, db, name):
        # set up Table-specific values
        ...

    def refresh(self):
        # undo all changes since last save
        ...

    # etc.

Obviously, having to explicitly specify the Table attributes versus the DataFrame ones is not ideal (though, to my understanding, this is how pandas implements column names as attributes). I could write tablename.data.foo, but I find that unintuitive and non-Pythonic. Is there a better way to achieve the same functionality?
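For concreteness, here is a runnable reduction of the setup described above, with the database read stubbed out by a throwaway in-memory DataFrame (the `db` argument is unused in this sketch):

```python
import pandas as pd

class Table:
    metadata = ['db', 'data', 'name', 'clean', 'refresh', 'save']

    def __init__(self, db, name):
        self.db = db    # stub: would be a live connection
        self.name = name
        self.data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

    def __getattr__(self, name):
        # __getattr__ is only reached when normal attribute lookup
        # fails, so attributes set in __init__ never arrive here.
        if name not in Table.metadata:
            return getattr(self.data, name)
        raise AttributeError(name)

t = Table(None, 'demo')
print(t.shape)   # delegated to the DataFrame
print(t.name)    # found on the Table instance itself
```

Note that raising AttributeError for unresolvable names (rather than implicitly returning None) keeps hasattr() and pickling behaving sanely.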

Wasdo

1 Answer


Here's my understanding of your desired workflow: (1) you have a table in a database, (2) you read part or all(?) of it into memory as a pandas DataFrame wrapped in a custom class, (3) you make any manipulations you want, and then (4) save the result back to the database as the new state of that table.
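As I read it, that loop reduces to something like the following (sketched with sqlite3 and pandas; the table and column names are made up):

```python
import sqlite3
import pandas as pd

# Stand-in for your existing database
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE scores (id INTEGER, value REAL)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [(1, 0.5), (2, 1.5)])

# (2) read into memory
df = pd.read_sql("SELECT * FROM scores", conn)

# (3) arbitrary in-memory manipulation
df['value'] = df['value'] * 2

# (4) write back as the new state of the table
df.to_sql('scores', conn, if_exists='replace', index=False)
```

The `if_exists='replace'` write is the step my questions below are aimed at, since it discards whatever structure the old table had.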

I'm worried that arbitrary changes to the df could break db features

  • I'm guessing this is a relational db? Do other tables rely on primary keys of this table?
  • Are you trying to keep a certain schema?
  • Are you ok with adding/deleting/renaming columns arbitrarily?

If you decide there is an enumerable set of manipulations, rather than arbitrary ones, then I'd make a separate class method for each.

If you don't care about your db schema and your db table doesn't have relationships with other tables, then I guess you can do arbitrary manipulations in memory and replace the db table each time.

  • In this case I feel you are not benefiting from using a database over a CSV file
  • I guess one benefit could be the db is publicly accessible while the CSV wouldn't be (unless you were using S3 or something)
mitoRibo