Hashing a pandas dataframe for calculated column caching

Question

I am using composition to create a class that contains a pandas dataframe, as shown below. I create a derived property by performing some operation on the base columns.

import numpy as np
import pandas as pd

class myclass:
    def __init__(self, *args, **kwargs):
        self.df = pd.DataFrame(*args, **kwargs)
    @property
    def derived(self):
        return self.df.sum(axis=1)

myobj = myclass(np.random.randint(100, size=(100,6)))
d = myobj.derived

The calculation of derived is expensive, so I would like to cache it. I want to use functools.lru_cache for this. However, it requires the object to be hashable. I tried creating a __hash__ function for the object as detailed in this answer: https://stackoverflow.com/a/47800021/3679377.

Now I run into a new problem: the hashing function itself is an expensive step! Is there any way to get around this problem? Or have I reached a dead end?

Is there a better way to check whether a dataframe has been modified and, if not, keep returning the same hash?

'I am creating a custom class by extending a pandas dataframe as shown below.' - You are not extending. You have a class that contains a dataframe. see https://www.packetflow.co.uk/python-inheritance-vs-composition/ — balderman, Aug 19 '20 at 09:38
True, I'm using composition. I'll reframe my question like that. It's just that I went by the title of pandas' help page. https://pandas.pydata.org/pandas-docs/stable/development/extending.html — najeem, Aug 19 '20 at 09:43
Do you want to avoid the calculation of `derived` in the case where `self.df` was not changed? — balderman, Aug 19 '20 at 09:53
Do you want to handle only the `derived` operation or do you wish to have a system that you can extend to some other operations on this dataframe ? — efont, Aug 19 '20 at 10:00
I have more than one derived property. As an example, i have shown only one. So a system is desirable. — najeem, Aug 19 '20 at 10:03
@najeem did you look here? https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-pandas — balderman, Aug 19 '20 at 10:41
@balderman, yes. I didn't find anything which will help my requirement in there. — najeem, Aug 19 '20 at 11:02
@najeem I agree :-( . I dont see any pandas callback that will let you know that the df data was modified. — balderman, Aug 19 '20 at 11:05
Why wouldn't you just create a copy of the dataframe into eg old_df (this becomes your `cache`) and then prior to calculating the sum check if df == old_df? Is there a reason you need something more complex? — kerasbaz, Aug 19 '20 at 12:29
This is a class the user will interact with his own code. So when will I take a copy of the dataframe? I'll have to tap into all the events which will modify the dataframe and then make a copy? If I knew that much, I can as well cache the data only when required. — najeem, Aug 20 '20 at 12:42
Can you provide an example of your data? How big is it, what are the dtypes? Also, do you control all methods/functions that might mutate the dataframe, or not? You could alternatively register data frames that have been mutated by functions — anon01, Aug 21 '20 at 17:26
@anon01 I have a sample in the question itself. In the actual problem i'm trying to solve, i have a 2 level multi index and 7 column of float data. It's actually a stress tensor. I calculate the eigen values (which are principal stresses) for this tensor as derived properties. All three eigen values are calculated in one go, however, the user will need only one at a time. So i'd like to cache the rest, in case the user asks for that later. However, it will not work if the dataframe has been changed between queries. The most common requirement itself runs into couple of million rows. — najeem, Aug 22 '20 at 11:12
If it's hard to *detect* changes, would it be feasible to *prevent* changes to the data frame? E.g., by setting `ndarray.flags.writeable` to False, for the NumPy ndarray that backs the data frame? — jsmart, Aug 25 '20 at 02:49
Even though the cases in which the user will modify the dataframe are few, I dont want to take away the possiblity completely. If I wanted the dataframe to be uneditable, I can make the dataframe a property without a `setter`. — najeem, Aug 25 '20 at 06:33
In case df is not changed after initialization (as in the example), you could use the built-in [`functools.cached_property`](https://docs.python.org/3/library/functools.html#functools.cached_property) decorator. — kadee, Apr 29 '21 at 08:26
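
A minimal sketch of the `functools.cached_property` suggestion from the last comment, assuming Python 3.8+ and a dataframe that is not modified after initialization. The class name `CachedClass` is only illustrative. The cache knows nothing about in-place changes to `df`, so it has to be cleared by hand with `del`.

```python
from functools import cached_property

import numpy as np
import pandas as pd


class CachedClass:  # hypothetical name, not from the question
    def __init__(self, *args, **kwargs):
        self.df = pd.DataFrame(*args, **kwargs)

    @cached_property
    def derived(self):
        # Runs once; the result is then stored in the instance __dict__.
        return self.df.sum(axis=1)


obj = CachedClass(np.arange(6).reshape(2, 3))
first = obj.derived    # computed
second = obj.derived   # served from the cache, same object

obj.df.iloc[0, 0] = 10  # in-place change: the cache is NOT invalidated
stale = obj.derived     # still the old result

del obj.derived         # manual invalidation
fresh = obj.derived     # recomputed from the modified dataframe
```

This only helps when the code that modifies the dataframe can also clear the cache, which is exactly the part the question says it cannot control.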

efont · Answer 1 · 2020-08-20T16:48:03.670

4

If hashing doesn't work for you, you can try to take advantage of the internal state of your class.

Cache one method

Use an instance attribute as a cache: on the first call of the method, store the result in this attribute, and retrieve it on subsequent calls. Assigning a new dataframe through the setter empties the cache.

import pandas as pd

class MyClass:
    def __init__(self, *args, **kwargs):
        self._df = pd.DataFrame(*args, **kwargs)
        self._cached_value = None

    @property
    def df(self):
        return self._df

    @df.setter
    def df(self, value):
        self._cached_value = None
        self._df = value

    @property
    def derived(self):
        if self._cached_value is None:
            self._cached_value = self._df.sum(axis=1)
        return self._cached_value

cl = MyClass()
cl.derived  # compute
cl.derived  # return cached value

cl.df = pd.DataFrame([[1, 2], [3, 4]])  # assigning a new dataframe empties the cache
cl.derived  # compute

Cache several methods

You can then extend this principle to several methods, using a dict to store the result of each operation. You can use the method names as the keys of this dict (thanks to the inspect module; see this response for an example).

import pandas as pd
import inspect

class MyClass:
    def __init__(self, *args, **kwargs):
        self.df = pd.DataFrame(*args, **kwargs)
        self._cached_values = {}

    @property
    def derived(self):
        method_name = self._get_method_name()
        if method_name not in self._cached_values:
            self._cached_values[method_name] = self.df.sum(axis=1)
        return self._cached_values[method_name]

    @property
    def derived_bis(self):
        method_name = self._get_method_name()
        if method_name not in self._cached_values:
            # your_expensive_op is a placeholder for your own computation
            self._cached_values[method_name] = your_expensive_op
        return self._cached_values[method_name]

    def _get_method_name(self):
        return inspect.stack()[1][3]  # returns the name of this method's caller


cl = MyClass()
cl.derived  # compute  --> self._cached_values = {'derived': your_result}
cl.derived  # return cached value

cl.derived_bis # compute  --> self._cached_values = {'derived': your_result, 'derived_bis': your_other_result}
cl.derived_bis # return cached value

You can factor out the shared body of the two properties to respect the DRY principle, but be sure to modify _get_method_name accordingly.
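
One way to do that factoring, as a sketch rather than the answer's own code: a small decorator (the name `cached_by_name` is made up here) stores each result under the wrapped function's name, so `inspect` is no longer needed. It combines the dict cache with the setter from the first section, and `max` stands in for a second expensive operation.

```python
import functools

import numpy as np
import pandas as pd


def cached_by_name(func):
    """Hypothetical helper: cache func's result in self._cached_values."""
    @functools.wraps(func)
    def wrapper(self):
        name = func.__name__
        if name not in self._cached_values:
            self._cached_values[name] = func(self)
        return self._cached_values[name]
    return property(wrapper)


class MyClass:
    def __init__(self, *args, **kwargs):
        self._cached_values = {}
        self._df = pd.DataFrame(*args, **kwargs)

    @property
    def df(self):
        return self._df

    @df.setter
    def df(self, value):
        self._cached_values.clear()  # a new dataframe empties every cache entry
        self._df = value

    @cached_by_name
    def derived(self):
        return self._df.sum(axis=1)

    @cached_by_name
    def derived_bis(self):
        return self._df.max(axis=1)  # stand-in for another expensive operation


cl = MyClass(np.arange(6).reshape(2, 3))
first = cl.derived                        # computed
again = cl.derived                        # cached
cl.df = pd.DataFrame(np.ones((2, 3)))     # setter clears the cache
after_reset = cl.derived                  # recomputed
```

As with the code above, only assignment through the setter clears the cache; in-place edits such as `cl.df[0] *= 5` go unnoticed.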

edited Aug 20 '20 at 16:48

answered Aug 19 '20 at 12:39


This will not work if the dataframe was changed in between subsequent calls to derived! for eg: `c1 = MyClass(); c1.derived; c1.df*=10; c1.derived` will give me the already cached data which is wrong. The code should know enough to throw away the cache when I modify the `df`. – najeem Aug 20 '20 at 12:48
Ah yes, I had not understood this was a requirement, my bad. But it is still possible to make it work if you empty the cache when updating the value of the dataframe. This can be done as a first step of your `setter` :) – efont Aug 20 '20 at 13:33
Exactly. So when does the class know 'now my dataframe has changed'? I was trying to achieve all this using `lru_cache`. however, it requires that i compute a hash value. I can set up a hash value for the dataframe based on easily computable stuffs like for eg: the sum of all values in the df. But it's not fool proof. Any decent hash value takes as much time as the `derived` property itself. – najeem Aug 20 '20 at 15:56
I have edited the first part of my answer to depict the full mechanism. Does it resemble what you were looking for ? If yes I will edit the rest of the answer accordingly. Also I don't think the `derived` methods should be `properties`, so I will remove them to be clearer. – efont Aug 20 '20 at 16:51
The setter will get called only when the dataframe is **set**, not when it's modified. For example, changing the first column in the dataframe like `self.df[0] *= 5` will not trigger the setter, and hence the derived property will be out of date and show an erroneous value. – najeem Aug 20 '20 at 19:10
I think the solution above + adding a hash check, i.e. `pandas.util.hash_pandas_object(df)`, on each call to derived would work. Hashing every single time is modest overhead, but if you need to detect changes I don't see another way. I don't think dataframes have an event model. – Doug F Aug 22 '20 at 07:12
Yeah, checking the hash everytime does not help because it will take almost similar time as the derived operation. I wanted to know if a dataframe inherently have some way to track changes. Probably not! – najeem Aug 22 '20 at 11:09
What about subclassing `pd.DataFrame` and overriding the `__setitem__` method so that on each call it updates an attribute `last_updated` which is a timestamp? Then you store the result of your costly method in your class `MyClass` under this timestamp key: 1/ retrieve it if the dataframe hasn't changed, 2/ recompute if the timestamp is different. – efont Aug 22 '20 at 13:49
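
The hash check discussed in these comments could look roughly like the sketch below. `frame_fingerprint` and `HashCheckedClass` are illustrative names; `pandas.util.hash_pandas_object` returns one hash per row, which is condensed here into a single digest with `hashlib`. As najeem points out, the hash is recomputed on every access, so this only pays off when hashing is clearly cheaper than the derived calculation.

```python
import hashlib

import numpy as np
import pandas as pd


def frame_fingerprint(df):
    """Hypothetical helper: one digest built from the per-row hashes of df."""
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    return hashlib.sha256(row_hashes.to_numpy().tobytes()).hexdigest()


class HashCheckedClass:
    def __init__(self, *args, **kwargs):
        self.df = pd.DataFrame(*args, **kwargs)
        self._fingerprint = None
        self._cached_values = {}

    def _drop_cache_if_changed(self):
        current = frame_fingerprint(self.df)
        if current != self._fingerprint:
            self._cached_values.clear()  # data differs from last access
            self._fingerprint = current

    @property
    def derived(self):
        self._drop_cache_if_changed()
        if "derived" not in self._cached_values:
            self._cached_values["derived"] = self.df.sum(axis=1)
        return self._cached_values["derived"]


obj = HashCheckedClass(np.arange(6).reshape(2, 3))
first = obj.derived
again = obj.derived       # fingerprint unchanged, cached value returned
obj.df[0] *= 5            # in-place change, no setter involved
updated = obj.derived     # fingerprint differs, value recomputed
```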

Bastien Harkins · Answer 2 · 2020-08-26T10:12:06.657

If you know which methods are likely to update your df, you could override them in your custom class, and keep a flag. I'm not going into details here, but here is the basic principle:

import numpy as np
import pandas as pd

class myclass:
    def __init__(self, *args, **kwargs):
        self.df = pd.DataFrame(*args, **kwargs)
        self.derived_is_calculated = False
        self._derived = None  # holds the cached result

    @property
    def derived(self):
        if not self.derived_is_calculated:
            self._derived = self.df.sum(axis=1)  # store the result for later calls
            self.derived_is_calculated = True
        return self._derived

    def update(self, other, **kwargs):
        """ Implements the normal update method, and sets a flag to track if df has changed """
        old_df = self.df.copy()  # Make a copy for comparison
        pd.DataFrame.update(self.df, other, **kwargs)  # Call the base 'update' method
        if not self.df.equals(old_df):  # Compare before and after update
            self.derived_is_calculated = False

random_array = np.random.randint(100, size=(2,10))
myobj = myclass(random_array)

print(myobj.derived) # Prints the summed df
print(myobj.derived) # Prints the cached sum without recomputing

myobj.update([1,2,3])
print(myobj.derived) # Prints the new summed df

There is probably a deeper method of DataFrame or pandas that is called on every change in the DataFrame content, I'll keep looking.

But you could set up a list of the methods that your program will use, and make a decorator that does basically what I did in update, then apply it to each of the listed methods...
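
A rough sketch of that idea, written as a wrapper factory rather than a literal decorator. The names `tracked`, `TrackedClass` and `TRACKED_METHODS` are invented for illustration, and the list of methods is an assumption: it only covers in-place methods you choose to expose, not every way a user can modify the dataframe.

```python
import numpy as np
import pandas as pd


def tracked(method_name):
    """Hypothetical factory: wrap DataFrame.<method_name> and reset the cache."""
    def wrapper(self, *args, **kwargs):
        old_df = self.df.copy()  # snapshot for comparison
        result = getattr(self.df, method_name)(*args, **kwargs)
        if not self.df.equals(old_df):
            self._derived = None  # data changed: drop the cached result
        return result
    wrapper.__name__ = method_name
    return wrapper


class TrackedClass:
    # Assumed list of in-place DataFrame methods the program relies on.
    TRACKED_METHODS = ("update", "insert")

    def __init__(self, *args, **kwargs):
        self.df = pd.DataFrame(*args, **kwargs)
        self._derived = None

    @property
    def derived(self):
        if self._derived is None:
            self._derived = self.df.sum(axis=1)
        return self._derived


# Attach one wrapper per listed method.
for _name in TrackedClass.TRACKED_METHODS:
    setattr(TrackedClass, _name, tracked(_name))


obj = TrackedClass(np.zeros((2, 3)))
first = obj.derived
again = obj.derived                         # cached
obj.update(pd.DataFrame({0: [5.0, 6.0]}))   # goes through the wrapper
updated = obj.derived                       # recomputed
```

The copy and `equals` comparison cost time and memory on every tracked call; dropping the cache unconditionally after each tracked method would be cheaper if occasional unnecessary recomputation is acceptable.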

Thanks, but I dont know how the user will modify the dataframe. It's a regular pandas dataframe and I believe there are quite a lot of ways in which it can be modified. — najeem, Aug 26 '20 at 10:36
Actually I believe all updates to a `pd.DataFrame` go through the `__setitem__` method (though I didn't check thoroughly). — efont, Aug 27 '20 at 12:03
