
I have defined a custom class in Python as a dictionary of dataframes:

class DictOfDF(dict):

    __slots__ = ()

    def __init__(self, *args, **kwargs):
        super(DictOfDF, self).__init__(*args, **kwargs)

    def __getitem__(self, key):
        return super(DictOfDF, self).__getitem__(key)

    def __setitem__(self, key, value):
        return super(DictOfDF, self).__setitem__(key, value)

    def __delitem__(self, key):
        return super(DictOfDF, self).__delitem__(key)

    def get(self, key, default=None):
        return super(DictOfDF, self).get(key, default)

    def setdefault(self, key, default=None):
        return super(DictOfDF, self).setdefault(key, default)

    def pop(self, key, value):
        return super(DictOfDF, self).pop(key, value)

    def update(self, *args, **kwargs):
        super(DictOfDF, self).update(*args, **kwargs)

    def __contains__(self, key):
        return super(DictOfDF, self).__contains__(key)

    def copy(self):
        return type(self)(self)

    def __repr__(self):
        return '{0}({1})'.format(type(self).__name__, super(DictOfDF, self).__repr__())

To avoid a discussion of whether or not subclassing from dict is preferable to subclassing from UserDict etc., note that the above is inspired by the answer here: https://stackoverflow.com/a/39375731/19682557

I want to define a 'loc' property for this DictOfDF class such that:

import numpy as np
import pandas as pd
import datetime as dt

class DictOfDF(dict):

    ...

x = DictOfDF({'x1': pd.DataFrame(np.nan, index=pd.date_range(dt.datetime(2000, 1, 1), dt.datetime(2000, 12, 31)),
                                 columns=['a', 'b', 'c']),
              'x2': pd.DataFrame(np.nan, index=pd.date_range(dt.datetime(2000, 1, 1), dt.datetime(2000, 12, 31)),
                                 columns=['a', 'b', 'c'])})

# x.loc['2000-03':'2000-04',['a','b']] should return a DictOfDF whose two dataframes are subsetted to the date range 2000-03/2000-04 and columns 'a' and 'b'

My idea was to add a property like the following to the class definition, but it doesn't work:

class DictOfDF(dict):
    
    ...
    
    @property
    def loc(self):
        return DictOfDF({key: value._LocIndexer for key, value in self.items()})

I get the error

AttributeError: 'DataFrame' object has no attribute '_LocIndexer'

I feel that I am on the right track, but any suggestions for fixing this would be much appreciated. Knowing how to define a similar 'iloc' property would also be useful, in case the custom implementation of this is materially different to 'loc'.

tonkotsu

1 Answer


Nice job on using a dict comprehension! You were almost there; you just have to "forward" the call to all DataFrame objects:

class DictOfDF(dict):
    @property
    def loc(self):
        locs = {k: v.loc for k, v in self.items()}

        class Forwarder:
            def __getitem__(self, item):
                d = {k: v.__getitem__(item) for k, v in locs.items()}
                return DictOfDF(d)

        return Forwarder()

The implementation consists of two steps:

  1. Get an indexer for each dataframe via the .loc attribute
  2. Get the dataframe subsets using the acquired indexers

Getting the indexers

Accessing DataFrame.loc without a key returns its indexer:

>>> x["x1"].loc
<pandas.core.indexing._LocIndexer object at ...>

Your code raises an AttributeError because _LocIndexer isn't a DataFrame attribute. It's a class defined in the pandas.core.indexing module.

Using a dict comprehension, we get a dict of indexers:

{'x1': <pandas.core.indexing._LocIndexer object at ...>, 'x2': <pandas.core.indexing._LocIndexer object at ...>}
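
As a quick check of the point above, the sketch below builds one dataframe like those in the question and inspects what .loc returns. The variable names are only illustrative, and the class name _LocIndexer is a pandas internal that may differ between versions.

```python
import numpy as np
import pandas as pd

# One dataframe shaped like those in the question
df = pd.DataFrame(
    np.nan,
    index=pd.date_range("2000-01-01", "2000-12-31"),
    columns=["a", "b", "c"],
)

# Accessing .loc without a key returns the indexer object itself
indexer = df.loc
indexer_type = type(indexer).__name__

# The DataFrame has no attribute called _LocIndexer, hence the AttributeError
has_attr = hasattr(df, "_LocIndexer")
```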

Getting the subsets

To support the .loc[key] syntax, we define a class that implements __getitem__. From the Python docs:

Called to implement evaluation of self[key]

So the statement x.loc[...] is equivalent to x.loc.__getitem__(...). We can use this to call __getitem__ for each dataframe using another dict comprehension:

{k: v.__getitem__(item) for k, v in locs.items()}

This returns the desired dataframe subsets.
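
Putting both steps together with the data from the question, a minimal end-to-end sketch could look like this. It repeats the class from the top of this answer (using v[item] as shorthand for v.__getitem__(item)) and assumes the two dataframes share a daily DatetimeIndex for the year 2000.

```python
import numpy as np
import pandas as pd


class DictOfDF(dict):
    @property
    def loc(self):
        # Step 1: one .loc indexer per dataframe
        locs = {k: v.loc for k, v in self.items()}

        class Forwarder:
            def __getitem__(self, item):
                # Step 2: apply the same key to every indexer
                return DictOfDF({k: v[item] for k, v in locs.items()})

        return Forwarder()


index = pd.date_range("2000-01-01", "2000-12-31")
x = DictOfDF({
    "x1": pd.DataFrame(np.nan, index=index, columns=["a", "b", "c"]),
    "x2": pd.DataFrame(np.nan, index=index, columns=["a", "b", "c"]),
})

# The key arrives in __getitem__ as the tuple (slice, list)
subset = x.loc["2000-03":"2000-04", ["a", "b"]]
```

Here subset is again a DictOfDF, and each of its dataframes covers March and April 2000 with columns 'a' and 'b' only.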

Why the nested class?

Here, the nested class might be a necessary evil. We need somewhere to implement the __getitem__ method, and a nested class was the first thing that occurred to me.
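
If the nested class bothers you, one possible alternative is a small module-level helper that takes the name of the indexer as a parameter. The sketch below is an assumption on my part rather than the only way to do it; the name _Forwarder and its arguments are made up for illustration. It also covers iloc, since both indexers are used through the same [...] syntax.

```python
import numpy as np
import pandas as pd


class _Forwarder:
    """Hypothetical helper: applies one key to the same indexer of every dataframe."""

    def __init__(self, frames, indexer_name):
        self._frames = frames
        self._indexer_name = indexer_name  # "loc" or "iloc"

    def __getitem__(self, item):
        frames = self._frames
        # getattr(df, "loc") or getattr(df, "iloc") returns that dataframe's indexer
        return type(frames)(
            {k: getattr(df, self._indexer_name)[item] for k, df in frames.items()}
        )


class DictOfDF(dict):
    @property
    def loc(self):
        return _Forwarder(self, "loc")

    @property
    def iloc(self):
        return _Forwarder(self, "iloc")


index = pd.date_range("2000-01-01", "2000-12-31")
x = DictOfDF({
    "x1": pd.DataFrame(np.nan, index=index, columns=["a", "b", "c"]),
    "x2": pd.DataFrame(np.nan, index=index, columns=["a", "b", "c"]),
})

by_label = x.loc["2000-03":"2000-04", ["a", "b"]]
by_position = x.iloc[:5, [0, 1]]
```

The indexers are looked up when the key is applied rather than when the property is accessed, so the helper always sees the current contents of the dict.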

Side notes

  1. This implementation only supports read access. To provide write access, you also need to implement __setitem__.

  2. Implementing this kind of dict/dataframe hybrid feels "hacky". Maybe there's another way to model your data (e.g. MultiIndex instead of dict of dataframes)?

  3. You probably already know this, but it's considered bad practice to access internals like pandas.core.indexing._LocIndexer. This is because they aren't part of the public API and can, consequently, change without notice. So you can't really depend on _LocIndexer being there.
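
To illustrate side note 1, here is a rough sketch of what write access could look like. It assumes you want the same value assigned to the same selection in every dataframe, which may not match your use case; the helper name _Forwarder is hypothetical.

```python
import numpy as np
import pandas as pd


class _Forwarder:
    """Hypothetical helper that forwards reads and writes to every dataframe."""

    def __init__(self, frames):
        self._frames = frames

    def __getitem__(self, item):
        frames = self._frames
        return type(frames)({k: df.loc[item] for k, df in frames.items()})

    def __setitem__(self, item, value):
        # Assign through each dataframe's own .loc so the originals are modified
        for df in self._frames.values():
            df.loc[item] = value


class DictOfDF(dict):
    @property
    def loc(self):
        return _Forwarder(self)


index = pd.date_range("2000-01-01", "2000-12-31")
x = DictOfDF({
    "x1": pd.DataFrame(np.nan, index=index, columns=["a", "b", "c"]),
    "x2": pd.DataFrame(np.nan, index=index, columns=["a", "b", "c"]),
})

# Calls _Forwarder.__setitem__ with the key (slice, list) and the value 1.0
x.loc["2000-03":"2000-04", ["a", "b"]] = 1.0
```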

  • Excellent, thanks! I also liked your suggestion of using MultiIndex DataFrames and have implemented that instead. If we have class MyDF(DF): ..., is there a way to ensure that for x of class MyDF x.loc[...,...] is also of class MyDF, so that x.loc[...,...].mydf_method() can work properly? Currently x.loc[...,...] defaults to a DataFrame, which prevents this. – tonkotsu Aug 04 '22 at 16:54
  • @tonkotsu In Python, virtually everything is possible. You could, for example, let `MyDF` inherit from `DataFrame` and override the `loc` method. But at this stage I wonder if this is the way forward. Inheritance may seem like the "easy way" in the beginning, but can lead to complications in the future. An alternate approach would be to write a couple of functions that *work with* a `DataFrame` instead of *being one* – Anton Yang-Wälder Aug 05 '22 at 06:50
  • Thanks for your answer, I've asked a separate question on this topic so as not to bog down this comments section. As I will be using quite specific DataFrame structures (i.e. MultiIndexing with specific column names) and will probably want to manipulate/subset the data in specific ways idiosyncratic to the type of data being looked at, I am slightly reluctant to define a function specifically intended for this structure of DataFrame which would then break/not make sense for general DataFrames. – tonkotsu Aug 05 '22 at 09:39
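
For readers following up on the subclassing question in the comments: pandas documents a _constructor property that subclasses can override so that operations which return a new dataframe return the subclass instead. The sketch below assumes that mechanism; MyDF and mydf_method are the placeholder names from the comment, and the method body is made up for illustration.

```python
import pandas as pd


class MyDF(pd.DataFrame):
    @property
    def _constructor(self):
        # pandas uses this when an operation builds a new dataframe from this one
        return MyDF

    def mydf_method(self):
        # Placeholder behaviour: report the number of columns
        return len(self.columns)


x = MyDF({"a": [1, 2, 3], "b": [4, 5, 6]})

# A list of columns keeps the result two-dimensional
sub = x.loc[0:1, ["a"]]
n_columns = sub.mydf_method()
```

Selecting a single column, as in x.loc[:, "a"], yields a Series rather than a MyDF. The pandas documentation on subclassing describes a separate _constructor_sliced property for that case.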