How to organize (many) methods that extract data from a few larger data sources?

Question

What I am trying to do is functionally easy enough to code, but I am looking for a clean and scalable way to organize this... "Beautiful is better than ugly."

I have an app where we are gathering information about an item from several different places. Once the raw data is gathered, we need to extract many attributes from this data.

The interface to the item's data should work like a simple dict, accessing the attributes with a friendly key name. I know I can accomplish this by subclassing one of the ABCs from the collections module.
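
As a minimal sketch of that dict-like interface, assuming the attributes have already been extracted into a plain dict: the class name `ItemView` and the sample keys are hypothetical, and the import fallback is there only because the app targets both Python 2.7 and 3.x.

```python
try:
    from collections.abc import Mapping  # Python 3.3+
except ImportError:
    from collections import Mapping  # Python 2.7


class ItemView(Mapping):  # hypothetical name, for illustration only
    """Read-only, dict-like view over already extracted attributes."""

    def __init__(self, attributes):
        self._attributes = dict(attributes)

    # Mapping only requires these three methods; get(), keys(), items(),
    # values() and the `in` operator are derived from them.
    def __getitem__(self, key):
        return self._attributes[key]

    def __iter__(self):
        return iter(self._attributes)

    def __len__(self):
        return len(self._attributes)


view = ItemView({'name': 'widget', 'Attribute 1': 42})
print(view['name'])
print('Attribute 1' in view)
```

Subclassing `Mapping` rather than `dict` keeps the interface read-only and leaves `__getitem__` free to do lazy extraction later.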

My question is: How do I (cleanly) organize the many functions needed to extract the data attributes?

I could use cached properties... Break out the data sources and extraction functions into separate modules / classes...
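
For reference, the cached-property option might look roughly like this. It is a sketch only: `cached_property` here is a hand-written descriptor (not a library import, so it also runs on Python 2.7), and `Item`, `attr_1`, and the `calls` counter are hypothetical names used for illustration.

```python
class cached_property(object):
    """Non-data descriptor that computes a value once per instance."""

    def __init__(self, func):
        self.func = func
        self.__doc__ = func.__doc__

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        value = self.func(instance)
        # Store the result in the instance dict under the same name, so
        # later lookups find it there and never reach this descriptor.
        instance.__dict__[self.func.__name__] = value
        return value


class Item(object):  # hypothetical example class
    def __init__(self):
        self.calls = 0  # counts how often the extraction really runs

    @cached_property
    def attr_1(self):
        self.calls += 1
        return 'value extracted from source A'


item = Item()
item.attr_1
item.attr_1
print(item.calls)
```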

I have considered putting all of them in a single class, as follows:

class item_data(object):

    def __init__(self, item_name):
        self.name = item_name
        self._item_data = {'name': item_name}
        # The loaders are methods, so they must be called through self
        self._data_a = self._get_data_from_source_a()
        self._data_b = self._get_data_from_source_b()

    def _get_data_from_source_a(self):
        pass

    def _get_data_from_source_b(self):
        pass

    def _extract_attr_1(self):
        # Extract some data attribute from _data_a
        pass

    def _extract_attr_2(self):
        # Extract some data attribute from _data_a
        pass

    def _extract_attr_3(self):
        # Extract some data attribute from _data_b
        pass

    _attr_extract_methods = {
        'Attribute 1': _extract_attr_1,
        'Attribute 2': _extract_attr_2,
        'Attribute 3': _extract_attr_3,
    }

    def __getitem__(self, item):
        if item not in self._item_data:
            # Cache the extracted value so each attribute is computed once
            self._item_data[item] = self._attr_extract_methods[item](self)
        return self._item_data[item]

If I wanted to break out the data sources (and the associated attribute extraction functions) into their own modules / classes, how can I do so in a clean and scalable fashion so that new data sources and attributes can be easily added later? Is there a way to enable the data source classes to register themselves and their associated attributes with the top-level class?

Note: This is being used in an app that is written to support both Python 2.7+ and 3.x.

Fomalhaut · Answer 1 · 2016-11-29T03:49:01.997

I think you could treat extraction from a source as its own piece of logic and implement it in a separate class. It would look like this:

class DataSource(object):
    def load(self):
        pass  # do loading

    def __getitem__(self, item):
        if item == "Attribute 1":
            return ...
        elif item == "Attribute 2":
            return ...
        # ... one branch per remaining attribute
        raise KeyError(item)


class ItemData(object):
    def __init__(self, name):
        self.name = name
        self._data = {'name': name}
        self._source = DataSource()
        self._source.load()

    def __getitem__(self, item):
        if item not in self._data:
            return self._source[item]
        else:
            return self._data[item]

And if the logic for loading each attribute from your sources differs a lot, it may be better to use a separate class for each attribute:

class BaseSourceAttribute(object):
    def load(self):
        raise NotImplementedError()


class Attribute1(BaseSourceAttribute):
    def load(self):
        return "1"


class Attribute2(BaseSourceAttribute):
    def load(self):
        return "2"


class DataSource(object):
    attributes = {
        'attr1': Attribute1(),
        'attr2': Attribute2()
    }

    def __init__(self):
        self._data = {}

    def load(self):
        for item, attr_obj in self.attributes.items():
            self._data[item] = attr_obj.load()

    def __getitem__(self, item):
        return self._data[item]


class ItemData(object):
    def __init__(self, name):
        self.name = name
        self._data = {'name': name}
        self._source = DataSource()
        self._source.load()

    def __getitem__(self, item):
        if item not in self._data:
            return self._source[item]
        else:
            return self._data[item]

To load attributes dynamically, on first access rather than all at once, you can modify the `DataSource` class like this, for example:

class DataSource(object):
    attributes = {
        'attr1': Attribute1(),
        'attr2': Attribute2()
    }

    def __init__(self):
        self._data = {}

    def load(self):
        pass

    def __getitem__(self, item):
        if item not in self._data:
            self._data[item] = self.attributes[item].load()
        return self._data[item]
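
The question also asks whether data sources can register themselves and their attributes with the top-level class. One possible shape for that is a class decorator, sketched below under stated assumptions: the `register` classmethod, the `_registry` dict, `SourceA`, and the `Colour`/`Size` attributes are all hypothetical, and the hard-coded `raw` dict stands in for a real fetch. Class decorators work on both Python 2.7 and 3.x.

```python
class ItemData(object):
    # Hypothetical registry: attribute name -> (source class, extractor)
    _registry = {}

    @classmethod
    def register(cls, source_cls):
        """Class decorator: record every attribute a source declares."""
        for attr_name, extractor in source_cls.attributes.items():
            cls._registry[attr_name] = (source_cls, extractor)
        return source_cls

    def __init__(self, name):
        self.name = name
        self._data = {'name': name}
        self._sources = {}  # source class -> instance, created on demand

    def __getitem__(self, item):
        if item not in self._data:
            source_cls, extractor = self._registry[item]  # KeyError if unknown
            if source_cls not in self._sources:
                self._sources[source_cls] = source_cls(self.name)
            self._data[item] = extractor(self._sources[source_cls])
        return self._data[item]


@ItemData.register
class SourceA(object):  # hypothetical data source
    def __init__(self, item_name):
        self.raw = {'colour': 'red', 'size': 3}  # stand-in for a real fetch

    def colour(self):
        return self.raw['colour']

    def size(self):
        return self.raw['size']

    # Friendly key name -> extractor defined on this source
    attributes = {'Colour': colour, 'Size': size}


item = ItemData('widget')
print(item['Colour'])
print(item['Size'])
```

With this layout, adding a source means writing one decorated class; `ItemData` itself does not change, and each source is instantiated only when one of its attributes is first requested.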

I like where you are going with this recommendation... I do like breaking the sources out to their own classes with the related methods to extract the attributes from that source's data. However, I'm not sure I like having the interface `ItemData` class check every source for an attribute. It is certainly one way to do it. I am looking at some ways that I might be able to dynamically register the data source classes with the interface class and have them 'load' the attributes that they can provide. — cmlccie, Nov 28 '16 at 17:00
It's easy to implement the dynamic way to load data in the model `DataSource` if I understand you correctly. I have appended my answer. — Fomalhaut, Nov 29 '16 at 03:50
I am referring to having the DataSources dynamically register their attributes with the ItemData class, to improve scalability. Essentially I'm trying to minimize the changes needed when adding additional DataSources. I want 'all attributes' to be available through the ItemData class, with minimal work needed to add additional DataSources that provide additional attributes. — cmlccie, Dec 03 '16 at 04:35
