
What I want is to be able to handle sets of data that have a fixed set of keys. All keys are strings. The data will never be edited. I know this can be done with normal dicts like so:

data_a = {'key1': 'data1a', 'key2': 'data2a', 'key3': 'data3a'}
data_b = {'key1': 'data1b', 'key2': 'data2b', 'key3': 'data3b'}
data_c = {'key1': 'data1c', 'key2': 'data2c', 'key3': 'data3c'}

I need to be able to retrieve values like so:

data_a['key1'] # Returns 'data1a'

However, this looks like a waste of memory (since dictionaries apparently keep themselves about 1/3 empty, and every dict stores the same keys again), and it's also tedious to create since I need to keep entering the same keys over and over again in my code. I also risk accidentally changing something in the datasets.
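
As a rough check on that claim, `sys.getsizeof` reports the container overhead of an object (not counting the keys and values it references); exact numbers vary by Python version and build:

import sys

data_dict = {'key1': 'data1a', 'key2': 'data2a', 'key3': 'data3a'}
data_tuple = ('data1a', 'data2a', 'data3a')

print(sys.getsizeof(data_dict))   # noticeably larger; exact size depends on the Python version
print(sys.getsizeof(data_tuple))  # a 3-element tuple is several times smaller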

My current solution is to have a set of keys stored in a tuple first, then store the data as tuples too. It looks like this:

keys = ('key1', 'key2', 'key3')
data_a = ('data1a', 'data2a', 'data3a')
data_b = ('data1b', 'data2b', 'data3b')
data_c = ('data1c', 'data2c', 'data3c')

To retrieve data, I would do this:

data_a[keys.index('key1')] # Returns 'data1a'
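
Since `tuple.index` rescans the keys on every lookup, a small refinement of this scheme (just a sketch of the same idea) is to build the key-to-position mapping once and share it across all the data tuples:

keys = ('key1', 'key2', 'key3')
key_index = {key: i for i, key in enumerate(keys)}   # built once, shared by all rows

data_a = ('data1a', 'data2a', 'data3a')
data_a[key_index['key1']]   # 'data1a', without a linear scan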

Then, I learned about this thing called namedtuples, which seem to be able to do what I need:

import collections
Data = collections.namedtuple('Data', ('key1', 'key2', 'key3'))
data_a = Data('data1a', 'data2a', 'data3a')
data_b = Data('data1b', 'data2b', 'data3b')
data_c = Data('data1c', 'data2c', 'data3c')

However, it appears I can't simply look the value up by its key. Instead, to retrieve the data by key, I have to use getattr, which doesn't seem very intuitive:

getattr(data_a,'key1') # Returns 'data1a'
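
For what it's worth, a namedtuple can also hand back its contents as a mapping via `_asdict()` (using the `data_a` defined above), although that builds a new dict on every call:

data_a._asdict()['key1']   # 'data1a'; _asdict() returns the fields as a dict
                           # (an OrderedDict on older Python versions)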

My criteria are memory efficiency first, then performance. Of these three methods, which would be the best way to do things? Or am I missing something and there's a more Pythonic idiom to get what I want?

EDIT: I've since also learned about the existence of __slots__, which apparently handles key:value pairs more efficiently while consuming pretty much the same(?) amount of memory. Would an implementation along these lines be a suitable alternative to namedtuples?

Eric

2 Answers


namedtuple seems the right thing to use. If your "keys" are fixed, you don't need getattr and can use the normal syntax for retrieving objects' attributes:

In [1]: %paste
import collections
Data = collections.namedtuple('Data', ('key1', 'key2', 'key3'))
data_a = Data('data1a', 'data2a', 'data3a')
data_b = Data('data1b', 'data2b', 'data3b')
data_c = Data('data1c', 'data2c', 'data3c')

## -- End pasted text --

In [2]: data_a.key1
Out[2]: 'data1a'

This usage is also demonstrated in the docs:

>>> # Basic example
>>> Point = namedtuple('Point', ['x', 'y'])
>>> p = Point(11, y=22)     # instantiate with positional or keyword arguments
>>> p[0] + p[1]             # indexable like the plain tuple (11, 22)
33
>>> x, y = p                # unpack like a regular tuple
>>> x, y
(11, 22)
>>> p.x + p.y               # fields also accessible by name
33
>>> p                       # readable __repr__ with a name=value style
Point(x=11, y=22)

You don't usually use getattr if the second argument (attribute name) is constant. It's only needed if it may change:

In [3]: attr = input('Attribute: ')
Attribute: key3

In [4]: getattr(data_b, attr)
Out[4]: 'data3b'
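
If the same dynamic key is looked up many times, `operator.attrgetter` can be bound once and reused as an alternative to repeated `getattr` calls (using `data_b` and `data_c` from above):

import operator

get_key3 = operator.attrgetter('key3')   # build the accessor once
get_key3(data_b)   # 'data3b'
get_key3(data_c)   # 'data3c'
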
Lev Levitsky
  • Well, the problem is that the value to be retrieved may vary, so rather than actually entering 'key1', it's a variable that stores a key (which is a string). So getattr() is necessary. Probably my mistake not to show that in the example. Unless it's possible to edit the `__getitem__` method of the namedtuple? – Eric Jan 13 '13 at 07:42
  • @Eric `__getitem__` already does a sane thing for `namedtuple`, behaving as in the regular tuple (see the example from the docs). But you can subclass it and make it call `getattr` instead (a sketch follows after these comments). Saving yourself typing a couple of characters later is a quite arguable reason for doing that, though. – Lev Levitsky Jan 13 '13 at 07:45
  • What's the memory footprint of a namedtuple compared to a dict? – Paul Hankin Jan 13 '13 at 08:46
  • From the documentation, namedtuples have the same memory overhead as normal tuples. dicts are basically huge, since they only try to keep themselves about 2/3 filled from what I've read, while tuples are fixed in size once defined, so they should be much more efficient. – Eric Jan 13 '13 at 09:38
  • If you want to know how much memory an object is consuming, use the `__sizeof__()` method. For example, a `namedtuple` of 3 elements uses 48 bytes while a `dict` containing three elements uses 248 bytes (plus the memory used by the keys and the values, in both cases). Anyway, I'd suggest to first [profile](http://stackoverflow.com/questions/110259/python-memory-profiler) the memory usage. – Bakuriu Jan 13 '13 at 09:51
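
A minimal sketch of the subclassing idea from the comments above: override `__getitem__` so that string keys fall back to attribute lookup, while integer indexing keeps the normal tuple behaviour (the class and field names here just mirror the question):

from collections import namedtuple

class Data(namedtuple('DataBase', ('key1', 'key2', 'key3'))):
    __slots__ = ()   # no per-instance __dict__, so no extra memory per record

    def __getitem__(self, key):
        # Treat string keys as field names; ints and slices behave as in a plain tuple.
        if isinstance(key, str):
            return getattr(self, key)
        return super(Data, self).__getitem__(key)

data_a = Data('data1a', 'data2a', 'data3a')
data_a['key1']   # 'data1a'
data_a[0]        # 'data1a'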

Yes, __slots__ should do.

class Data(object):                  # new-style class so __slots__ takes effect on Python 2
    __slots__ = ["key1", "key2"]     # fixed attribute set, no per-instance __dict__

    def __init__(self, k1, k2):
        self.key1, self.key2 = k1, k2

    def __getitem__(self, key):
        # Only the declared field names are valid keys.
        if key not in self.__slots__:
            raise KeyError("%r not found" % key)
        return getattr(self, key)

Let's try that out:

>>> Data(1, 2)["key1"]
1

The conditional on key not in self.__slots__ is a sanity check; getattr would happily fetch __init__ for us if it weren't present.
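
For example, with the `Data` class above (output abbreviated):

>>> d = Data(1, 2)
>>> getattr(d, "__init__")    # getattr alone would happily return methods
<bound method Data.__init__ of <__main__.Data object at 0x...>>
>>> d["__init__"]             # the membership check turns that into an error
Traceback (most recent call last):
  ...
KeyError: "'__init__' not found"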

Fred Foo