0

In my application, I need a fast look up of attributes. Attributes are in this case a composition of a string and a list of dictionaries. These attributes are stored in a wrapper class. Let's call this wrapper class Plane:

class Plane(object):
    def __init__(self, name, properties):
        self.name = name
        self.properties = properties

    @classmethod
    def from_idx(cls, idx):
        if idx == 0:
            return cls("PaperPlane", [{"canFly": True}, {"isWaterProof": False}])
        if idx == 1:
            return cls("AirbusA380", [{"canFly": True}, {"isWaterProof": True}, {"hasPassengers": True}])

To better play with this class, I added a simple classmethod to construct instances by providing and integer.

So now in my application I have many Planes, of the order of 10,000,000. Each of these planes can be accessed by a universal unique id (uuid). What I need is a fast lookup: given an uuid, what is the Plane. The natural solution is a dict. A simple class to generate planes with uuids in a dict and to store this dict in a file may look like this:

class PlaneLookup(object):
    def __init__(self):
        self.plane_dict = {}

    def generate(self, n_planes):
        for i in range(n_planes):
            plane_id = uuid.uuid4().hex
            self.plane_dict[plane_id] = Plane.from_idx(np.random.randint(0, 2))

    def save(self, filename):
        with gzip.open(filename, 'wb') as f:
            pickle.dump(self.plane_dict, f, pickle.HIGHEST_PROTOCOL)

    @classmethod
    def from_disk(cls, filename):
        pl = cls()
        with gzip.open(filename, 'rb') as f:
            pl.plane_dict = pickle.load(f)
        return pl

So now what happens is that if I generate some planes?

pl = PlaneLookup()
pl.generate(1000000)

What happens is, that lots of memory gets consumed! If I check the size of my pl object with the getsize() method from this question, I get on my 64bit machine a value of 1,087,286,831 bytes. Looking at htop, my memory demand seems to be even higher (around 2GB). In this question, it is explained quite well, why python dictionaries need much memory.

However, I think this does not have to be the case in my application. The plane object that is created in the PlaneLookup.generate() method contains very often the same attributes (i.e. the same name and the same properties). So it has to be possible, to save this object once in the dict and whenever the same object (same name, same attribute) is created again, only a reference to the already existing dict entry is stored. As a simple Plane object has a size of 1147 bytes (according to the getsize() method), just saving references may save a lot of memory!

The question is now: How do I do this? In the end I need a function that takes a uuid as an input and returns the corresponding Plane object as fast as possible with as little memory as possible. Maybe lru_cache can help?

Here is again the full code to play with: https://pastebin.com/iTZyQQAU

martineau
  • 119,623
  • 25
  • 170
  • 301
Merlin1896
  • 1,751
  • 24
  • 39
  • Are planes supposed to be treated as immutable? – user2357112 Apr 25 '18 at 17:34
  • 1
    Can you use a small database? – Grant Williams Apr 25 '18 at 17:36
  • `Plane` doesn't appear to need to be a class, as you don't have any methods. Representing each `Plane` as a simple tuple may be a better idea. – chepner Apr 25 '18 at 17:37
  • I would suggest using Python's [shelve](https://docs.python.org/3/library/shelve.html#module-shelve) module which provides an dictionary-like interface to a file-based database—so doesn't require it all to be in memory. – martineau Apr 25 '18 at 17:45
  • 1
    Why are you storing a list of single-element `dict` objects for the `properties` attribute??? That is perhaps the least useful data structure you could possible come up with. Just use a single `dict`, or better yet, make them *individual attributes* on your object (which pretty much is a dict anyway...) – juanpa.arrivillaga Apr 25 '18 at 17:47
  • @juanpa.arrivillaga Sure, this structure is far from optimal. But I guess this would not change much – Merlin1896 Apr 25 '18 at 18:01
  • @GrantWilliams A database could work if the lookup is fast. This is very critical in my application. Ideally, it should be as fast as having the whole dict in memory – Merlin1896 Apr 25 '18 at 18:02
  • @user2357112 Immutable would be fine, I guess. – Merlin1896 Apr 25 '18 at 18:17

1 Answers1

1

Did you think about having another dictionary with idx -> plane? then in self.plane_dict[plane_uuid] you would just store idx instead of object. this will save memory and speed up your app, though you'd need to modify the lookup method.

nefo_x
  • 3,050
  • 4
  • 27
  • 40
  • My example was bad here, i guess. This is not possible, since there might be planes which were not generated by an index but by the class constructor. The indices are just a helper function. – Merlin1896 Apr 25 '18 at 17:59
  • 1
    then try to have a registry of all planes generates and store new indexes by instance. or do make use of `__hash__` function and store planes in the registry only if the hash is not present and store the hash in `plane_dict`. – nefo_x Apr 26 '18 at 09:05