
Paging @mike-mckerns I suppose, but I'd be grateful for answers from anyone with experience with the Klepto module for Python (https://pypi.org/project/klepto/).

My situation is that I'm running a simulation that generates and logs several tens of thousands of objects, each a combination of strings and numerical values. (By which I mean that these objects cannot be trivially abstracted into e.g. a numpy array.)

Due to the sheer number of objects generated, I ran into memory issues. My solution so far has been to use pickle to individually dump each instance to its own pickle file.
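
Roughly, that per-file approach looks like this (a simplified sketch; the paths and helper names are just illustrative):

import os
import pickle

os.makedirs('data/objects', exist_ok=True)

def dump_object(obj_id, obj):
    # one pickle file per object, named by its ID
    with open(f'data/objects/{obj_id}.pkl', 'wb') as f:
        pickle.dump(obj, f)

def load_object(obj_id):
    with open(f'data/objects/{obj_id}.pkl', 'rb') as f:
        return pickle.load(f)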

The problem is that this leaves my simulation data spread across a good 30k individual files (each roughly 5 kB). This is cumbersome when moving or sharing data from past simulations: the total size is manageable, but the sheer number of individual files has been a problem.

This is how I ended up at Klepto as a possible solution. I thought the file_archive function would let me use a single file as my 'external' dictionary instead of giving every instance its own pickle file.

I don't yet understand much of the module, so I tried to implement it as simply as possible. My code basically works as follows:

import os

from klepto.archives import file_archive

os.makedirs('data', exist_ok=True)  # make sure the target directory exists

# a single-file archive on disk; cached=False means reads/writes go straight to the file
ExternalObjectDictionary = file_archive('data/EOD.pkl', {}, serialized=True, cached=False)
ObjectCounter = 0

class SimObject:
    
    def __init__(self):
        self.name = 'name'
        self.data1 = 100
        self.data2 = ['pear', 'apple', 'banana']
        #(Above values would be passed as arguments by the simulation)
        
        global ObjectCounter  # module-level counter, shared across all instances
        
        ObjectCounter += 1
        self.ID = ObjectCounter
        # store the full payload in the on-disk archive, keyed by this object's ID
        ExternalObjectDictionary[self.ID] = ObjectData(self.name, self.data1, self.data2)

        # immediately drop the in-memory copies to keep RAM usage low
        self.clear_data()
        
    def load_data(self):
        # pull this object's payload back out of the archive
        ObjData = ExternalObjectDictionary[self.ID]
        self.name = ObjData.name
        self.data1 = ObjData.data1
        self.data2 = ObjData.data2

    def clear_data(self):
        # release the in-memory copies of the data
        self.name = None
        self.data1 = None
        self.data2 = None
        
class ObjectData:
    
    def __init__(self, name, data1, data2):
        self.name = name
        self.data1 = data1
        self.data2 = data2
        
# The simulation would use an object in a sequence as follows:
Obj1 = SimObject()

Obj1.load_data()   # pull the data back from the archive when needed
print(Obj1.name)
Obj1.clear_data()  # release it again afterwards

When an object is no longer needed, I destroy its entry simply with `del ExternalObjectDictionary[x]`.

By itself, the implementation seems to work fine, except that it ends up roughly 10x to 20x slower than simply calling pickle.dump() and pickle.load() on individual pickle files.

Am I using Klepto wrong, or is dumping/loading from a single file simply inherently this much slower than using individual files? I looked at a number of options, and Klepto seemed to offer the most straightforward read-from-file dictionary functionality for my needs, but perhaps I misunderstood how to use it?

Apologies if my code example is simplified; I hope I've explained the issue clearly enough for someone to clear things up! If need be, I can keep using my current solution of tens of thousands of individual pickle files, but an alternative approach would be great!

  • `klepto` should be a bit slower, as it's doing more. However, your comparison is a bit imbalanced. You might try to use the same pickle file for all your objects... so just keep dumping to the same file, and then remember the order when you load sequentially. The issue is in part because you are using a single file. You might want to try a `klepto` `dir_archive` (a directory of files viewed as a dictionary). The `cached` keyword also makes a difference, as `cached=True` works with an in-memory copy, until you `dump` the archive to disk -- while `cached=False` works directly with the file. – Mike McKerns Aug 26 '21 at 11:04
  • Hi @MikeMcKerns thanks for the response! I'll look closer at the solutions you suggest, but at first blush it seems like they may not be appropriate for me? What I'm trying to do is have a single-file dictionary WITHOUT having to ever load the complete dictionary into memory (as the full dictionary would amount to ~500MB across 10k entries). Is there a way to do that in Klepto, or is my demand just unreasonable? Thanks again! – nadafan boy Aug 26 '21 at 13:20
  • That is what the `cached=True` keyword is for. You have the in-memory copy which can dump/load individual keys from the file. – Mike McKerns Aug 27 '21 at 11:05
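
If I'm reading these suggestions right, combining a `dir_archive` with `cached=True` would look something like the sketch below (untested on my end; it reuses the ObjectData class from above, and my reading that dump()/load() accept individual keys comes from the comments):

from klepto.archives import dir_archive

# a directory with one file per key, viewed as a dictionary;
# cached=True keeps an in-memory cache on top of the on-disk archive
archive = dir_archive('data/EOD', {}, serialized=True, cached=True)

archive[1] = ObjectData('name', 100, ['pear', 'apple', 'banana'])
archive.dump(1)    # write only key 1 through to disk

# drop the cached copy to free memory; the file on disk should remain
del archive[1]

archive.load(1)    # read only key 1 back into the in-memory cache
print(archive[1].name)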

0 Answers