Paging @mike-mckerns, I suppose, but I'd be grateful for answers from anyone with experience with the Klepto module for Python (https://pypi.org/project/klepto/).
My situation is that I'm running a simulation which involves generating and logging several tens of thousands of objects, each a combination of strings and numerical values. (By which I mean these objects can't be trivially abstracted into, e.g., a numpy array.)
Due to the sheer number of objects generated, I ran into memory issues. My solution so far has been to use pickle to individually dump each instance to its own pickle file.
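For reference, that approach looks roughly like this (the path and the object shape here are simplified stand-ins for my real code):

```python
import pickle

# Simplified stand-in for one of my simulation objects.
obj = {'name': 'name', 'data1': 100, 'data2': ['pear', 'apple', 'banana']}

# One pickle file per object, keyed by the object's ID.
with open('data/object_42.pkl', 'wb') as f:
    pickle.dump(obj, f)

# ...and later, loaded back on demand:
with open('data/object_42.pkl', 'rb') as f:
    obj = pickle.load(f)
```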
The problem is that this leaves my simulation data spread across a good 30k individual files (of roughly 5 kB each). This is cumbersome when trying to move or share the data from past simulations; the total size is manageable, but the number of individual files has been a problem.
This is why I ended up with Klepto as a possible solution. I thought the file_archive function would let me use a single file as my 'external' dictionary instead of giving every instance its own pickle file.
I don't understand the module very well yet, so I tried to implement it as simply as possible. My code basically works as follows:
```python
from klepto.archives import file_archive

# Single-file backing store; cached=False means every access goes
# straight to the file rather than through an in-memory cache.
ExternalObjectDictionary = file_archive('data/EOD.pkl', {}, serialized=True, cached=False)
ObjectCounter = 0

class SimObject:
    def __init__(self):
        # (These values would be passed as arguments by the simulation.)
        self.name = 'name'
        self.data1 = 100
        self.data2 = ['pear', 'apple', 'banana']
        global ObjectCounter
        ObjectCounter += 1
        self.ID = ObjectCounter
        # Store the heavy data externally, keep only the ID in memory.
        ExternalObjectDictionary[self.ID] = ObjectData(self.name, self.data1, self.data2)
        self.clear_data()

    def load_data(self):
        # Pull the data back in from the external archive.
        ObjData = ExternalObjectDictionary[self.ID]
        self.name = ObjData.name
        self.data1 = ObjData.data1
        self.data2 = ObjData.data2

    def clear_data(self):
        self.name = None
        self.data1 = None
        self.data2 = None

class ObjectData:
    def __init__(self, name, data1, data2):
        self.name = name
        self.data1 = data1
        self.data2 = data2

# The simulation would access the data in a sequence like this:
Obj1 = SimObject()
Obj1.load_data()
print(Obj1.name)
Obj1.clear_data()
```
When an object is no longer needed, I destroy it simply with `del ExternalObjectDictionary[x]`.
By itself, the implementation seems to work fine, EXCEPT that it ends up being something like a factor of 10x or 20x slower than when I simply used `pickle.dump()` and `pickle.load()` on individual pickle files.
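One thing I've been wondering: since I pass cached=False, I believe every read and write goes straight to the file, and as far as I can tell file_archive re-pickles the entire dictionary on every write, which would explain the slowdown. Is something like the following, using the default in-memory cache with an explicit sync, closer to the intended usage? (Untested sketch on my part, based on the dump()/load() methods I found in the docs; the variable names are just illustrative.)

```python
from klepto.archives import file_archive

# cached=True (the default) keeps entries in an in-memory dict
# backed by the file; nothing touches disk until dump().
archive = file_archive('data/EOD.pkl', {}, serialized=True, cached=True)

archive[1] = ('name', 100, ['pear', 'apple', 'banana'])
archive[2] = ('name2', 200, ['kiwi'])

# Push all cached entries to the single backing file in one go...
archive.dump()

# ...and pull them back into the in-memory cache later.
archive.load()
print(archive[1])
```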
Am I using Klepto wrong, or is dumping to and loading from a single file inherently going to be this much slower than using individual files? I looked at a number of options, and Klepto seemed to offer the most straightforward read-from-file dictionary functionality for my needs, but perhaps I've misunderstood how to use it?
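For what it's worth, the other Klepto option I glanced at was sql_archive, which (if I understand correctly) backs the dictionary with a single SQLite database file, so one write shouldn't have to rewrite every other entry. Something like this, again just my own sketch from the docs (I gather it needs sqlalchemy installed, and I'm guessing at the connection-string format):

```python
from klepto.archives import sql_archive

# One SQLite file as the backing store instead of one pickle per object.
archive = sql_archive('sqlite:///data/EOD.db', serialized=True, cached=False)

archive['1'] = ('name', 100, ['pear', 'apple', 'banana'])
print(archive['1'])
del archive['1']  # drop an entry once the object is no longer needed
```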
Apologies if my code examples are simplified; I hope I've explained the issue clearly enough for someone to clear things up! If need be, I can continue using my current solution of tens of thousands of individual pickle files, but an alternative method would be great!