2

I am using Python 3 and have heavy variables in memory. I would like to seamlessly write them to disk and load them only when I actually need them, without having to pickle them and read them from disk explicitly every time. Is that at all possible, and if so, how?

I tried RDFlip, but it does not seem to work for this: it is a store that you need to access explicitly, and I am trying to make this seamless.

thebeancounter

3 Answers

1

You might want to look into an Object Relational Mapping (ORM) library, which lets you store objects in a database and retrieve them with Python method/function calls rather than SQL statements. SQLAlchemy is one of the most popular ORMs for Python and has extensive documentation and community support available online. You would have to do the "explicit" work you are talking about only once, when defining your database tables and configuring DB connectivity for SQLAlchemy. After that, you could use a single method call to write your variables to disk (in the DB) and another to retrieve them. A database can also hold arbitrary binary data, so anything you can serialize to bytes can be stored.
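To make the "one call to write, one call to read" idea concrete, here is a minimal sketch. Note that it swaps in the standard-library sqlite3 module plus pickle instead of SQLAlchemy, so it illustrates the database-backed store rather than the ORM itself. The table name and the helpers save_variable and load_variable are hypothetical names chosen for this example.

```python
import pickle
import sqlite3

# Assumption: an in-memory database keeps the sketch self-contained;
# pass a file path to sqlite3.connect() to actually persist to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS variables (name TEXT PRIMARY KEY, value BLOB)"
)

def save_variable(name, value):
    """Serialize value and insert or overwrite the row for name."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO variables (name, value) VALUES (?, ?)",
            (name, pickle.dumps(value)),
        )

def load_variable(name):
    """Load and deserialize the value stored under name."""
    row = conn.execute(
        "SELECT value FROM variables WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    return pickle.loads(row[0])

save_variable("weights", [0.1, 0.2, 0.3])
print(load_variable("weights"))
```

The setup (connection and table) happens once; afterwards each variable costs one call to store and one to load, which is the trade-off described above.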

J. Taylor
  • I need it to work for any variable, and I need the storage to be seamless: just telling Python, or some object, that in this part of the code all the variables are on disk. Something more similar to dask's on-disk shuffle when it comes to dataframes. – thebeancounter Feb 10 '19 at 08:12
1

Have you tried HDF5? I think it may be what you are looking for.

JAbr
0

That's the thing, I need it to be flexible just like a regular Python variable, i=4 and that's that

It seems you want the variable i to be stored on disk instead of in memory, with a way of storing it that is as flexible as the i=4 syntax, and you want this to work for variables of any data type.

Note that the assignment operator (=) always binds a name to an object in memory, so you need a different approach, such as calling a method on an object whose logic stores the value on disk. For example, vardisk.set('i', 4) stores the variable, and vardisk.get('i') reads it back.

You can do that by defining a class first:

#@title VariableOnDisk
import pickle
import os

class VariableOnDisk():
  '''
  Save and load variable on disk.
  '''

  def __init__(self, storage_path="./var_disk/"):
    try:
      os.mkdir(storage_path)
    except FileExistsError:  # Catch only the "already exists" case, not every error
      print('Storage path already exists, here are the available variables:', os.listdir(storage_path))

    # We only need the storage path
    self.storage_path = storage_path
  
  def set(self, variable_name, value):
    with open(os.path.join(self.storage_path, variable_name), 'wb') as f:
      pickle.dump(value, f)
  
  def get(self, variable_name):
    if os.path.exists(os.path.join(self.storage_path, variable_name)):
      with open(os.path.join(self.storage_path, variable_name), 'rb') as f:
        return pickle.load(f)
    else:
      raise NameError(f"name '{variable_name}' is not defined") # Same error when you try access variable that never defined.

I'm using pickle to store any picklable object in a file and to load it back.

And this is an example of how you can use that class:
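One caveat worth checking before relying on this: pickle does not handle every Python object. The sketch below, using a hypothetical helper named is_picklable, shows a quick way to test whether a value could be stored this way; plain containers work, while lambdas and generators do not.

```python
import pickle

def is_picklable(value):
    """Return True if pickle can serialize value, False otherwise."""
    try:
        pickle.dumps(value)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

# Plain data structures are fine.
print(is_picklable({"weights": [1.0, 2.0], "epochs": 3}))

# Lambdas and generators cannot be pickled.
print(is_picklable(lambda x: x + 1))
print(is_picklable(x for x in range(3)))
```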

# Create instance of VariableOnDisk
vardisk = VariableOnDisk(storage_path='./var_disk/')

# Store variable 'i' on disk
vardisk.set('i', 4)

# Read variable 'i' back from disk
print(vardisk.get('i'), type(vardisk.get('i')))

Output:

4 <class 'int'>

That's it. The code above is equivalent to this:

i = 4
print(i, type(i))

Here is a more advanced class that adds a caching mechanism:

class VariableOnDisk():
  '''
  Save and load variable on disk.
  '''

  def __init__(self, storage_path='./var_disk/'):
    # Names containing '___' bypass the disk logic in __setattr__
    # and are stored as ordinary in-memory attributes
    self.___storage_path = storage_path
    self.___cached_value = None
    self.___cached_varname = None

    try:
      os.mkdir(storage_path)
    except FileExistsError:  # Catch only the "already exists" case
      print('Storage path already exists, here are available variables:', self)

  def __repr__(self):
    return str(set(os.listdir(self.___storage_path)))

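  def __setattr__(self, varname, value):
    if '___' in varname:  # Internal attribute: call the superclass's __setattr__
      super().__setattr__(varname, value)
    else:
      # Skip the write only if the same name already holds an equal value;
      # comparing the value alone would silently drop writes to other names
      if self.___cached_varname == varname and self.___cached_value == value:
        print('Write was cached, skipped!')
        return
      with open(os.path.join(self.___storage_path, varname), 'wb') as f:
        pickle.dump(value, f)
      # Remember the last written name and value
      self.___cached_value = value
      self.___cached_varname = varname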

  def __getattr__(self, varname):
    variable_path = os.path.join(self.___storage_path, varname)
    if os.path.exists(variable_path):
      if self.___cached_varname == varname:
        print('Read was cached, using cached value!')
        return self.___cached_value
      else:
        self.___cached_varname = varname
        with open(variable_path, 'rb') as f:
          self.___cached_value = pickle.load(f)
          return self.___cached_value
    else:
      raise NameError(f"Variable on disk with name '{varname}' is not defined.") # Same error when you try to access a variable that was never defined.

Usage:

# Create instance of VariableOnDisk
vardisk = VariableOnDisk(storage_path='./var_disk/')

# Store variable 'i' on disk
vardisk.i = 4

# Same value as the last write, so this write is skipped
vardisk.i = 4

# Read variable 'i'
print(vardisk.i)

# A repeated read of 'i' is served from the cache
print(type(vardisk.i))

# Show the available variable names
print(vardisk)

I overloaded attribute assignment (vardisk.i = 4) with __setattr__ and attribute access with __getattr__.
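One thing to watch with this attribute-based approach: only assignment goes through __setattr__, so mutating a loaded object in place is not written back to disk. The sketch below uses a hypothetical, cache-free class named DiskVars (not the class above) to show the effect and the reassignment workaround. With a cache, the in-memory copy may show the change while the file on disk does not.

```python
import os
import pickle
import tempfile

class DiskVars:
    """Hypothetical minimal attribute-to-file store, for illustration only."""

    def __init__(self, storage_path):
        # Bypass our own __setattr__ for the one real in-memory attribute.
        object.__setattr__(self, "_path", storage_path)

    def __setattr__(self, name, value):
        with open(os.path.join(self._path, name), "wb") as f:
            pickle.dump(value, f)

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for on-disk names.
        try:
            with open(os.path.join(self._path, name), "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            raise AttributeError(name) from None

store = DiskVars(tempfile.mkdtemp())
store.items = [1, 2]

store.items.append(3)  # mutates a temporary copy loaded from disk
print(store.items)     # the append was not persisted

items = store.items
items.append(3)
store.items = items    # reassigning writes the change back
print(store.items)
```

So for lists, dicts, and other mutable objects, load the value, modify it, and assign it back.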