
I have multiple Python modules that use the same input data, and the variables have the same names.

I created a module data_loading.py where the variables are instantiated. I then import the variables I need in the data_analysis_xx modules.

For example,

" Module data_analysis_1 "
from data_loading import var_1, var_2,…, var_k 

" Module data_analysis_2 "
from data_loading import var_1, var_3

In this way I avoid copy-and-pasting the same 200 lines of code into every module to load the same, or partially the same, set of data.

First question:

Is using a single source module for data loading the right approach? Is there a standard, or at least a better, way of importing the same variables into multiple modules?

Problem:

However, when I import data_loading, all the variables in it are loaded/processed even if I actually import only one or a few of them. This can be time consuming, especially because in data_loading I also do some basic data manipulation (check, split, cut, sort, etc.).

Second question:

How can I make the data_loading module work such that only the variables that really need to be loaded/processed are actually processed?

Possible solutions

  1. Split data_loading into multiple sub-modules --> this slightly reduces the problem but increases the number of files to load from: complexity, chaos, error prone. Not good.

  2. Create a class that deals with the data loading and load only the needed variables via the class? How do I do this practically, and how do I then import the variables? "from data_loading import Loader.var_1 as var_1 …, Loader.var_k as var_k"?

  3. Implement lazy loading --> many of my variables are classes that deal with the actual data loading (retrieving the data from a file), so lazy loading would help in reducing the total time cost. A minimal sketch of what I have in mind is shown below.
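
To make option 3 concrete, here is a minimal sketch of the kind of lazy loading I have in mind, assuming Python 3.7+ (module-level __getattr__, PEP 562); var_1, var_2 and their placeholder loaders are just illustrative:

# data_loading.py (sketch)
import functools

@functools.lru_cache(maxsize=None)
def _build(name):
    # Placeholder loaders: each branch would do the real reading/processing.
    if name == "var_1":
        return list(range(10))
    if name == "var_2":
        return {"key": "value"}
    raise AttributeError(f"module 'data_loading' has no attribute {name!r}")

def __getattr__(name):
    # Called only for names not defined at module level, e.g. on
    # "from data_loading import var_1"; lru_cache ensures each variable
    # is built at most once.
    return _build(name)

With this, "from data_loading import var_1" would trigger only the work needed for var_1.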

  • related: https://stackoverflow.com/questions/354883/how-do-i-return-multiple-values-from-a-function?rq=1 – Robyc Nov 03 '19 at 20:46

2 Answers


One way to deal with all your issues is to create a factory that creates singletons on demand.

With these two design patterns you can instantiate classes and create variables only when needed (factory), and you can reuse them across several modules without computing them again (singleton).

Edit 1

Here is a practical application with a pseudocode example.

In my script I need variables in order to compute something. These variables are: pi to a defined number of decimal places, Fibonacci numbers of several indices, and prime numbers.

I create a script with functions to compute each variable I need; each function has a cache that captures the computed value.

#compute_variables.py
import functools

@functools.lru_cache(maxsize=None)
def pi(n):
    # compute pi to n decimal places (expensive; cached after the first call)
    pass

@functools.lru_cache(maxsize=None)
def fibonaci(n):
    # compute the n-th Fibonacci number (expensive; cached after the first call)
    pass

@functools.lru_cache(maxsize=None)
def prime_number(n):
    # compute the n-th prime number (expensive; cached after the first call)
    pass

All of these functions are calculation intensive and should be computed only once per variable.

Now I need classes that return the desired variables, already computed, as singletons (only one object is returned every time).

#variables.py
from compute_variables import pi, fibonaci, prime_number

class Calculus1:

    _instance = None

    def __new__(cls, *args, **kwargs):
        # singleton logic: create the instance once, then always return it
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

    def var1(self):
        return pi(self.a)

    def var2(self):
        return fibonaci(self.b)

    def var3(self):
        return prime_number(self.c)

class Calculus2:

    _instance = None

    def __new__(cls, *args, **kwargs):
        # singleton logic: create the instance once, then always return it
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self, a, b, e):
        self.a = a
        self.b = b
        self.e = e

    def var1(self):
        return pi(self.a)

    def var2(self):
        return fibonaci(self.b)

    def var3(self):
        return prime_number(self.e)

I have the bare bones; now I need a factory that returns the singleton of the requested variables class.

# factory_meth.py
from variables import Calculus1, Calculus2

def factory(var_class_name):
    # Factory logic: look up the requested class by name and return its
    # (singleton) instance; x, y, z stand for whatever parameters the
    # chosen class actually needs.
    classes = {"Calculus1": Calculus1, "Calculus2": Calculus2}
    my_vars = classes[var_class_name](x, y, z)
    return my_vars

And finally I use all of this in my application.

from factory_meth import factory

def fancy_something_1(f, g):
    my_vars = factory("Calculus2")
    return (f * my_vars.var1() / (my_vars.var2() + my_vars.var3())) ** g

def fancy_something_2(z, h):
    my_vars = factory("Calculus2")
    return z + my_vars.var1() + my_vars.var2() + my_vars.var3() + h

With this logic, every variable is created and computed only when needed (factory), and even if a variable is requested again, there is no recomputation because the same object is returned (singleton).
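
As a quick check of the reuse that the singleton provides (using the sketch classes above; the arguments 1, 2, 3 are arbitrary placeholders):

from variables import Calculus2

first = Calculus2(1, 2, 3)
second = Calculus2(1, 2, 3)
# The singleton logic means both calls hand back the very same object,
# so nothing is recomputed for the second request.
assert first is second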

Note

The above architecture is one way to achieve lazy loading and on-demand calls, but the design may be adjusted to fit your needs.

Florian Bernard
  • what do you mean in practice? In Solution 2 I suggest to use a class. Can this be the factory you mention? – Robyc Oct 31 '19 at 11:54
  • Thank you. The indentation in Calculus2 is screwed, right? I don't think that's on purpose – Robyc Nov 02 '19 at 16:01

You could simply move the actions associated with each dataset inside of a function, and then import that function. This could be a single function that takes a parameter to select the dataset, or multiple functions, perhaps one function per dataset.

The one function per dataset case might look like:

def load_val_1():
    val_1 = ...  # Create and preprocess val_1 here
    ...
    return val_1

Then you would from data_loading import load_val_1 and create val_1 by calling the function you imported. The functions could even be static methods or class methods of a class, in which case you would only need to import that class.
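
For completeness, the class-based variant mentioned above might look like this (DataLoading and load_val_2 are just illustrative names):

# data_loading.py
class DataLoading:

    @staticmethod
    def load_val_1():
        val_1 = ...  # create and preprocess val_1 here
        return val_1

    @staticmethod
    def load_val_2():
        val_2 = ...  # create and preprocess val_2 here
        return val_2

Then from data_loading import DataLoading and call DataLoading.load_val_1() wherever val_1 is needed.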

user10186512