How to lazy load a data structure (python)

Question

I have some way of building a data structure (out of some file contents, say):

def loadfile(FILE):
    return # some data structure created from the contents of FILE

So I can do things like

puppies = loadfile("puppies.csv") # wait for loadfile to work
kitties = loadfile("kitties.csv") # wait some more
print len(puppies)
print puppies[32]

In the above example, I wasted a bunch of time actually reading kitties.csv and creating a data structure that I never used. I'd like to avoid that waste without constantly checking if not kitties whenever I want to do something. I'd like to be able to do

puppies = lazyload("puppies.csv") # instant
kitties = lazyload("kitties.csv") # instant
print len(puppies)                # wait for loadfile
print puppies[32]

So if I don't ever try to do anything with kitties, loadfile("kitties.csv") never gets called.

Is there some standard way to do this?

After playing around with it for a bit, I produced the following solution, which appears to work correctly and is quite brief. Are there some alternatives? Are there drawbacks to using this approach that I should keep in mind?

class lazyload:
    def __init__(self,FILE):
        self.FILE = FILE
        self.F = None
    def __getattr__(self,name):
        if not self.F: 
            print "loading %s" % self.FILE
            self.F = loadfile(self.FILE)
        return object.__getattribute__(self.F, name)

What might be even better is if something like this worked:

class lazyload:
    def __init__(self,FILE):
        self.FILE = FILE
    def __getattr__(self,name):
        self = loadfile(self.FILE) # this never gets called again
                                   # since self is no longer a
                                   # lazyload instance
        return object.__getattribute__(self, name)

But this doesn't work because self is local. It actually ends up calling loadfile every time you do anything.

@ephemient: it looks like a call to `__nonzero__` will go through `__getattr__`, so what's the problem? — Anton Geraschenko, Dec 30 '10 at 00:03
I guess it is a bit better to say `if self.F is not None` rather than `if not self.F` since `loadfile(FILE)` *might* be 0 or the empty string. — Anton Geraschenko, Dec 30 '10 at 00:03

Lennart Regebro · Answer 1 · 2010-12-30T06:02:48.103

6

The csv module in the Python stdlibrary will not load the data until you start iterating over it, so it is in fact lazy.

Edit: If you need to read through the whole file to build the datastructure, having a complex Lazy load object that proxies things is overkill. Just do this:

class Lazywrapper(object):
    def __init__(self, filename):
        self.filename = filename
        self._data = None

    def get_data(self):
        if self._data = None:
            self._build_data()
        return self._data

    def _build_data(self):
        # Now open and iterate over the file to build a datastructure, and
        # put that datastructure as self._data

With the above class you can do this:

puppies = Lazywrapper("puppies.csv") # Instant
kitties = Lazywrapper("kitties.csv") # Instant

print len(puppies.getdata()) # Wait
print puppies.getdata()[32] # instant

Also

allkitties = kitties.get_data() # wait
print len(allkitties)
print kitties[32]

If you have a lot of data, and you don't really need to load all the data you could also implement something like class that will read the file until it finds the doggie called "Froufrou" and then stop, but at that point it's likely better to stick the data in a database once and for all and access it from there.

edited Dec 30 '10 at 06:02

answered Dec 29 '10 at 23:35

Lennart Regebro

167,292
41
224
251

I'm not actually dealing with `csv` files; it was just for purposes of illustration. But this is good to know. Also, `loadfile` will have to iterate over the data if it is to build some interesting nonstandard data structure out of it, so laziness of `csv` may not be enough. – Anton Geraschenko Dec 29 '10 at 23:41
If I understand correctly, your `self._data` is exactly analogous to my `self.F` and your `_build_data` is exactly analogous to my `loadfile`. The nice thing about using `__getattr__` is that you don't have to write `.get_data()` everywhere; you can interact with the wrapper *exactly* as you would with the data structure directly. – Anton Geraschenko Dec 30 '10 at 17:37
Yeah, but then you need more then `__getattr__`, really, you need to write a complete proxy, and it's not worth it. Simple = good. And you don't have to write get_data() everywhere. You just write data = thewrapper.get_data(), so you only write it once for every function where you need to access the data. – Lennart Regebro Dec 30 '10 at 18:51
Wouldn't writing `data = thewrapper.get_data()` defeat the purpose of lazy loading? Why is `__getattr__` not enough in principle? – Anton Geraschenko Jan 01 '11 at 00:24
No, it wouldn't defeat the purpose. It would load the data, so would calling __gettattr__. In both cases all the data gets loaded the first time you want any of it. – Lennart Regebro Jan 01 '11 at 07:16

S.Lott · Answer 2 · 2011-01-01T00:59:11.530

2

If you're really worried about the if statement, you have a Stateful object.

from collections import MutableMapping
class LazyLoad( MutableMapping ):
   def __init__( self, source ):
       self.source= source
       self.process= LoadMe( self )
       self.data= None
   def __getitem__( self, key ):
       self.process= self.process.load()
       return self.data[key]
   def __setitem__( self, key, value ):
       self.process= self.process.load()
       self.data[key]= value
   def __contains__( self, key ):
       self.process= self.process.load()
       return key in self.data

This class delegates the work to a process object which is either a Load or a DoneLoading object. The Load object will actually load. The DoneLoading will not load.

Note that there are no if-statements.

class LoadMe( object ):
   def __init__( self, parent ):
       self.parent= parent
   def load( self ):
       ## Actually load, setting self.parent.data
       return DoneLoading( self.parent )

class DoneLoading( object ):
   def __init__( self, parent ):
       self.parent= parent
   def load( self ):
       return self

edited Jan 01 '11 at 00:59

answered Dec 30 '10 at 00:41

S.Lott

384,516
81
508
779

I don't understand this answer (yet?). It seems like once the object is actually loaded, the only available methods are `get` and `put`. But there are lots of other ways I might want to interact with the data structure. For example, `puppies[34]` and `len(puppies)` would each throw an `AttributeError` using this method, right? – Anton Geraschenko Dec 30 '10 at 01:14
@Anton Geraschenko: "But there are lots of other ways I might want to interact with the data structure." Good. Write more methods. I could only think of two. You can think of more. Write more. – S.Lott Dec 30 '10 at 03:21
Thanks for clarifying. Now I understand the answer, but it's strictly worse than the solution I proposed in the question, which makes *every* method available for *any* imaginable data structure. `__getattr__` is a catch-all method that makes long lists of near identical method definitions unnecessary. – Anton Geraschenko Dec 30 '10 at 17:18
@Anton Geraschenko: "but it's strictly worse than the solution I proposed". Are you unable to merge the two? I would think that you could use a **State** design to eliminate the if-statement, coupled with your `__getattr__`-based design. I'm quite sure that answers are not all-or-nothing propositions, but more like suggestions or hints. – S.Lott Dec 30 '10 at 19:53
Ah, I think I understand better now, though I still don't understand why you want to eliminate the `if`. I guess it's just in case the data structure is `None`, or to save the time of testing if something is `None` (?). The reason I thought the (not working) code in the question would be "even better" is that after the data structure is loaded, the object would be *completely indistinguishable* from an object which was loaded directly, down to the point that it would no longer be an instance of the wrapper class. – Anton Geraschenko Dec 31 '10 at 01:29
@Anton Geraschenko: this object would be completely indistinguishable. The "wrapper" class is the relevant object. No change. The only thing that changes is an internal state that is invisible to the API. – S.Lott Dec 31 '10 at 01:42
The object would be *largely* indistinguishable, but not completely. The wrapper is an instance of `LazyLoad`, but if loaded directly, it would be an instance of `dict` (or whatever). The wrapper has `source` and `process` attributes, which the data structure does not. This is normally irrelevant, but note in the [solution I gave](http://stackoverflow.com/questions/4558763#4559109), after the wrapper is called once, it actually *ceases to be a wrapper*. It becomes the data structure directly, in the same way that your `process` stops being a `LoadMe` and becomes a `DoneLoading`. – Anton Geraschenko Dec 31 '10 at 04:16
Of course that solution also requires me to pass the variable name as an argument and implicitly assumes it's global. It would be nicer if there were a way to directly reassign `self` within a method. – Anton Geraschenko Dec 31 '10 at 04:19
@Anton Geraschenko: "requires me to pass the variable name as an argument"? How does `__getattr__` require that? I thought that `__getattr__` provided access to the variables without using the name as an argument to a method. What am I missing? – S.Lott Dec 31 '10 at 14:19
@Anton Geraschenko: "The wrapper is an instance of LazyLoad, but if loaded directly, it would be an instance of dict (or whatever)." False and False. First, `LazyLoad` can be a subclass of `dict`. Second, a "direct load" doesn't make sense. Once you have a lazy loader, you'd never use anything else. Also, if you want your "lazy loader" to be the application object, that's trivial. Lazy Loading is just a feature of the application object. Since you provided zero guidance on what else you want this to do, I cannot foresee all the application features you want. Add them yourself please. – S.Lott Dec 31 '10 at 14:21
It seems like we're having serious trouble communicating. I'm either not getting through to you or you don't see why anybody would want a lazy loader class that agnostic about what data structure it's loading and is completely faithful (in the sense that it behaves exactly like the data structure it's loading, but has *no features of its own*). Sorry for wasting your (and my) time. – Anton Geraschenko Jan 01 '11 at 00:22
@Anton Geraschenko: "agnostic about what data structure it's loading" is absurd. You cannot be "agnostic" about the class which is being loaded. Unless you have something else in mind that you're not putting in the question. **Update the Question** to contain **all** of your requirements. – S.Lott Jan 01 '11 at 01:00

randomhuman · Answer 3 · 2013-02-02T14:13:22.703

Here's a solution that uses a class decorator to defer initialisation until the first time an object is used:

def lazyload(cls):
    original_init = cls.__init__
    original_getattribute = cls.__getattribute__

    def newinit(self, *args, **kwargs):
        # Just cache the arguments for the eventual initialization.
        self._init_args = args
        self._init_kwargs = kwargs
        self.initialized = False
    newinit.__doc__ = original_init.__doc__

    def performinit(self):
        # We call object's __getattribute__ rather than super(...).__getattribute__
        # or original_getattribute so that no custom __getattribute__ implementations
        # can interfere with what we are doing.
        original_init(self,
                      *object.__getattribute__(self, "_init_args"),
                      **object.__getattribute__(self, "_init_kwargs"))
        del self._init_args
        del self._init_kwargs
        self.initialized = True

    def newgetattribute(self, name):
        if not object.__getattribute__(self, "initialized"):
            performinit(self)
        return original_getattribute(self, name)

    if hasattr(cls, "__getitem__"):
        original_getitem = cls.__getitem__
        def newgetitem(self, key):
            if not object.__getattribute__(self, "initialized"):
                performinit(self)
            return original_getitem(self, key)
        newgetitem.__doc__ = original_getitem.__doc__
        cls.__getitem__ = newgetitem

    if hasattr(cls, "__len__"):
        original_len = cls.__len__
        def newlen(self):
            if not object.__getattribute__(self, "initialized"):
                performinit(self)
            return original_len(self)
        newlen.__doc__ = original_len.__doc__
        cls.__len__ = newlen

    cls.__init__ = newinit
    cls.__getattribute__ = newgetattribute
    return cls

@lazyload
class FileLoader(dict):
    def __init__(self, filename):
        self.filename = filename
        print "Performing expensive load operation"
        self[32] = "Felix"
        self[33] = "Eeek"

kittens = FileLoader("kitties.csv")
print "kittens is instance of FileLoader: %s" % isinstance(kittens, FileLoader) # Well obviously
print len(kittens) # Wait
print kittens[32] # No wait
print kittens[33] # No wait
print kittens.filename # Still no wait
print kittens.filename

The output:

kittens is instance of FileLoader: True
Performing expensive load operation
2
Felix
Eeek
kitties.csv
kitties.csv

I tried to actually restore the original magic methods after the initialization, but it wasn't working out. It may be necessary to proxy additional magic methods, I didn't investigate every scenario.

Note that kittens.initialized will always return True because it kicks off the initialization if it hasn't already been performed. Obviously it would be possible to add an exemption for this attribute so that it would return False if no other operation had been performed on the object, or the check could be changed to the equivalent of a hasattr call and the initialized attribute could be deleted after the initialization.

score 1 · Answer 4 · answered Dec 29 '10 at 23:47

1

Wouldn't if not self.F lead to another call to __getattr__, putting you into an infinite loop? I think your approach makes sense, but to be on the safe side, I'd make that line into:

if name == "F" and not self.F:

Also, you could make loadfile a method on the class, depending on what you're doing.

answered Dec 29 '10 at 23:47

Thomas K

39,200
7
84
86

2

`__getattr__` is only called if the method is not found (http://docs.python.org/py3k/reference/datamodel.html#object.__getattr__), so I don't think this is a problem. – Anton Geraschenko Dec 29 '10 at 23:59
1

@Anton: Thanks, I didn't know that before. I'll leave this answer so that other people can see that info. – Thomas K Dec 30 '10 at 00:09

score 0 · Answer 5 · answered Dec 30 '10 at 00:36

Here's a hack that makes the "even better" solution work, but I think it's annoying enough that it's probably better to just use the first solution. The idea is to execute the step self = loadfile(self.FILE) by passing the the variable name as an attribute:

class lazyload:
    def __init__(self,FILE,var):
        self.FILE = FILE
        self.var  = var
    def __getattr__(self,name):
        x = loadfile(self.FILE)
        globals()[self.var]=x
        return object.__getattribute__(x, name)

Then you can do

kitties = lazyload("kitties.csv","kitties")
   ^                                 ^
    \                               /
     These two better match exactly

After you call any method on kitties (aside from kitties.FILE or kitties.var), it will become completely indistinguishable from what you'd have gotten with kitties = loadfile("kitties.csv"). In particular, it will no longer be an instance of lazyload and kitties.FILE and kitties.var will no longer exist.

I know you were probably just throwing around ideas, but this kind of thing really should be discouraged. Aside from teh badness that you pointed out yourself, this will only work as long as the object only needs to be referred to by a single global name. Assign it to another name and work on that, and kitties gets reloaded and replaced after every attribute access. Pass it to a function, and every use of the object in the function will cause it to be reloaded. — randomhuman, Feb 02 '13 at 14:28

seriyPS · Answer 6 · 2010-12-30T00:43:42.603

If you need use puppies[32] you need also define __getitem__ method because __getattr__ don't catch that behaviour.

I implement lazy load for my needs, there is non-adapted code:

class lazy_mask(object):
  '''Fake object, which is substituted in
  place of masked object'''

  def __init__(self, master, id):
    self.master=master
    self.id=id
    self._result=None
    self.master.add(self)

  def _res(self):
    '''Run lazy job'''
    if not self._result:
      self._result=self.master.get(self.id)
    return self._result

  def __getattribute__(self, name):
    '''proxy all queries to masked object'''
    name=name.replace('_lazy_mask', '')
    #print 'attr', name
    if name in ['_result', '_res', 'master', 'id']:#don't proxy requests for own properties
      return super(lazy_mask, self).__getattribute__(name)
    else:#but proxy requests for masked object
      return self._res().__getattribute__(name)

  def __getitem__(self, key):
    '''provide object["key"] access. Else can raise 
    TypeError: 'lazy_mask' object is unsubscriptable'''
    return self._res().__getitem__(key)

(master is registry object that load data when i run it's get() method)

This implementation works ok for isinstance() and str() and json.dumps() with it

As far as I can tell, this statement is incorrect: *"If you need use `puppies[32]` you need also define `__getitem__` method because `__getattr__` don't catch that behaviour."* I have actually tested the code in the question and it *does* work. However, it is true that `__getattribute__` doesn't catch `__getitem__`. — Anton Geraschenko, Dec 30 '10 at 00:57

How to lazy load a data structure (python)

6 Answers6

Linked