
I have the following project structure:

   Package1
   |--__init__.py
   |--__main__.py
   |--Module1.py
   |--Module2.py

where Module1.py contains something like:

import dill as pickle
import Package1.Module2

# from https://stackoverflow.com/questions/52402783/pickle-class-definition-in-module-with-dill
def mainify(obj):
    import __main__
    import inspect
    import ast

    s = inspect.getsource(obj)
    m = ast.parse(s)
    co = compile(m, "<string>", "exec")
    exec(co, __main__.__dict__)

def Module1():
    """I hope the details of this class are not necessary for this example. I can add detail if necessary
    """

obj_to_pickle = Module1()

def write_session():
    mainify(Module1)
    mainify(Module2)
    with FileHandler.open_file(...) as f:
        pickle.dump(obj_to_pickle, f)

I run the code as a module via python -m Package1 ..., thus __main__.py is the entry point to package execution, though I hope these details aren't relevant (I can improve my example if necessary).

Now, when I try to load the pickled object, I get ModuleNotFoundError: No module named Package1.

How can I tell dill that Package1 is the package in this situation? The mainify function seems to be getting the modules' source code into the pickle, but I believe the statement import Package1.Module2 in Module1.py is causing the ImportError. How can I make dill resolve the reference to Package1?

NOTE: this reference can be fixed by adding the directory that contains Package1 to the path via sys.path.append. But the whole point of pickling the package source alongside the instance is to make the pickled instance loadable without needing to do this.

Relevant posts:

Pickle class definition in module with dill

Why dill dumps external classes by reference, no matter what?

courtyardz

1 Answer


@courtyardz. I'm a contributor of dill and your question is similar to others that have been asked in the past.

First, let me explain that generally dill assumes that all the modules necessary to deserialize an object are importable in the "unpickling" environment. Therefore modules are almost always saved by reference, with the current exception of modules that are not properly installed, like local modules (e.g. located in the working directory) or modules at non-canonical paths added to sys.path. There's also a function that's able to save the complete state of a module, which can be restored afterwards, but not the module itself.
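To illustrate what "saved by reference" means, here is a minimal sketch that uses the standard library's pickle, whose by-reference behavior dill follows for importable modules. The module name dill_demo_mod and the class Thing are made up for this example; the temporary module stands in for something like Package1.Module1. It reproduces the same kind of ModuleNotFoundError as in the question once the module is no longer importable.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical module standing in for Package1.Module1.
    Path(tmp, "dill_demo_mod.py").write_text("class Thing:\n    pass\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    demo_mod = importlib.import_module("dill_demo_mod")

    payload = pickle.dumps(demo_mod.Thing())
    # The pickle only names the module and the class; it carries no source code.
    has_reference = b"dill_demo_mod" in payload and b"Thing" in payload

    # Simulate an "unpickling" environment where the module is not importable.
    sys.path.remove(tmp)
    del sys.modules["dill_demo_mod"]
    importlib.invalidate_caches()
    try:
        pickle.loads(payload)
        error_name = None
    except ModuleNotFoundError as exc:
        error_name = type(exc).__name__

print(has_reference, error_name)
```

Loading fails because the unpickler tries to import dill_demo_mod by name, which is the situation described in the question.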

That said, what exactly do you need? Is it to serialize an object alongside its class (including any objects in the module's namespace that it refers to), or do you really need the whole module?

If you need to transfer the complete module to an interpreter session where it's not available, like in a different machine, this problem is under active discussion here: https://github.com/uqfoundation/dill/issues/123. There's no complete solution for this currently, but one possibility is to ship the module as a ZIP archive, and load it using the zipimport module (indirectly, by saving the zip file to disk, maybe in a temporary location, and adding its path to sys.path as described in Python's documentation).
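As a rough sketch of the ZIP approach, assuming the archive has already been transferred to the target machine: the package name shipped_pkg, its module module2 and the function greet are hypothetical placeholders, and the archive is built in a temporary directory here only to keep the example self-contained. Adding the archive's path to sys.path lets the regular import machinery (through zipimport) find the package.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp, "shipped_pkg.zip")

    # Build the archive; in practice it would contain the real package sources.
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("shipped_pkg/__init__.py", "")
        zf.writestr(
            "shipped_pkg/module2.py",
            "def greet():\n    return 'hello from the archive'\n",
        )

    # Putting the ZIP file itself on sys.path makes its contents importable.
    sys.path.insert(0, str(archive))
    importlib.invalidate_caches()
    module2 = importlib.import_module("shipped_pkg.module2")
    message = module2.greet()

    # Clean up the path entry; the imported module stays in sys.modules.
    sys.path.remove(str(archive))

print(message)
```

With the package importable this way, an object pickled by reference can be loaded as usual, at the cost of shipping two files instead of one.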

If you just need to serialize an object with its class, note that doing so has a limitation: objects of that class pickled by separate calls to dill.dump() or dill.dumps() will end up having different (although identical) classes when unpickled. This may or may not be a problem. There's also an open discussion about forcing the serialization of a class by value: https://github.com/uqfoundation/dill/issues/424.

The workaround you are trying to use should work because dill pickles classes defined in the __main__ module by value, as well as "orphaned" classes, i.e. classes that can't be found in the module where they were defined. However, for this to work the object must be created by the __main__.Module1 class (I suppose this is a class, even though you used def instead of class in your code example), not the Package1.Module1.Module1 class. If the class references global objects in Module1 in its methods, you may need to use the option recurse=True with dill.dump(s).

A simpler workaround, which may not work for your specific case as it involves multiple modules, is to temporarily change the __module__ attribute of the class. For example, in a module's body:

import dill

class X:
    pass

obj = X()

X.__module__ = None  # temporarily orphan the class
with open('/path/to/file.pkl', 'wb') as file:
    dill.dump(obj, file)  # X will be pickled by value because __module__ is None
X.__module__ = __name__  # de-orphan the class

Going back to your example, if you can't create the object with the "mainified" class, you may change the object's class temporarily too:

obj_to_pickle = Module1()

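def write_session():
    import __main__  # needed to reach the "mainified" class below

    mainify(Module1)
    mainify(Module2)
    # temporarily make the object an instance of the copy living in __main__
    obj_to_pickle.__class__ = __main__.Module1
    with FileHandler.open_file(...) as f:
        pickle.dump(obj_to_pickle, f)
    obj_to_pickle.__class__ = Module1  # restore the original class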

However, this won't work if the object has instance attributes whose types are defined in Package1.

leogama