I know this has been covered by a number of other questions (Unable to load files using pickle and multipile modules) but I can't see how their solutions apply to my situation.

This is my project structure (as minimal as possible):

classify-updater/
├── main.py
└── updater
    ├── __init__.py
    └── updater.py
classify
└── main.py

In classify-updater/main.py:

import sys
from sklearn.feature_extraction.text import CountVectorizer
from updater.updater import Updater

def main(argv):
    vectorizer = CountVectorizer(stop_words='english')
    updater = Updater(vectorizer)
    updater.update()

if __name__ == "__main__":
    main(sys.argv)

In classify-updater/updater/updater.py:

import dill

class Updater:

    def __init__(self, vectorizer):
        vectorizer.preprocessor = lambda doc: doc.text.encode('ascii', 'ignore')
        self.vectorizer = vectorizer

    def update(self):
        pickled_vectorizer = dill.dumps(self.vectorizer)
        # Save to Google Cloud Storage

In classify/main.py:

import dill
import sys

def main(argv):
    # Load from Google Cloud Storage
    vectorizer = dill.loads(vectorizer_blob)

if __name__ == "__main__":
    main(sys.argv)

This results in an ImportError.

Traceback (most recent call last):
  File "classify.py", line 102, in <module>
    app.main(sys.argv)
  File "classify.py", line 50, in main
    vectorizer = self.fetch_vectorizer()
  File "classify.py", line 86, in fetch_vectorizer
    vectorizer = dill.loads(vectorizer_blob.download_as_string())
  File "/usr/local/lib/python2.7/site-packages/dill/dill.py", line 299, in loads
    return load(file)
  File "/usr/local/lib/python2.7/site-packages/dill/dill.py", line 288, in load
    obj = pik.load()
  File "/usr/local/Cellar/python/2.7.13_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/local/Cellar/python/2.7.13_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1096, in load_global
    klass = self.find_class(module, name)
  File "/usr/local/lib/python2.7/site-packages/dill/dill.py", line 445, in find_class
    return StockUnpickler.find_class(self, module, name)
  File "/usr/local/Cellar/python/2.7.13_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1130, in find_class
    __import__(module)
ImportError: No module named updater.updater

It has been explained elsewhere that pickle needs the class definition to load the object, but I can't see where the reference to the updater module comes from as I'm only pickling an instance of the Vectorizer.

I've simplified this example heavily. The two packages sit quite far apart in terms of our codebase. Importing one module into the other might not be feasible. Is there any way to work around this?

Josh
  • As a workaround, you might append updater's path to the `PYTHONPATH` and then import it. No workaround I'm afraid (to my knowledge), you'll need to import updater. – cs95 Jul 28 '17 at 11:24
  • @cᴏʟᴅsᴘᴇᴇᴅ when I say work around, I mean something like a shared class that just does pickling and unpickling. What exactly gets saved with the pickle? Is it the immediate parent that triggered the pickle or something else? – Josh Jul 28 '17 at 11:26
  • Something like a snapshot of the instance - its data and attributes is pickled. When loading, you'd still need a container to affix the snapshot to. – cs95 Jul 28 '17 at 11:30
  • @cᴏʟᴅsᴘᴇᴇᴅ can you expand on "you'd still need a container to affix the snapshot to"? I don't really understand what that means in this context. – Josh Jul 28 '17 at 11:47
  • It's a little hard for me to explain... I'm not good with the technical terms. But basically, you need the source file available as byte code. Just the object is not enough. – cs95 Jul 28 '17 at 11:48
  • @cᴏʟᴅsᴘᴇᴇᴅ I must have completely misunderstood the purpose of pickle then because it makes no sense that you need to have a reference to an arbitrary object which has a reference to the thing you're pickling. In any other language (my background isn't Python so forgive me), if you want to serialise an object you just need to know how to unserialise it. It's crazy that the sender and receiver of some serialised data have to have the same module. You're forced to leak functionality across boundaries when all I want to do is pass an object between two entirely unrelated objects. – Josh Jul 28 '17 at 12:00

1 Answer

The issue here is the lambda (anonymous function).

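It is entirely possible to pickle a self-contained object like the vectorizer on its own. However, the preprocessor in the example is a lambda defined in updater/updater.py, so the pickled vectorizer carries a reference to the updater.updater module (this is the module named in the ImportError). That module, and with it the Updater class, has to be importable wherever the object is unpickled.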
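To make the mechanism concrete, here is a minimal sketch using only the standard library (Python 3). The names are assumptions for illustration: `fake_updater` is a throwaway module standing in for updater.updater, `strip_non_ascii` replaces the lambda (plain pickle cannot serialise lambdas at all, whereas dill can), and `types.SimpleNamespace` stands in for the vectorizer as a class both sides can import. It is not the exact dill code path, only the same failure mode: a callable stored on the object drags a reference to its defining module into the payload.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the producer-side module (updater.updater in the question).
producer = types.ModuleType("fake_updater")
exec(
    "def strip_non_ascii(text):\n"
    "    return text.encode('ascii', 'ignore').decode('ascii')\n",
    producer.__dict__,
)
sys.modules["fake_updater"] = producer

# SimpleNamespace stands in for the vectorizer: a class importable on both sides.
vectorizer = types.SimpleNamespace(
    stop_words="english",
    preprocessor=producer.strip_non_ascii,  # callable owned by the producer module
)

blob = pickle.dumps(vectorizer)

# The function is stored by reference, so its module name ends up in the payload.
references_producer = b"fake_updater" in blob

# Simulate the consumer process, where the producer module does not exist.
del sys.modules["fake_updater"]
try:
    pickle.loads(blob)
    load_error = None
except ImportError as exc:
    load_error = exc  # unpickling tried to import fake_updater and failed
```

The object being pickled is only the stand-in vectorizer, yet loading it needs the producer module because of the one attribute that points back into it.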

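Rather than attaching a preprocessor function to the vectorizer, preprocess the data yourself and pass the cleaned text in to fit the vectorizer. The pickled vectorizer then no longer refers to anything in the updater package, so the Updater class is not needed when unpickling.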
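A rough sketch of that pattern, again standard library only (Python 3) and under stated assumptions: `collections.Counter` is a stand-in for the fitted vectorizer, and `strip_non_ascii` and the sample documents are made up for illustration. With scikit-learn the equivalent step would be fitting a `CountVectorizer` on the cleaned strings without setting `preprocessor`.

```python
import pickle
from collections import Counter


def strip_non_ascii(text):
    """Producer-side cleanup; it is applied to the data, never stored on the model."""
    return text.encode("ascii", "ignore").decode("ascii")


# Made-up sample documents containing non-ASCII characters.
docs = [u"caf\u00e9 menu", u"na\u00efve bayes menu"]

# Clean first, then "fit". Counter is only a stand-in for the vectorizer.
cleaned = [strip_non_ascii(doc) for doc in docs]
vocabulary = Counter(word for doc in cleaned for word in doc.split())

# The payload now refers only to a class both sides already share.
blob = pickle.dumps(vocabulary)
restored = pickle.loads(blob)
```

The trade-off is that the consumer has to apply the same cleanup to any text it later feeds to the vectorizer, so that small function (not the whole updater package) is the piece worth sharing or duplicating.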

Josh