
As a follow-up to this question: Is there an easy way to pickle a Python function (or otherwise serialize its code)?

I would like to see an example of this bullet from the above post:

"If the function references globals (including imported modules, other functions etc) that you need to pick up, you'll need to serialise these too, or recreate them on the remote side. My example just gives it the remote process's global namespace."

I have a simple test going where I write a function's bytecode to a file using marshal:

def g(self,blah): 
    print blah

def f(self):
    for i in range(1,5):
        print 'some function f'
        g('some string used by g')

import marshal

data = marshal.dumps(f.func_code)

# marshal produces binary data, so write it in binary mode
with open('/tmp/f2.txt', 'wb') as out:
    out.write(data)

Then starting a fresh python instance I do:

import marshal, types

with open('/tmp/f2.txt', 'rb') as infile:
    code = marshal.loads(infile.read())
func2 = types.FunctionType(code, globals(), "some_func_name")
func2('blah')

This results in a:

NameError: global name 'g' is not defined

This happens no matter how I try to include g. I have tried sending g over in basically the same way as f, but f still cannot see g. How do I get g into the global namespace so that it can be used by f in the receiving process?

Someone also recommended looking at pyro as an example of how to do this. I have already made an attempt at trying to understand the related code in the disco project. I took their dPickle class and tried to recreate their disco/tests/test_pickle.py functionality in a standalone app without success. My experiment had problems doing the function marshaling with the dumps call. Anyway, maybe a pyro exploration is next.

In summary, the basic functionality I am after is being able to send a method over the wire and have all the basic "workspace" methods sent over with it (like g).

Example with changes from answer:

Working function_writer:

import marshal

def g(blah):
    print blah


def f():
    for i in range(1,5):
        print 'some function f'
        g('blah string used by g')


f_data = marshal.dumps(f.func_code)
g_data = marshal.dumps(g.func_code)

# marshal output is binary, so open the files in binary mode
with open('/tmp/f.txt', 'wb') as f_file:
    f_file.write(f_data)

with open('/tmp/g.txt', 'wb') as g_file:
    g_file.write(g_data)

Working function_reader:

import marshal, types

with open('/tmp/f.txt', 'rb') as f_file:
    f_code = marshal.loads(f_file.read())
with open('/tmp/g.txt', 'rb') as g_file:
    g_code = marshal.loads(g_file.read())

# bind each function to the global name that the other one looks up at call time
f = types.FunctionType(f_code, globals(), 'f')
g = types.FunctionType(g_code, globals(), 'g')

f()
Ryan R.

5 Answers

36

Updated Sep 2020: See the comment by @ogrisel below. The developers of PiCloud moved to Dropbox shortly after I wrote the original version of this answer in 2013, though a lot of folks are still using the cloudpickle module seven years later. The module made its way to Apache Spark, where it has continued to be maintained and improved. I'm updating the example and background text below accordingly.

Cloudpickle

The cloudpickle package is able to pickle a function, method, class, or even a lambda, as well as any dependencies. To try it out, just pip install cloudpickle and then:

import cloudpickle

def foo(x):
    return x*3

def bar(z):
    return foo(z)+1

x = cloudpickle.dumps(bar)
del foo
del bar

import pickle

f = pickle.loads(x)
print(f(3))  # displays "10"

In other words, just call cloudpickle.dump() or cloudpickle.dumps() the same way you'd use pickle.*, then later use the native pickle.load() or pickle.loads() to thaw.

Background

PiCloud.com released the cloud Python package under the LGPL, and other open-source projects quickly started using it (search for cloudpickle.py to see a few). The folks at PiCloud had an incentive to put the effort into making general-purpose code pickling work -- their whole business was built around it. The idea was that if you had cpu_intensive_function() and wanted to run it on Amazon's EC2 grid, you just replaced:

cpu_intensive_function(some, args) 

with:

cloud.call(cpu_intensive_function, some, args)

The latter used cloudpickle to pickle up any dependent code and data, shipped it to EC2, ran it, and returned the results to you when you called cloud.result().

PiCloud billed in millisecond increments and was cheap as heck. I used it all the time for Monte Carlo simulations and financial time series analysis, when I needed hundreds of CPU cores for just a few seconds each. Years later, I still can't say enough good things about it, and I didn't even work there.

stevegt
  • 1
    Thank you sir :) I've been struggling with dill for a couple of hours, but cloud just works, straightforwardly. I do believe that this should be the accepted answer – 55651909-089b-4e04-9408-47c5bf Mar 13 '14 at 08:34
  • 7
    As the original PiCloud client SDK is no longer maintained, a new project was started just to maintain the cloudpickle features: http://github.com/cloudpipe/cloudpickle : `pip install cloudpickle` – ogrisel Apr 30 '15 at 08:33
  • 1
    @stevegt : your example doesn't seem to save built-in functions correctly :) – user2284570 Aug 20 '16 at 00:38
  • Cloudpickle worked where dill failed to correctly reimport dependencies, thanks! – H4dr1en Jul 12 '19 at 05:13
  • I used piCloud for a couple of university projects back then, worked like a charm. There still isn't a good alternative that I'm aware of. – Peter Smit Oct 19 '20 at 12:24
6

I have tried basically the same approach to sending g over as f but f can still not see g. How do I get g into the global namespace so that it can be used by f in the receiving process?

Assign it to the global name g. (I see you are assigning f to func2 rather than to f. If you are doing something like that with g, then it is clear why f can't find g. Remember that name resolution happens at runtime -- g isn't looked up until you call f.)

Of course, I'm guessing since you didn't show the code you're using to do this.
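To make the runtime-lookup point concrete, here is a minimal sketch. It is written for Python 3, where func_code is spelled __code__, and the names remote_globals and restored_f are made up for illustration. It rebuilds f against a fresh globals dictionary, shows that the call fails while g is missing, and succeeds once g is stored under exactly that name:

```python
import builtins
import marshal
import types

def g(text):
    return "g got: " + text

def f():
    return g("hello")

# Round-trip both code objects through marshal, as in the question.
f_code = marshal.loads(marshal.dumps(f.__code__))
g_code = marshal.loads(marshal.dumps(g.__code__))

# Hypothetical stand-in for the receiving process's global namespace.
remote_globals = {"__builtins__": builtins}

restored_f = types.FunctionType(f_code, remote_globals, "f")

try:
    restored_f()
    lookup_failed = False
except NameError:
    # g is only looked up when f runs, and it is not in remote_globals yet.
    lookup_failed = True

# Bind the restored g under the name that f's bytecode refers to.
remote_globals["g"] = types.FunctionType(g_code, remote_globals, "g")
result = restored_f()
```

Note that restored_f was created before g existed in the dictionary; only the state of the dictionary at call time matters.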

It might be best to create a separate dictionary to use for the global namespace for the functions you're unpickling -- a sandbox. That way all their global variables will be separate from the module you're doing this in. So you might do something like this:

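import marshal, types

# seed the builtins so the restored functions can still call range() etc.
sandbox = {"__builtins__": __builtins__}

with open("functions.pickle", "rb") as funcfile:
    while True:
        try:
            code = marshal.load(funcfile)
        except EOFError:
            break
        sandbox[code.co_name] = types.FunctionType(code, sandbox, code.co_name)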

In this example I assume that you've put the code objects from all your functions in one file, one after the other, and when reading them in, I get the code object's name and use it as the basis for both the function object's name and the name under which it's stored in the sandbox dictionary.

Inside the unpickled functions, the sandbox dictionary is their globals() and so inside f(), g gets its value from sandbox["g"]. To call f then would be: sandbox["f"]("blah")
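For anyone who wants to try the whole round trip in one go, here is a rough sketch under a few assumptions: it targets Python 3 (so __code__ instead of func_code), it uses an in-memory io.BytesIO buffer as a stand-in for the functions.pickle file, and the two toy functions are invented for the demo.

```python
import builtins
import io
import marshal
import types

def f(x):
    return x + 1

def g(x):
    return f(x) ** 2

# Writer side: dump the code objects back to back into one stream.
# The in-memory buffer stands in for the "functions.pickle" file.
stream = io.BytesIO()
for func in (f, g):
    marshal.dump(func.__code__, stream)
stream.seek(0)

# Reader side: rebuild every function inside a dedicated namespace.
sandbox = {"__builtins__": builtins}
while True:
    try:
        code = marshal.load(stream)
    except EOFError:
        break  # nothing left in the stream
    sandbox[code.co_name] = types.FunctionType(code, sandbox, code.co_name)

# The restored g resolves f through the sandbox dictionary.
answer = sandbox["g"](3)
```

Because every restored function shares the same sandbox dictionary as its globals, the order in which the functions are loaded does not matter, as long as all of them are present before the first call.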

kindall
  • Oh wow, I did not realize the assigned reference made a difference! Thanks! Will post working code. – Ryan R. Apr 06 '12 at 19:36
  • Great, I like the sandbox. Want to explore next auto serializing all of a functions dependencies automatically. Sort of like what the disco modutil.find_modules method does. Appreciate the help. – Ryan R. Apr 06 '12 at 20:01
4

Every module has its own globals; there are no universal globals. We can "implant" restored functions into some module and then use it like a normal module.

-- save --

import marshal
def f(x):
    return x + 1
def g(x):
    return f(x) ** 2
funcfile = open("functions.pickle", "wb")
marshal.dump(f.func_code, funcfile)
marshal.dump(g.func_code, funcfile)
funcfile.close()

-- restore --

import marshal
import types
open('sandbox.py', 'w').write('')  # create an empty module 'sandbox'
import sandbox
with open("functions.pickle", "rb") as funcfile:
    while True:
        try:
            code = marshal.load(funcfile)
        except EOFError:
            break
        func = types.FunctionType(code, sandbox.__dict__, code.co_name)
        setattr(sandbox, code.co_name, func)   # or sandbox.f = ... if the name is fixed
assert sandbox.g(3) == 16   # f(3) ** 2
# the restored functions can also be imported from other modules
from sandbox import g

Edited:
You can also import a module, e.g. "sys", into the "sandbox" namespace from outside:

sandbox.sys = __import__('sys')

or the same:

exec 'import sys' in sandbox.__dict__
assert 'sys' in sandbox.__dict__, 'Verify imported into sandbox'

Your original code would work if you ran it in a Python program or in the normal Python interactive interpreter rather than in interactive IPython.

IPython uses an unusual namespace that is not the dict of any module in sys.modules. Normal Python, or any main program, uses sys.modules['__main__'].__dict__ as globals(). Any module uses that_module.__dict__, which is also fine; only interactive IPython is a problem.
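The module-implanting idea can also be sketched without writing a sandbox.py file to disk. This variation is my own assumption and not part of the approach above: it targets Python 3, builds the module with types.ModuleType, registers it in sys.modules, and ships both code objects as one marshaled list. The module name sandbox_demo is made up.

```python
import builtins
import marshal
import sys
import types

def f(x):
    return x + 1

def g(x):
    return f(x) ** 2

# Ship both code objects as a single marshaled list.
payload = marshal.dumps([f.__code__, g.__code__])

# Build an empty module object in memory instead of writing sandbox.py.
sandbox = types.ModuleType("sandbox_demo")
sandbox.__dict__["__builtins__"] = builtins
sys.modules[sandbox.__name__] = sandbox  # makes "import sandbox_demo" work

for code in marshal.loads(payload):
    func = types.FunctionType(code, sandbox.__dict__, code.co_name)
    setattr(sandbox, code.co_name, func)

# The restored functions are now importable like any other module member.
from sandbox_demo import g as restored_g

value = restored_g(3)
```

The restored functions report sandbox_demo as their __module__, because the function constructor takes that value from the __name__ entry of the globals it is given.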

hynekcer
  • Thanks! +1 Was curious about that too. – Ryan R. Apr 07 '12 at 00:42
  • 1
    @RyanR. Your original code would work if normal Python is used instead of IPython. – hynekcer Apr 07 '12 at 09:31
  • Aren't 'import x; x.method()' style use cases a problem in the remote scripts? As in: http://stackoverflow.com/questions/10099326/how-to-do-an-embedded-python-module-for-remote-sandbox-execution – Ryan R. Apr 11 '12 at 03:42
3

You can get a better handle on global objects by importing __main__, and using the methods available in that module. This is what dill does in order to serialize almost anything in python. Basically, when dill serializes an interactively defined function, it uses some name mangling on __main__ on both the serialization and deserialization side that makes __main__ a valid module.

>>> import dill
>>> 
>>> def bar(x):
...   return foo(x) + x
... 
>>> def foo(x):
...   return x**2
... 
>>> bar(3)
12
>>> 
>>> _bar = dill.loads(dill.dumps(bar))
>>> _bar(3)
12

Actually, dill registers its types into the pickle registry, so if you have some black box code that uses pickle and you can't really edit it, then just importing dill can magically make it work without monkeypatching the 3rd party code.

Or, if you want the whole interpreter session sent over as a "Python image", dill can do that too.

>>> # continuing from above
>>> dill.dump_session('foobar.pkl')
>>>
>>> ^D
dude@sakurai>$ python
Python 2.7.5 (default, Sep 30 2013, 20:15:49) 
[GCC 4.2.1 (Apple Inc. build 5566)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> dill.load_session('foobar.pkl')
>>> _bar(3)
12

You can easily send the image across ssh to another computer, and start where you left off there as long as there's version compatibility of pickle and the usual caveats about python changing and things being installed.

iMom0
Mike McKerns
  • 2
    but then if a Python program defines foo and bar and pickles bar into a file (using dill), and another Python program loads the pickled file into _bar and calls _bar(3), it errors out with foo being undefined. Why doesn't it work in that case? – David Brochart Jul 13 '16 at 14:03
  • I'm not sure I see what exactly you are asking. Can you provide more detail (either in a question of its own, or on the GitHub issues page for `dill`)? – Mike McKerns Jul 13 '16 at 21:40
  • I opened a new issue here: https://github.com/uqfoundation/dill/issues/176 – David Brochart Jul 13 '16 at 22:01
3

Dill (along with other pickle variants such as cloudpickle) seems to work when the functions being pickled are defined in the main module, the same one that does the pickling. If you are pickling a function from another module, that module has to be importable under the same name when the unpickling happens. I cannot seem to find a way around this limitation.
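The limitation is easy to reproduce with the standard library's pickle alone, which stores a function defined in a module as a reference (module name plus function name) rather than as code. The sketch below only demonstrates plain pickle on Python 3; it does not test dill or cloudpickle, and the module name helper_mod_demo is invented and created in memory to simulate "another module".

```python
import pickle
import sys
import types

# Simulate "another module" in memory; the name is made up for this sketch.
helper = types.ModuleType("helper_mod_demo")
exec("def double(x):\n    return 2 * x\n", helper.__dict__)
sys.modules[helper.__name__] = helper

blob = pickle.dumps(helper.double)

# The pickle contains the module and function names as a reference.
stores_names = b"helper_mod_demo" in blob and b"double" in blob

# With the module importable, unpickling hands back the very same object.
same_object = pickle.loads(blob) is helper.double

# Remove the module, as on a machine that never had it installed.
del sys.modules[helper.__name__]
try:
    pickle.loads(blob)
    failed_without_module = False
except ImportError:  # ModuleNotFoundError is a subclass of ImportError
    failed_without_module = True
```

So for functions that live in a real module, the usual remedy is to make that module available on the receiving side rather than to look for a different pickler.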

Prasanna