Current approach I have going for enabling both 'import x' and 'from x import y' dependency bundling. One drawback of this current implementation is that it creates copies of the methods in each module that uses them, in contrast to the code's origin, where each usage is just a reference to the same method in memory (though I have conflicting results here — see the section after the code).
/// analysis_script.py /// (dependencies excluded for brevity)
import test_module
from third_level_module import z
def f():
    """Exercise both import styles: call test_module.g four times, then z."""
    for _ in range(1, 5):
        test_module.g('blah string used by g')
    z()
/// driver.py ///
# Driver: serialize analysis_script together with its discovered
# dependencies into functions.pickle (see modutil).
import modutil
import analysis_script

modutil.serialize_module_with_dependencies(analysis_script)
/// modutil.py ///
import sys
import modulefinder
import os
import inspect
import marshal
def dump_module(funcfile, name, module):
    """Marshal-dump `name` followed by the code object of every function
    visible as an attribute of `module` into the open file `funcfile`.

    NOTE(review): inspect.getmembers(module) returns functions *imported*
    into the module as well as those defined in it, so the same function
    (e.g. z in the example output) is dumped once per module that
    references it -- this is the duplication drawback described above.
    """
    # Every (name, function) pair reachable as a module attribute.
    functions_list = [o for o in inspect.getmembers(module) if inspect.isfunction(o[1])]
    print 'module name:' + name
    # The name string is written before the code objects; the reader
    # relies on this ordering to delimit modules in the stream.
    marshal.dump(name, funcfile)
    for func in functions_list:
        print func
        # marshal can serialize code objects but not function objects, so
        # defaults, closures and function attributes are lost here.
        marshal.dump(func[1].func_code, funcfile)
def serialize_module_with_dependencies(module):
    """Serialize `module` plus every non-stdlib module it imports.

    Writes 'functions.pickle' as a stream of alternating module-name
    strings and function code objects (see dump_module) for the given
    module and each dependency found by modulefinder.
    """
    # PYTHONPATH may legitimately be unset; don't raise KeyError.
    python_path_env = os.environ.get('PYTHONPATH', '')
    python_path = python_path_env.split(os.pathsep) if python_path_env else []
    module_path = os.path.dirname(module.__file__)
    # Search for modules only on the python path and under the module's own
    # directory; standard libraries should be expected to be installed on
    # the target platform.
    # BUGFIX: ModuleFinder wants a flat list of directories; the previous
    # [python_path, module_path] nested the PYTHONPATH list inside it.
    search_dir = python_path + [module_path]
    mf = modulefinder.ModuleFinder(search_dir)
    # __file__ points at the .pyc after the first run; mf.run_script needs
    # the .py source. Use endswith rather than a substring replace so a
    # path that merely *contains* '.pyc' is not corrupted.
    src_file = module.__file__
    if src_file.endswith('.pyc'):
        src_file = src_file[:-1]
    mf.run_script(src_file)
    funcfile = open("functions.pickle", "wb")
    try:
        dump_module(funcfile, 'sandbox', module)
        for name, mod in mf.modules.iteritems():
            # The sys module is included by default but has no file and we
            # don't want it anyway (it should be on the remote system's
            # path). __main__ is also skipped: it is the virtually empty
            # driver that invoked this function.
            if name not in ('sys', '__main__'):
                dump_module(funcfile, name, sys.modules[name])
    finally:
        # Close the output even if a dump raises (was leaked before).
        funcfile.close()
/// sandbox_reader.py ///
import marshal
import types
import imp
# Rebuild the serialized module graph from functions.pickle and invoke f().
sandbox_module = imp.new_module('sandbox')
# name -> freshly created module object for each non-sandbox module.
# NOTE(review): entries are written below but never read afterwards, and
# these modules are never attached to sandbox_module -- the exec below
# binds the *real* imported module instead. Worth confirming this is the
# cause of the surprising addresses reported after the code.
dynamic_modules = {}
# Holds either the literal string "sandbox" or the module object whose
# functions are currently being loaded.
current_module = ''
with open("functions.pickle", "rb") as funcfile:
    # The stream alternates: a module-name string, then that module's
    # function code objects, until EOF (matches dump_module's layout).
    while True:
        try:
            code = marshal.load(funcfile)
        except EOFError:
            break
        if isinstance(code,types.StringType):
            # A string marks the start of a new module's section.
            print "module name:" + code
            if code == 'sandbox':
                current_module = "sandbox"
            else:
                current_module = imp.new_module(code)
                dynamic_modules[code] = current_module
                # Imports the real module by name into the sandbox
                # namespace so sandbox code can reference it.
                exec 'import '+code in sandbox_module.__dict__
        elif isinstance(code,types.CodeType):
            print "func"
            if current_module == "sandbox":
                # Rehydrate the code object as a function whose globals
                # are the sandbox module's namespace.
                func = types.FunctionType(code, sandbox_module.__dict__, code.co_name)
                setattr(sandbox_module, code.co_name, func)
            else:
                # Same, but bound into the reconstructed module's namespace.
                func = types.FunctionType(code, current_module.__dict__, code.co_name)
                setattr(current_module, code.co_name, func)
        else:
            raise Exception( "unknown type received")
# yaa! actually invoke the method
sandbox_module.f()
del sandbox_module
For instance, the function graph looks like this before serialization:
module name:sandbox
('f', <function f at 0x15e07d0>)
('z', <function z at 0x7f47d719ade8>)
module name:test_module
('g', <function g at 0x15e0758>)
('z', <function z at 0x7f47d719ade8>)
module name:third_level_module
('z', <function z at 0x7f47d719ade8>)
Specifically, looking at the function z we can see that all the references point to the same address, i.e. 0x7f47d719ade8.
On the remote process after sandbox reconstruction we have:
print sandbox_module.z
<function z at 0x1a071b8>
print sandbox_module.third_level_module.z
<function z at 0x1a072a8>
print sandbox_module.test_module.z
<function z at 0x1a072a8>
This blows my mind! I would have thought all addresses here would be unique after reconstruction but for some reason sandbox_module.test_module.z and sandbox_module.third_level_module.z have the same address?