Inspecting a pickle dump for dependencies

Question

Suppose I write the following code:

import pickle
def foo():
    return 'foo'

def bar():
    return 'bar' + foo()

pickle.dump(bar, open('bar.bin', 'wb'))

At this point, I have a binary dump (of course without the dependency on foo from the global scope). Now, if I run the following line

temp = pickle.load(open('bar.bin', 'rb'))

I get the following error, and it makes perfect sense after having read this.

Error: AttributeError: Can't get attribute 'bar' on <module '__main__' (built-in)>

This of course is a minimal example but I'm curious if there is a generalized way that I can inspect the dependencies needed to properly unpickle the pickle dump. A trivial solution would be to handle attribute errors (as in the above case), but can I do it programmatically?

At any rate, please do read the whole [pickle module documentation](https://docs.python.org/3/library/pickle.html). Apart from excellent references on what pickle does and what its limitations are, there is also a *see also* section at the bottom, which tells you about [`pickletools`](https://docs.python.org/3/library/pickletools.html#module-pickletools), which lets you programmatically disassemble the pickle data format. You can then, from the disassembly, learn what modules would need to exist for the pickle to be loadable. — Martijn Pieters, Nov 15 '20 at 22:00

Martijn Pieters · Answer 1 · 2020-11-16T00:00:29.993

You can use the pickletools module to produce a stream of disassembled operations, which would let you collect information about what modules and names the pickled data would need to access. I'd use the pickletools.genops() function here.

Now, the module is aimed at the core developers working on the pickle library, so the opcodes it emits are documented only in the module source code, and many are tied to specific versions of the protocol. The interesting opcodes here are GLOBAL and STACK_GLOBAL. For GLOBAL, the name loaded is the opcode argument; for STACK_GLOBAL, you need to look at the stack. The stack is a little more complex than plain push and pop operations, however: variable-length items (lists, dicts, etc.) use a marker object so the unpickler can detect when such an object is complete, and there is a memo so that items do not have to be named repeatedly in the stream.

The module code details how the stack, memo and various opcodes work, but you generally can ignore most of this if all you need is to know what names are referenced.
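To see the two opcodes side by side, here is a minimal sketch that lists the operations pickletools.genops() reports for the same class reference under protocol 2 and protocol 4. The helper name list_ops and the choice of fractions.Fraction as the pickled object are illustrative assumptions, not part of the original question.

```python
import fractions
import pickle
import pickletools


def list_ops(payload):
    """Return (opcode name, argument) pairs for a pickle byte string."""
    return [(op.name, arg) for op, arg, _pos in pickletools.genops(payload)]


# Pickle the same class reference under an older and a newer protocol.
ops_v2 = list_ops(pickle.dumps(fractions.Fraction, protocol=2))
ops_v4 = list_ops(pickle.dumps(fractions.Fraction, protocol=4))

# Protocol 2 carries the name as the GLOBAL argument ("module name").
for name, arg in ops_v2:
    print("v2", name, repr(arg))

# Protocol 4 pushes two strings first, then STACK_GLOBAL (no argument).
for name, arg in ops_v4:
    print("v4", name, repr(arg))
```

Under protocol 2 the module and name arrive together in the GLOBAL argument, while under protocol 4 they are two separate string pushes that STACK_GLOBAL consumes, which is why the stack has to be tracked at all.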

So for your stream, and assuming that the stream is always well-formed, the following simplification of the dis() function would let you extract all names referenced by GLOBAL and STACK_GLOBAL opcodes:

import pickletools

def get_names(stream):
    """Generates (module, qualname) tuples from a pickle stream"""

    # memo is a dict: PUT-style opcodes store under explicit indices
    stack, markstack, memo = [], [], {}
    mo = pickletools.markobject

    for op, arg, pos in pickletools.genops(stream):
        # simulate the pickle stack and marking scheme, insofar as
        # necessary to allow us to retrieve the names used by STACK_GLOBAL

        before, after = op.stack_before, op.stack_after
        numtopop = len(before)

        if op.name == "GLOBAL":
            # the argument is "module name", separated by a single space
            yield tuple(arg.split(" ", 1))
        elif op.name == "STACK_GLOBAL":
            yield (stack[-2], stack[-1])

        elif mo in before or (op.name == "POP" and stack and stack[-1] is mo):
            markstack.pop()
            while stack[-1] is not mo:
                stack.pop()
            stack.pop()
            try:
                numtopop = before.index(mo)
            except ValueError:
                numtopop = 0
        elif op.name in {"PUT", "BINPUT", "LONG_BINPUT", "MEMOIZE"}:
            if op.name == "MEMOIZE":
                # MEMOIZE stores under the next free memo index
                memo[len(memo)] = stack[-1]
            else:
                memo[arg] = stack[-1]
            numtopop, after = 0, []  # memoize and put do not pop the stack
        elif op.name in {"GET", "BINGET", "LONG_BINGET"}:
            arg = memo[arg]
    
        if numtopop:
            del stack[-numtopop:]
        if mo in after:
            markstack.append(pos)
    
        if len(after) == 1 and op.arg is not None:
            stack.append(arg)
        else:
            stack.extend(after)

And a short demo for your example input:

>>> pickled_bar = pickle.dumps(bar)
>>> for mod, qualname in get_names(pickled_bar):
...     print(f"module: {mod}, name: {qualname}")
...
module: __main__, name: bar

or a slightly more involved example with an inspect.Signature() instance for the same:

>>> import inspect
>>> pickled_sig_set = pickle.dumps({inspect.signature(bar)})
>>> for mod, qualname in get_names(pickled_sig_set):
...     print(f"module: {mod}, name: {qualname}")
...
module: inspect, name: Signature
module: inspect, name: _empty

The latter makes use of the memoization to re-use the inspect name for the inspect.Signature.empty reference, as well as a marker to track where the set elements started.
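Once you have the (module, qualname) pairs, one way to check them programmatically, instead of waiting for an AttributeError at load time, is to try resolving each pair in the current interpreter. The following is only a sketch under that assumption. The helper name find_unresolvable and the sample pairs are hypothetical, and importing a module runs its top-level code, so only do this for modules you trust.

```python
import importlib


def find_unresolvable(names):
    """Return the (module, qualname) pairs that cannot be resolved here."""
    missing = []
    for module_name, qualname in names:
        try:
            obj = importlib.import_module(module_name)
            # qualnames may be dotted (nested classes, attributes of classes)
            for part in qualname.split("."):
                obj = getattr(obj, part)
        except (ImportError, AttributeError):
            missing.append((module_name, qualname))
    return missing


# Hypothetical sample input; in practice, pass the pairs extracted above.
sample = [
    ("fractions", "Fraction"),
    ("no_such_module_for_demo", "thing"),
    ("inspect", "Signature.no_such_attr"),
]
missing = find_unresolvable(sample)
print(missing)
```

This only tells you whether a name can be found, not whether the object behind it is compatible with what was pickled.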

This won't find transitive dependencies, of course - there's no hint in the pickle that whatever `bar` is depends on a `foo` function. No amount of pickle inspection will tell you about `foo`. — user2357112, Nov 15 '20 at 23:19
@user2357112: nope, and there is never a guarantee that what is later on imported when loading a pickle is the same object definition as what was used when pickling. Many a Zope-based project ran upgrades when loading pickles by poking things into `sys.modules` or creative uses of `__setstate__`! — Martijn Pieters, Nov 15 '20 at 23:27
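The point about transitive dependencies can be checked directly: a function is pickled by reference, so the payload holds a module name and a function name but none of the function's code. The sketch below uses json.dumps purely as a convenient importable function. That choice is an assumption for illustration, not something from the question.

```python
import json
import pickle

# Pickle a plain module-level function from the standard library.
payload = pickle.dumps(json.dumps)

# The payload is tiny and contains only the names needed to look it up.
print(len(payload), payload)

# The function's bytecode is not stored in the pickle.
print(json.dumps.__code__.co_code in payload)

# Loading performs an import and attribute lookup, yielding the same object.
print(pickle.loads(payload) is json.dumps)
```

Whatever json.dumps itself calls is invisible in the payload, which is why no amount of pickle inspection can reveal `foo` in the original example.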
