You can use the pickletools module to produce a stream of disassembled operations, which lets you collect information about the modules and names the pickled data would need to access. I'd use the pickletools.genops() function here.
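As a rough illustration of what genops() gives you, here is a minimal sketch that lists the opcodes for a small dictionary. The sample data and the variable names are made up for this example and are not taken from your code:

```python
import pickle
import pickletools

# Hypothetical sample data, pickled with an explicit protocol so the
# opcode stream does not depend on the interpreter's default protocol.
data = pickle.dumps({"answer": [1, 2, 3]}, protocol=4)

# genops() yields (opcode_info, argument, byte_position) tuples;
# opcode_info.name is the opcode's symbolic name.
ops = [(op.name, arg) for op, arg, pos in pickletools.genops(data)]

for name, arg in ops:
    print(name, arg)
```

Nothing is unpickled here; genops() only parses the byte stream, so no modules are imported and no objects are constructed.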
Now, the module is aimed at the core developers working on the pickle library, so documentation on the opcodes it emits is only found in the module source code, and many opcodes are tied to specific versions of the protocol. The GLOBAL and STACK_GLOBAL opcodes are the interesting ones here: GLOBAL is used by the older protocols, while STACK_GLOBAL was added in protocol 4. For GLOBAL, the name loaded is the opcode argument; for STACK_GLOBAL, you need to look at the stack. The stack is a little more complex than just push and pop operations, however: variable-length items (lists, dicts, etc.) use a marker object to let the unpickler detect when such an object has been completed, and there is a memoizing function to avoid having to repeatedly name items in the stream.
The module code details how the stack, memo and various opcodes work, but you generally can ignore most of this if all you need is to know what names are referenced.
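To make the difference between the two opcodes concrete, here is a small sketch that pickles the same class under protocol 2 and protocol 4 and compares the opcodes. The choice of fractions.Fraction and the helper name opcode_names are arbitrary picks for illustration, and fix_imports=False is passed only so the Python 2 name mapping cannot alter the module name:

```python
import fractions
import pickle
import pickletools


def opcode_names(obj, protocol):
    """Return (opcode name, argument) pairs for obj pickled with protocol."""
    data = pickle.dumps(obj, protocol=protocol, fix_imports=False)
    return [(op.name, arg) for op, arg, _pos in pickletools.genops(data)]


old = opcode_names(fractions.Fraction, 2)
new = opcode_names(fractions.Fraction, 4)

# Protocol 2: module and name arrive together as the GLOBAL argument,
# joined by a single space.
print([item for item in old if item[0] == "GLOBAL"])

# Protocol 4: module and name are pushed as two separate strings, and
# STACK_GLOBAL (which carries no argument) consumes them from the stack.
print([item for item in new if item[0] in {"SHORT_BINUNICODE", "STACK_GLOBAL"}])
```

This is why a protocol 4 stream needs at least a partial stack simulation: the STACK_GLOBAL opcode itself tells you nothing about which name is loaded.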
So for your stream, and assuming that the stream is always well-formed, the following simplification of the pickletools.dis() function would let you extract all names referenced by GLOBAL and STACK_GLOBAL opcodes:
import pickletools

def get_names(stream):
    """Generates (module, qualname) tuples from a pickle stream"""
    # memo is a dict so PUT-style opcodes can store under any index
    stack, markstack, memo = [], [], {}
    mo = pickletools.markobject

    for op, arg, pos in pickletools.genops(stream):
        # simulate the pickle stack and marking scheme, insofar
        # necessary to allow us to retrieve the names used by STACK_GLOBAL
        before, after = op.stack_before, op.stack_after
        numtopop = len(before)

        if op.name == "GLOBAL":
            # the argument is "module name", joined by a single space
            yield tuple(arg.split(None, 1))
        elif op.name == "STACK_GLOBAL":
            yield (stack[-2], stack[-1])

        elif mo in before or (op.name == "POP" and stack and stack[-1] is mo):
            # discard everything back to, and including, the topmost mark
            markstack.pop()
            while stack[-1] is not mo:
                stack.pop()
            stack.pop()
            try:
                numtopop = before.index(mo)
            except ValueError:
                numtopop = 0
        elif op.name in {"PUT", "BINPUT", "LONG_BINPUT", "MEMOIZE"}:
            if op.name == "MEMOIZE":
                # MEMOIZE stores under the next free memo index
                memo[len(memo)] = stack[-1]
            else:
                memo[arg] = stack[-1]
            numtopop, after = 0, []  # memoize and put do not pop the stack
        elif op.name in {"GET", "BINGET", "LONG_BINGET"}:
            arg = memo[arg]

        if numtopop:
            del stack[-numtopop:]
        if mo in after:
            markstack.append(pos)

        if len(after) == 1 and op.arg is not None:
            stack.append(arg)
        else:
            stack.extend(after)
And a short demo for your example input:
>>> import pickle
>>> pickled_bar = pickle.dumps(bar)
>>> for mod, qualname in get_names(pickled_bar):
... print(f"module: {mod}, name: {qualname}")
...
module: __main__, name: bar
or a slightly more involved example with an inspect.Signature() instance for the same function:
>>> import inspect
>>> pickled_sig_set = pickle.dumps({inspect.signature(bar)})
>>> for mod, qualname in get_names(pickled_sig_set):
... print(f"module: {mod}, name: {qualname}")
...
module: inspect, name: Signature
module: inspect, name: _empty
The latter makes use of memoization to re-use the inspect module name for the inspect.Signature.empty reference, as well as a marker to track where the set elements started.
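If you want to see the marker and the memo at work in isolation, the following sketch pickles a list that holds the same string object twice. The string and the variable names are invented for the example, and protocol 4 is chosen explicitly:

```python
import pickle
import pickletools

# One string object referenced twice from the same list (hypothetical data).
shared = "some shared string"
data = pickle.dumps([shared, shared], protocol=4)

ops = [(op.name, arg) for op, arg, _pos in pickletools.genops(data)]
for name, arg in ops:
    print(name, arg)

# The string itself is written only once and stored in the memo ...
string_ops = [item for item in ops if item[0] == "SHORT_BINUNICODE"]
# ... and the second reference is a memo lookup instead of a new string.
memo_reads = [item for item in ops if item[0] == "BINGET"]
```

The list is memoized first (memo index 0), so the string lands at memo index 1 and the second reference shows up as BINGET 1. The MARK opcode in the output sits just before the list items, and APPENDS pops everything back to that mark, which is the bookkeeping get_names() has to imitate.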