4

Background

When experimenting with machine learning, I often reuse models trained previously, by means of pickling/unpickling. However, when working on the feature-extraction part, it's a challenge not to confuse different models. Therefore, I want to add a check that ensures that the model was trained using exactly the same feature-extraction procedure as the test data.

Problem

My idea was the following: Along with the model, I'd include in the pickle dump a hash value which fingerprints the feature-extraction procedure.

When training a model or using it for prediction/testing, the model wrapper is given a feature-extraction class that conforms to certain protocol. Using hash() on that class won't work, of course, as it isn't persistent across calls. So I thought I could maybe find the source file where the class is defined, and get a hash value from that file.

However, there might be a way to get a stable hash value from the class’s in-memory contents directly. This would have two advantages: It would also work if no source file can be found. And it would probably ignore irrelevant changes to the source file (eg. fixing a typo in the module docstring). Do classes have a code object that could be used here?

Davis Herring
  • 36,443
  • 4
  • 48
  • 76
lenz
  • 5,658
  • 5
  • 24
  • 44
  • It’s impossible to prove that none of the functions that your methods *call* haven’t changed. But it’s pretty straightforward to hash the method definitions themselves (at least within one version of Python). Is that enough? – Davis Herring Oct 06 '18 at 01:09
  • Yes, I think that would be pretty good in most cases. I have learned in the meantime that classes don't have a code object, only function objects do. So for the class hash one would need to iterate over the method hashes. – lenz Oct 06 '18 at 20:22
  • They have a code object, but it’s executed and discarded when the class is created (much like a module’s). – Davis Herring Oct 06 '18 at 21:56

1 Answers1

3

All you’re looking for is a hash procedure that includes all the salient details of the class’s definition. (Base classes can be included by including their definitions recursively.) To minimize false matches, the basic idea is to apply a wide (cryptographic) hash to a serialization of your class. So start with pickle: it supports more types than hash and, when it uses identity, it uses a reproducible identity based on name. This makes it a good candidate for the base case of a recursive strategy: deal with the functions and classes whose contents are important and let it handle any ancillary objects referenced.

So define a serialization by cases. Call an object special if it falls under any case below but the last.

  • For a tuple deemed to contain special objects:
    1. The character t
    2. The serialization of its len
    3. The serialization of each element, in order
  • For a dict deemed to contain special objects:
    1. The character d
    2. The serialization of its len
    3. The serialization of each name and value, in sorted order
  • For a class whose definition is salient:
    1. The character C
    2. The serialization of its __bases__
    3. The serialization of its vars
  • For a function whose definition is salient:
    1. The character f
    2. The serialization of its __defaults__
    3. The serialization of its __kwdefaults__ (in Python 3)
    4. The serialization of its __closure__ (but with cell values instead of the cells themselves)
    5. The serialization of its vars
    6. The serialization of its __code__
  • For a code object (since pickle doesn’t support them at all):
    1. The character c
    2. The serializations of its co_argcount, co_nlocals, co_flags, co_code, co_consts, co_names, co_freevars, and co_cellvars, in that order; none of these are ever special
  • For a static or class method object:
    1. The character s or m
    2. The serialization of its __func__
  • For a property:
    1. The character p
    2. The serializations of its fget, fset, and fdel, in that order
  • For any other object: pickle.dumps(x,-1)

(You never actually store all this: just create a hashlib object of your choice in the top-level function, and in the recursive part update it with each piece of the serialization in turn.)

The type tags are to avoid collisions and in particular to be prefix-free. Binary pickles are already prefix-free. You can base the decision about a container on a deterministic analysis of its contents (even if heuristic) or on context, so long as you’re consistent.

As always, there is something of an art to balancing false positives against false negatives: for a function, you could include __globals__ (with pruning of objects already serialized to avoid large if not infinite serializations) or just any __name__ found therein. Omitting co_varnames ignores renaming local variables, which is good unless introspection is important; similarly for co_filename and co_name.

You may need to support more types: look for static attributes and default arguments that don’t pickle correctly (because they contain references to special types) or at all. Note of course that some types (like file objects) are unpicklable because it’s difficult or impossible to serialize them (although unlike pickle you can handle lambdas just like any other function once you’ve done code objects). At some risk of false matches, you can choose to serialize just the type of such objects (as always, prefixed with a character ? to distinguish from actually having the type in that position).

Davis Herring
  • 36,443
  • 4
  • 48
  • 76
  • This can catch changes in a function body, but it won't catch changes in any code outside the function that the function relies on. If it tries to serialize `def f(thing): return other_function(thing) - 1` and `other_function` gets completely overhauled, this serialization scheme won't notice. – user2357112 Oct 07 '18 at 04:52
  • @user2357112: Of course, and it won’t catch that the function’s behavior depends on the system clock either. I already asked about the general impossibility in a comment on the question, and you *can* branch out into `__globals__` to at least catch helpers in the same module if it’s a concern. – Davis Herring Oct 07 '18 at 05:01
  • Thanks for the well-structured and -explained instructions. As you say, the hardest part is figuring out what to include, in the trade-off of avoiding false matches vs. rendering the complete process useless by always generating a different hash value. – lenz Oct 07 '18 at 09:46