Given a custom, new-style Python class instance, what is a good way to hash it and get a unique ID-like value from it to use for various purposes? Think md5sum or sha1sum of a given class instance.

The approach I am currently using pickles the class and runs that through hexdigest, storing the resultant hash string into a class property (this property is never part of the pickle/unpickle procedures, fyi). Except now I've run into a case where a third-party module uses nested classes, and there is no really good way to pickle those without some hacks. I figure that I am missing out on some clever little Python trick somewhere to accomplish this.

Edit:

Example code, because it seems to be a requirement around here to get any traction on a question. The class below can be initialized, and the self._uniq_id property is set up properly.

#!/usr/bin/env python

import hashlib

# Prefer cPickle (Python 2) and fall back to pickle.
try:
    import cPickle as pickle
except ImportError:
    import pickle
# END try

# Single class, pickles fine.
class FooBar(object):
    __slots__ = ("_foo", "_bar", "_uniq_id")

    def __init__(self, eth=None, ts=None, pkt=None):
        self._foo = "bar"
        self._bar = "bar"
        self._uniq_id = hashlib.sha1(pickle.dumps(self, -1)).hexdigest()[0:16]

    def __getstate__(self):
        return {'foo':self._foo, 'bar':self._bar}

    def __setstate__(self, state):
        self._foo = state['foo']
        self._bar = state['bar']
        self._uniq_id = hashlib.sha1(pickle.dumps(self, -1)).hexdigest()[0:16]

    def _get_foo(self): return self._foo
    def _get_bar(self): return self._bar
    def _get_uniq_id(self): return self._uniq_id

    foo = property(_get_foo)
    bar = property(_get_bar)
    uniq_id = property(_get_uniq_id)
# End




This next class, however, cannot be initialized because of Bar being nested in Foo:

#!/usr/bin/env python

import hashlib

# Prefer cPickle (Python 2) and fall back to pickle.
try:
    import cPickle as pickle
except ImportError:
    import pickle
# END try

# Nested class, can't pickle for hexdigest.
class Foo(object):
    __slots__ = ("_foo", "_bar", "_uniq_id")

    class Bar(object):
        pass

    def __init__(self, eth=None, ts=None, pkt=None):
        self._foo = "bar"
        self._bar = self.Bar()
        self._uniq_id = hashlib.sha1(pickle.dumps(self, -1)).hexdigest()[0:16]

    def __getstate__(self):
        return {'foo':self._foo, 'bar':self._bar}

    def __setstate__(self, state):
        self._foo = state['foo']
        self._bar = state['bar']
        self._uniq_id = hashlib.sha1(pickle.dumps(self, -1)).hexdigest()[0:16]

    def _get_foo(self): return self._foo
    def _get_bar(self): return self._bar
    def _get_uniq_id(self): return self._uniq_id

    foo = property(_get_foo)
    bar = property(_get_bar)
    uniq_id = property(_get_uniq_id)
# End


The error I receive is:

Traceback (most recent call last):
  File "./nest_test.py", line 70, in <module>
    foobar2 = Foo()
  File "./nest_test.py", line 49, in __init__
    self._uniq_id = hashlib.sha1(pickle.dumps(self, -1)).hexdigest()[0:16]
cPickle.PicklingError: Can't pickle <class '__main__.Bar'>: attribute lookup __main__.Bar failed


(nest_test.py has both classes in it, hence the line number offset.)


I found out that pickling requires the __getstate__() method, so I also implemented __setstate__() for completeness. But given the existing warnings about security and pickle, there's got to be a better way to do this.


Based on what I have read so far, the error stems from Python not being able to resolve the nested classes. It tries to look up the attribute __main__.Bar, which doesn't exist. It really needs to be able to find __main__.Foo.Bar instead, but there is no really good way to do this. I bumped into another SO answer here that provides a "hack" to trick Python, but it came with a stern warning that such an approach is not advisable, and to either use something other than pickling or to move the nested class definition to the outside versus the inside.

However, the original question behind that SO answer, I believe, was about pickling and unpickling to a file. I only need to pickle in order to use the requisite hashlib functions, which seem to operate on a bytearray (much like I am used to in .NET), and pickling (especially cPickle) is fast and optimized compared with writing my own bytearray routine.

Kumba

2 Answers

That depends entirely on what properties the ID should have.

For instance, you can use id(foo) to get an ID which is guaranteed to be unique as long as foo is active in memory, or you could use repr(instance.__dict__) if all of the fields have sensible repr values.
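As a rough sketch of the repr idea for a class that uses `__slots__` (so there is no `instance.__dict__`), the repr can be built from the slot values and then hashed. The `Point` class and the `content_id` helper below are hypothetical names used only for illustration, and the approach assumes every field has a stable, sensible repr (a nested object with the default repr would leak its memory address into the hash).

```python
import hashlib


class Point(object):
    """Hypothetical example class; stands in for any slotted class."""
    __slots__ = ("_x", "_y")

    def __init__(self, x, y):
        self._x = x
        self._y = y

    def __repr__(self):
        # No __dict__ is available with __slots__, so walk the slot names.
        fields = ", ".join(
            "%s=%r" % (name, getattr(self, name)) for name in self.__slots__
        )
        return "%s(%s)" % (type(self).__name__, fields)


def content_id(obj):
    """Return a short hex ID derived from repr(obj) (hypothetical helper)."""
    data = repr(obj).encode("utf-8")  # hashlib wants bytes, not text
    return hashlib.sha1(data).hexdigest()[:16]


a = Point(1, 2)
b = Point(1, 2)
c = Point(1, 3)
```

Here `a` and `b` get the same `content_id` because their contents match, while `id(a)` and `id(b)` differ because they are distinct live objects. Which of the two behaviours is wanted depends on what the ID is for.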

What specifically do you need it for?

David Wolever
  • Still learning Python, so I am not familiar with all the bells and whistles of a class. What, exactly, is `id()`? – Kumba Feb 18 '12 at 08:03
  • @Kumba In CPython, id() gets you the memory address of the object. – Juri Robl Feb 18 '12 at 08:59
  • @Kumba: Does the repr suggestion not meet your needs? It seems like the obvious way to do it. – Marcin Feb 18 '12 at 11:10
  • @Marcin: Don't think so, because I am using `__slots__` to cut down on memory usage (so no `instance.__dict__` is available). I also haven't defined a `__repr__()` function for the classes in question (haven't had a need yet, really). – Kumba Feb 18 '12 at 14:17
  • @Kumba: Why don't you define an appropriate repr, then hash that? – Marcin Feb 18 '12 at 14:23
  • @Marcin: I added an example. I thought I gave enough in the description to get the idea across, but someone probably downvoted me over it (why do people around here want examples for the most trivial of things??). – Kumba Feb 18 '12 at 22:24
  • @Kumba people want examples because it's very easy for words and descriptions to be misinterpreted, but it's much harder to misinterpret a code example. – David Wolever Feb 18 '12 at 22:46
  • @Kumba I don't see anything that prevents you from defining an appropriate repr – Marcin Feb 18 '12 at 22:59
  • @David: Agreed, in part. I figured, though, that stating the inability to pickle a nested class would have sufficed, because it seemed, from my Google searches, to be a well-documented issue within the Python community. I haven't run across an SO answer yet that provides a good solution, not a hack, to pickling nested classes. Usually, this means no such answer exists, but I was holding out hope. – Kumba Feb 19 '12 at 01:10

While you're using hexdigests of pickles at the moment, you make it sound like the id doesn't actually need to be related to the object, it just needs to be unique. Why not simply use the uuid module, specifically uuid.uuid4 to generate unique IDs and assign them to a uuid field in the object...
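A minimal sketch of that idea, assuming the ID only has to be unique and does not have to reflect the object's contents. The `Record` class and its attribute names are made up for illustration; `uuid.uuid4()` is the standard-library call.

```python
import uuid


class Record(object):
    """Hypothetical slotted class that carries a random unique ID."""
    __slots__ = ("_payload", "_uuid")

    def __init__(self, payload):
        self._payload = payload
        # Random 128-bit ID as 32 hex characters; unrelated to the payload.
        self._uuid = uuid.uuid4().hex

    uniq_id = property(lambda self: self._uuid)


first = Record("bar")
second = Record("bar")
```

Note that `first` and `second` hold the same payload but get different IDs, so this does not behave like a checksum of the data.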

Endophage
  • What I basically want is the equivalent of an md5sum or sha1sum of the given object. If the object changes, I would recalc the hash to match. Although, currently, my classes are used to parse data from a file and reformat it into another form, then write it back out. Creating a hash of each object and the snippet of data that it holds seems sensible for future development if I decide to expand the capabilities any. – Kumba Feb 18 '12 at 08:05
  • To add, I know that in .NET, this is doable by converting the object into a byte array, then feeding it to one of .NET's many hashing functions. I guess pickling in Python is the same way. It's just the matter of the nested classes in the 3rd-party module that effectively renders the pickle approach unusable. – Kumba Feb 18 '12 at 08:08
  • Gotcha. It sounds like it would make sense to actually create a hash from the properties of the object, so the value of the data itself and any metadata. A hash of the whole object (like the pickle) is going to be tied to Python. A hash of just the data and metadata, in a specific order, would be language agnostic. That would allow you or some other developer that had to interface with your code in another language to reproduce and confirm your data signature. – Endophage Feb 18 '12 at 08:11
  • A good idea, but this code is going to specifically be in Python. It's mostly just some data formatting code, so I am open to a "Pythonic" way of tackling this, as long as it gets around the nested class problem so I don't have to go badger the author of the 3rd-party module (which is open-source). – Kumba Feb 18 '12 at 08:30
  • Took a look at this, but uuid wasn't introduced until Python 2.5. I need to support 2.4 to 2.7+, so uuid, which looks near perfect, isn't available on a few machines these scripts will run on. Thanks, though! – Kumba Feb 19 '12 at 08:19
  • Ah fair enough. It's always tougher when you have to support legacy code. – Endophage Feb 20 '12 at 04:00
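
The suggestion in the comments above, hashing just the data and metadata in a specific order rather than a pickle of the whole object, might be sketched like this. The `field_digest` function and the length-prefix encoding are assumptions made for illustration, not something from the thread; any unambiguous, fixed-order byte encoding would do.

```python
import hashlib


def field_digest(fields):
    """Hash an ordered list of (name, value) pairs (hypothetical helper).

    Each part is converted with str() and length-prefixed so that adjacent
    parts cannot run together, e.g. ("ab", "c") versus ("a", "bc").
    """
    h = hashlib.sha1()
    for name, value in fields:
        for part in (name, value):
            data = str(part).encode("utf-8")
            h.update(("%d:" % len(data)).encode("ascii"))
            h.update(data)
    return h.hexdigest()


digest = field_digest([("foo", "bar"), ("bar", "baz")])
```

Because only plain field values go into the hash, no class ever has to be pickled, and the same digest could be reproduced in another language by following the same encoding. Nested objects would need to be flattened into their own field lists first.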