Can someone explain this?

>>> import pickle
>>> pickle.loads(b'\x80\x03X\x01\x00\x00\x00.q\x00h\x00\x86q\x01.') == pickle.loads(b'\x80\x03X\x01\x00\x00\x00.q\x00X\x01\x00\x00\x00.q\x01\x86q\x02.')
True

>>> pickle.loads(b'\x80\x03X\x01\x00\x00\x00.q\x00h\x00\x86q\x01.')
('.', '.')
>>> pickle.loads(b'\x80\x03X\x01\x00\x00\x00.q\x00X\x01\x00\x00\x00.q\x01\x86q\x02.')
('.', '.')

There seem to be a long and a short pickled version of a tuple that contains the same element twice.

Other examples:

>>> pickle.loads(b'\x80\x03X\x01\x00\x00\x00#q\x00X\x01\x00\x00\x00#q\x01\x86q\x02.')
('#', '#')
>>> pickle.loads(b'\x80\x03X\x01\x00\x00\x00#q\x00h\x00\x86q\x01.')
('#', '#')

>>> pickle.loads(b'\x80\x03X\x01\x00\x00\x00$q\x00X\x01\x00\x00\x00$q\x01\x86q\x02.')
('$', '$')
>>> pickle.loads(b'\x80\x03X\x01\x00\x00\x00$q\x00h\x00\x86q\x01.')
('$', '$')

I'm trying to index items by their pickle but I'm not finding the items because their pickles seem to be changing.

I'm using Python 3.3.2 on Ubuntu.

mtanti
  • Observe `pickle.dumps('.', 3) == b'\x80\x03X\x01\x00\x00\x00.q\x00.'`, and the sequence `X\x01\x00\x00\x00.` occurs twice in the longer pickled version of `('.','.')` and only once in the shorter. So although I don't read pickle format, I think what you have there is one pickle file that indicates a tuple containing the same string twice, and another that indicates a tuple containing two strings that happen to be equal. Short string interning in the Python implementation might erase that difference during unpickling, but even if not they look the same when printed and are equal. – Steve Jessop Jan 22 '14 at 00:39
  • Consider a broadly equivalent question that someone might ask: "I'm indexing objects by their JSON representation, but `[1, 1]` and `[ 1, 1 ]` are different JSON representations of the same object". JSON is quite a simple format, and canonicalizing it is quite simple: make the whitespace regular, order the keys in dictionaries, that's about it. Pickle formats are less simple, and for types in general you're going to have some difficulty identifying and interning equal values. You can use a set to store the canonical version of anything hashable, but not everything pickleable is hashable. – Steve Jessop Jan 22 '14 at 00:47
  • @SteveJessop So then how do you store, and search for, arbitrary data in a database? I'm storing the data as pickles and then pickling the search query before searching. – mtanti Jan 22 '14 at 05:33
  • You don't. The definition of "equal" for two values in the database has to match what the database is capable of seeing as equal. So you store and search for structured data in a database. If you can't find a way to structure your data then you have a rather difficult problem, which in some cases cannot be solved other than by moving the logic out of the database. – Steve Jessop Jan 22 '14 at 08:36
  • I just noticed that shelve only allows strings as keys. That explains a lot. Still, I think it's rather silly not to make pickles canonical; they would have been more useful then. – mtanti Jan 22 '14 at 14:33
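
Steve Jessop's first comment can be reproduced with a small experiment. This is only a sketch, and it assumes CPython, where `''.join(...)` builds a fresh string object on each call; the names `a`, `b`, `shared` and `distinct` are made up for illustration. A longer string is used because CPython may reuse a single cached object for one-character strings.

```python
import pickle

# Two equal strings that are (in CPython) two separate objects.
a = ''.join(['hello ', 'world'])
b = ''.join(['hello ', 'world'])

# The same object twice: the pickler writes the string once and then
# refers back to its memo entry for the second element.
shared = pickle.dumps((a, a), protocol=3)

# Two distinct but equal objects: the string is written out twice.
distinct = pickle.dumps((a, b), protocol=3)

print(len(shared), len(distinct))
print(pickle.loads(shared) == pickle.loads(distinct))
```

So the byte stream depends on object identity inside the tuple, not only on the values, which is why the pickled form is a poor lookup key.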

1 Answer

Pickles aren't unique; the pickle format is actually a tiny little programming language, and different programs (pickles) can produce the same output (unpickled object). From the docs:

Since the pickle data format is actually a tiny stack-oriented programming language, and some freedom is taken in the encodings of certain objects, it is possible that the two modules [pickle and cPickle] produce different data streams for the same input objects. However it is guaranteed that they will always be able to read each other’s data streams.
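
To see the "tiny stack-oriented programming language" for yourself, the two byte strings from the question can be walked opcode by opcode with `pickletools.genops`. A minimal sketch; the helper name `opcode_names` is made up for this example.

```python
import pickletools

short = b'\x80\x03X\x01\x00\x00\x00.q\x00h\x00\x86q\x01.'
long_ = b'\x80\x03X\x01\x00\x00\x00.q\x00X\x01\x00\x00\x00.q\x01\x86q\x02.'

def opcode_names(data):
    """Return the opcode names of a pickle, in stream order."""
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]

# The short version stores the string once (BINPUT) and fetches it back (BINGET):
# PROTO, BINUNICODE, BINPUT, BINGET, TUPLE2, BINPUT, STOP
print(opcode_names(short))

# The long version pushes the string literal twice:
# PROTO, BINUNICODE, BINPUT, BINUNICODE, BINPUT, TUPLE2, BINPUT, STOP
print(opcode_names(long_))
```

Both programs leave the same tuple on the stack, but they are different programs. `pickletools.dis(short)` prints a more detailed listing if you want the arguments as well.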

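There's even a pickletools.optimize function that takes a pickle and returns a shorter, equivalent pickle by eliminating unused PUT opcodes. Since the bytes of a pickle are not a stable identity for the object, you're going to need to redesign your program.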
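
Note that `pickletools.optimize` is not a canonicalizer. A rough check using the two pickles from the question (the variable names are made up for this example):

```python
import pickle
import pickletools

short = b'\x80\x03X\x01\x00\x00\x00.q\x00h\x00\x86q\x01.'
long_ = b'\x80\x03X\x01\x00\x00\x00.q\x00X\x01\x00\x00\x00.q\x01\x86q\x02.'

opt_short = pickletools.optimize(short)
opt_long = pickletools.optimize(long_)

# Each optimized pickle is smaller and still loads to the same tuple...
print(len(short), len(opt_short))
print(len(long_), len(opt_long))
print(pickle.loads(opt_short) == pickle.loads(opt_long))

# ...but the two optimized byte strings are still different from each other,
# because one reuses a memo entry and the other repeats the string literal.
print(opt_short == opt_long)
```

So optimizing before indexing would not fix the lookup problem in the question.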

user2357112
  • How do you turn objects into canonical indices then? – mtanti Jan 21 '14 at 23:57
  • @mtanti: Depends on your use case. You might not need to, or you might be able to use `id`, or `repr` might be appropriate, or you could write custom code. – user2357112 Jan 21 '14 at 23:59
  • @mtanti: If your objects are tuples of two strings… that already makes a pretty good index as-is. It's immutable and hashable, storing and restoring it with any reasonable persistence format (including `pickle`) is guaranteed to give an equal object, etc. – abarnert Jan 22 '14 at 00:02
  • @user2357112 @abarnert I'm developing a database that stores arbitrary data, which I'm storing as pickles. – mtanti Jan 22 '14 at 00:04
  • @mtanti: What do you mean by "developing a database"? Creating your own database engine meant to directly store Python objects, and expecting to be able to do Python expressions on the database? Or just using a database engine like sqlite or MySQL to store Python objects? – abarnert Jan 22 '14 at 00:48
  • I'm sorry, I meant to say "database application". I'm using sqlite3 and I want to store arbitrary data in it in order to use it as a type of dict with extra features (shelf isn't an alternative, no). – mtanti Jan 22 '14 at 05:28
  • Since shelve only allows strings as keys, I am going to only allow strings in my application as well. – mtanti Jan 24 '14 at 06:00
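
For what it's worth, the string-keys-only design from the last comment might look roughly like this. It is a sketch under assumptions, not the asker's actual code: the class name `StringKeyedStore`, the table name `items` and its columns are all made up. Keys are compared by SQLite as plain text; only the values are pickled, so the non-uniqueness of pickles no longer affects lookups.

```python
import pickle
import sqlite3

class StringKeyedStore:
    """Hypothetical dict-like store: str keys, arbitrary picklable values."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS items "
            "(key TEXT PRIMARY KEY, value BLOB NOT NULL)"
        )

    def __setitem__(self, key, value):
        if not isinstance(key, str):
            raise TypeError("keys must be str")
        # The connection as a context manager commits on success.
        with self._db:
            self._db.execute(
                "INSERT OR REPLACE INTO items (key, value) VALUES (?, ?)",
                (key, pickle.dumps(value)),
            )

    def __getitem__(self, key):
        row = self._db.execute(
            "SELECT value FROM items WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        # Only unpickle data you wrote yourself; pickle is unsafe on untrusted input.
        return pickle.loads(row[0])

store = StringKeyedStore()
store["point"] = (".", ".")
print(store["point"])
```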