
Is anyone able to explain the comment under testLookups() in this code snippet?

I've run the code, and what the comment says is indeed true. However, I'd like to understand why it's true, i.e. why cPickle outputs different values for the same object depending on how it is referenced.

Does it have anything to do with reference count? If so, isn't that some kind of a bug - i.e. the pickled and deserialized object would have an abnormally high reference count and in effect would never get garbage collected?

AakashM
julx

2 Answers


There is no guarantee that seemingly identical objects will produce identical pickle strings.

The pickle protocol is a virtual machine, and a pickle string is a program for that virtual machine. For a given object there exist multiple pickle strings (=programs) that will reconstruct that object exactly.

To take one of your examples:

>>> from cPickle import dumps
>>> t = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])
>>> dumps(({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5]))
"((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\nsS'Hello World'\np2\n(I1\nI2\nI3\nI4\nI5\ntp3\n(lp4\nI1\naI2\naI3\naI4\naI5\nat."
>>> dumps(t)
"((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\nsS'Hello World'\n(I1\nI2\nI3\nI4\nI5\nt(lp2\nI1\naI2\naI3\naI4\naI5\natp3\n."

The two pickle strings differ in their use of the p opcode. The opcode takes one integer argument and its function is as follows:

  name='PUT'    code='p'   arg=decimalnl_short

  Store the stack top into the memo.  The stack is not popped.

  The index of the memo location to write into is given by the newline-
  terminated decimal string following.  BINPUT and LONG_BINPUT are
  space-optimized versions.

To cut a long story short, the two pickle strings are basically equivalent.
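As a quick check of that equivalence, here is a minimal sketch that loads both strings and compares the results. It assumes Python 3's `pickle` module rather than `cPickle`, so the two strings from the session above are written as bytes literals; the variable names are only illustrative.

```python
import pickle

# The two protocol-0 pickles from the session above, as bytes literals.
from_literal = (
    b"((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\n"
    b"sS'Hello World'\np2\n(I1\nI2\nI3\nI4\nI5\ntp3\n"
    b"(lp4\nI1\naI2\naI3\naI4\naI5\nat."
)
from_name = (
    b"((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\n"
    b"sS'Hello World'\n(I1\nI2\nI3\nI4\nI5\nt"
    b"(lp2\nI1\naI2\naI3\naI4\naI5\natp3\n."
)

expected = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World',
            (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])

# Different programs for the pickle VM, same reconstructed object.
restored_a = pickle.loads(from_literal)
restored_b = pickle.loads(from_name)

print(from_literal == from_name)   # the byte strings differ
print(restored_a == restored_b)    # the objects they rebuild are equal
```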

I haven't tried to nail down the exact cause of the differences in the generated opcodes. This could well have to do with the reference counts of the objects being serialized. What is clear, however, is that discrepancies like this have no effect on the reconstructed object.

NPE
  • Yea, you're right - I've just read the same thing in the docs. BTW, what an awesome idea to have a stack VM for serialization. – julx Sep 21 '11 at 15:03

It is looking at the reference counts, from the cPickle source:

if (Py_REFCNT(args) > 1) {
    if (!( py_ob_id = PyLong_FromVoidPtr(args)))
        goto finally;

    if (PyDict_GetItem(self->memo, py_ob_id)) {
        if (get(self, py_ob_id) < 0)
            goto finally;

        res = 0;
        goto finally;
    }
}

The pickle protocol has to deal with pickling multiple references to the same object. To avoid duplicating the object when it is unpickled, it uses a memo, which maps indexes to objects. The PUT (p) opcode in the pickle stores the current object in this memo dictionary so that a later reference can fetch it again instead of rebuilding it.
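The memo can be seen at work in a small sketch. This assumes Python 3's `pickle` and `pickletools` modules (not `cPickle`) and uses protocol 0 so the text opcodes match the ones discussed here; `shared` and `outer` are hypothetical names.

```python
import pickle
import pickletools

shared = [1, 2]
outer = [shared, shared]  # two references to one list object

data = pickle.dumps(outer, protocol=0)

# List the opcode names of the generated pickle program.
opcode_names = [op.name for op, arg, pos in pickletools.genops(data)]

restored = pickle.loads(data)

print(opcode_names)                # contains PUT for the first use, GET for the second
print(restored[0] is restored[1])  # the sharing survives the round trip
```

The second occurrence of `shared` is written as a GET of the memo entry rather than as a second copy of the list, which is why the unpickled result still contains one list referenced twice.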

However, if there is only a single reference to an object, there is no reason to store it in the memo: with only one reference, nothing else can need to refer to it again. The cPickle code therefore checks the reference count as a small optimization at this point.

So yes, it's the reference counts. But that's not a problem. The unpickled objects will have the correct reference counts; cPickle just produces a slightly shorter pickle when the reference counts are at 1.
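A PUT that no GET ever reads is dead weight in the pickle program. The standard library's `pickletools.optimize` removes such unused PUT opcodes, which is one way to see that they carry no information. The sketch below assumes Python 3 and uses two hand-written protocol-0 pickles of the list `[1, 2]`; they are illustrative, not output captured from cPickle.

```python
import pickle
import pickletools

# Hand-written protocol-0 pickles of [1, 2]: one memoizes the list, one does not.
with_put = b"(lp0\nI1\naI2\na."
without_put = b"(lI1\naI2\na."

# optimize() drops PUT opcodes that are never referenced by a GET.
optimized = pickletools.optimize(with_put)

print(optimized == without_put)
print(pickle.loads(with_put) == pickle.loads(without_put))
```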

Now, I don't know what you are doing that makes you care about this. But you really shouldn't assume that pickling the same object will always give you the same result. If nothing else, I'd expect dictionaries to give you problems because the order of the keys is undefined. Unless you have Python documentation that guarantees the pickle is the same each time, I highly recommend you don't depend on it.
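The dictionary case is easy to reproduce. This sketch assumes a Python 3 version whose dicts keep insertion order (3.7 and later), which differs from the Python 2 behaviour described above but leads to the same conclusion: equal dicts need not produce equal pickles.

```python
import pickle

# Two dicts that compare equal but were built in a different key order.
a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}

pickle_a = pickle.dumps(a, protocol=0)
pickle_b = pickle.dumps(b, protocol=0)

print(a == b)                # equal as objects
print(pickle_a == pickle_b)  # but serialized in different key order
print(pickle.loads(pickle_a) == pickle.loads(pickle_b))
```

Comparing the unpickled objects, rather than the pickle strings, sidesteps the problem.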

Winston Ewert
  • Thanks, that's a great answer. I personally don't care about this issue, but django-picklefield (https://github.com/shrubberysoft/django-picklefield) does and it uses a weird workaround, that I had to work around in order to make my code work. I agree with the conclusion - one shouldn't depend on pickle to produce same results for the same object. – julx Sep 21 '11 at 15:27
  • @julkiewicz, ok as long as you've got a good reason. One thing you might find useful is the pickletools.optimize function. When I tried on some of the sample strings you provided it made them the same. – Winston Ewert Sep 21 '11 at 15:29
  • "it just produces a slightly shorter pickle when the reference counts are at 1." In the example linked by the OP, the pickle seems to be *longer* for the case that the reference count is 1. – Sven Marnach Sep 21 '11 at 16:47
  • @SvenMarnach, missed that. Although running those code segments on my machine has the two versions swapped compared to what that comment says. Either a version difference or perhaps the original author of the comment swapped them, I guess. – Winston Ewert Sep 21 '11 at 17:26
  • @Winston: Yeah, the latter seems to be the most likely reason. – Sven Marnach Sep 21 '11 at 17:56
  • @SvenMarnach I think it's not an optimization for the length of the output, it's an optimization for the speed of computation. The slightly longer result is just a side effect. – julx Sep 21 '11 at 19:31