Pickling dict in Python

Question

Can I expect the string representation of the same pickled dict to be consistent across different machines/runs for the same Python version? In the scope of one run on the same machine?

e.g.

# Python 2.7

import pickle
initial = pickle.dumps({'a': 1, 'b': 2})
for _ in xrange(1000**2):
    assert pickle.dumps({'a': 1, 'b': 2}) == initial

Does it depend on the actual structure of my dict object (nested values etc.)?

UPD: The thing is, I can't actually make the code above fail within a single run (Python 2.7), no matter what my dict object looks like (which keys/values it has, etc.).

Most definitely not. Do you have a good reason to use the string representation? You use `xrange`, which means Python 2, in which the order of keys in a dictionary is arbitrary (which renders the string representation useless). — DeepSpace, Oct 23 '18 at 12:23
the code above works (one run ofc), so rly "useless"? also I want to understand such a behavior, so I have a pretty good reason for such a question :) — d-d, Oct 23 '18 at 12:25
The pickled representation of your dict can vary, as has been pointed out. However, the unpickled dict will compare as equal to the original, even in such cases, and isn't that what really matters? — jasonharper, Oct 23 '18 at 12:35
nope, in my question the string representation of a pickled object matters, that's it — d-d, Oct 23 '18 at 12:36
If you need to maintain order **and** maintain data types, why not use `collections.OrderedDict`? — jpp, Oct 25 '18 at 15:10
Completely irrelevant, but your alias line could be simply `pickle = dumps`. No need for a lambda if all you're doing is passing on the same number of args — GP89, Oct 26 '18 at 15:25

Martijn Pieters · Accepted Answer · 2018-10-26T15:27:07.497

You can't in the general case, for the same reasons you can't rely on the dictionary order in other scenarios; pickling is not special here. The string representation of a dictionary is a function of the current dictionary iteration order, regardless of how you loaded it.

Your own small test is too limited, because it doesn't do any mutation of the test dictionary and doesn't use keys that would cause collisions. You create dictionaries with the exact same Python source code, so those will produce the same output order because the editing history of the dictionaries is exactly the same, and two single-character keys that use consecutive letters from the ASCII character set are not likely to cause a collision.

Not that you actually test string representations being equal, you only test if their contents are the same (two dictionaries that differ in string representation can still be equal because the same key-value pairs, subjected to a different insertion order, can produce different dictionary output order).
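The effect of insertion order on the pickled bytes can be sketched as follows. This is a minimal illustration that assumes Python 3.7 or later, where dict iteration order is insertion order; on the Python 2.7 from the question both literals may well come out in the same order. The variable names are illustrative only.

```python
import pickle

# Same key-value pairs, inserted in a different order.
first = {'a': 1, 'b': 2}
second = {'b': 2, 'a': 1}

# The dictionaries compare equal ...
contents_equal = first == second

# ... but the pickled bytes differ, because pickle writes the items
# in the dictionary's iteration order.
pickles_equal = pickle.dumps(first) == pickle.dumps(second)

print(contents_equal, pickles_equal)
```

So equality of the dictionaries says nothing about equality of their pickles.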

Next, the most important factor in the dictionary iteration order before CPython 3.6 is the hash function used for keys, which must be stable for the lifetime of a single Python process (otherwise you'd break all dictionaries), so a single-process test would never see dictionary order change on the basis of different hash function results.

Currently, all pickling protocol revisions store the data for a dictionary as a stream of key-value pairs; on loading the stream is decoded and key-value pairs are assigned back to the dictionary in the on-disk order, so the insertion order is at least stable from that perspective. BUT between different Python versions, machine architectures and local configuration, the hash function results absolutely will differ:
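One way to look at that stream of key-value pairs is the standard `pickletools` module. The following is a rough sketch, assuming Python 3.7+ and protocol 2 (chosen only because its opcodes are easy to read); the variable names are illustrative.

```python
import pickle
import pickletools

# Build a dict whose insertion order is 'b' first, then 'a'.
d = {}
d['b'] = 2
d['a'] = 1

payload = pickle.dumps(d, protocol=2)

# Walk the opcode stream and collect the string arguments (the keys)
# in the order in which they were written.
keys_in_stream = [arg for opcode, arg, pos in pickletools.genops(payload)
                  if isinstance(arg, str)]

# Loading re-inserts the pairs in stream order.
restored = pickle.loads(payload)

print(keys_in_stream, list(restored))
```

`pickletools.dis(payload)` prints the same stream in a human-readable form if you want to inspect the individual opcodes.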

- The PYTHONHASHSEED environment variable is used in the generation of hashes for str, bytes and datetime keys. The setting is available as of Python 2.6.8 and 3.2.3, and is enabled and set to random by default as of Python 3.3. So the setting varies from Python version to Python version, and can be set to something different locally.
- The hash function produces an ssize_t integer, a platform-dependent signed integer type, so different architectures can produce different hashes just because they use a larger or smaller ssize_t type definition.

With different hash function output from machine to machine and from Python run to Python run, you will see different string representations of a dictionary.
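The influence of PYTHONHASHSEED on string hashes can be checked with a small experiment. This is only a sketch: it assumes Python 3.6+ (for the f-string), launches child interpreters via `sys.executable`, and `str_hash_under_seed` is a hypothetical helper name, not part of any library.

```python
import os
import subprocess
import sys


def str_hash_under_seed(seed, text="first_string_key"):
    """Return hash(text) as computed by a fresh interpreter started
    with PYTHONHASHSEED set to the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    code = f"print(hash({text!r}))"
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    return int(out)


# The same seed gives the same hash in separate processes ...
same_seed_equal = str_hash_under_seed(1) == str_hash_under_seed(1)

# ... while different seeds give different hashes.
different_seed_equal = str_hash_under_seed(1) == str_hash_under_seed(2)

print(same_seed_equal, different_seed_equal)
```

On interpreters where dict order depends on hashes (before CPython 3.6), this difference is what changes the iteration order from run to run.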

And finally, as of CPython 3.6, the implementation of the dict type changed to a more compact format that also happens to preserve insertion order. As of Python 3.7, the language specification has changed to make this behaviour mandatory, so other Python implementations have to implement the same semantics. So pickling and unpickling between different Python implementations or versions predating Python 3.7 can also result in a different dictionary output order, even with all other factors equal.

Slam · Answer 2 · 2018-10-26T08:16:43.640


No, you cannot. This depends on a lot of things, including key values, interpreter state and Python version.

If you need consistent representation, consider using JSON with canonical form.
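A minimal sketch of such a canonical form with the standard `json` module follows. It assumes the data is JSON-serializable and that sorted keys plus fixed separators are enough for your notion of "canonical"; `canonical_json` is a hypothetical helper name.

```python
import json


def canonical_json(obj):
    """Serialize obj with sorted keys and no optional whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


# Two dicts with the same items but a different insertion order.
first = {'a': 1, 'b': 2}
second = {'b': 2, 'a': 1}

print(canonical_json(first))
print(canonical_json(first) == canonical_json(second))
```

Because the keys are sorted at serialization time, the output no longer depends on the dictionary's iteration order.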

EDIT

I'm not quite sure why people are downvoting this without any comments, but I'll clarify.

pickle is not meant to produce reliable representations; it's a purely machine-readable (not human-readable) serializer.

Python version backward/forward compatibility is a thing, but it applies only to the ability to deserialize an identical object inside the interpreter, i.e. when you dump in one version and load in another, the same public interfaces are guaranteed to behave the same way. Neither the serialized text representation nor the internal memory structure is claimed to be the same (and IIRC, it never was).

The easiest way to check this is to dump the same data in versions that differ significantly in structure handling and/or seed handling, while keeping your keys out of the cached range (no short integers or strings):

Python 3.5.6 (default, Oct 26 2018, 11:00:52) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> d = {'first_string_key': 1, 'second_key_string': 2}
>>> pickle.dumps(d)
b'\x80\x03}q\x00(X\x11\x00\x00\x00second_key_stringq\x01K\x02X\x10\x00\x00\x00first_string_keyq\x02K\x01u.'

Python 3.6.7 (default, Oct 26 2018, 11:02:59) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> d = {'first_string_key': 1, 'second_key_string': 2}
>>> pickle.dumps(d)
b'\x80\x03}q\x00(X\x10\x00\x00\x00first_string_keyq\x01K\x01X\x11\x00\x00\x00second_key_stringq\x02K\x02u.'

edited Oct 26 '18 at 08:16

answered Oct 23 '18 at 12:20


do you mean sorted keys (in case of JSON)? – d-d Oct 23 '18 at 12:21
Yes, in the case of the stdlib serializer it's `sort_keys=True` in https://docs.python.org/3/library/json.html#json.dump – Slam Oct 23 '18 at 12:22
it's a good workaround, yeah. but the thing is - JSON can't handle all the data types I need, that's why I was looking for another way to get the string representation of an object (list, dict, class instance etc.) – d-d Oct 23 '18 at 12:27
@d-d the json serializer (and deserializer) can be extended to support more types. It will of course never totally replace `pickle` but depending on your effective needs it might still be a working solution. – bruno desthuilliers Oct 23 '18 at 12:36
yes, good point, and basically I did it many times, but the goal of this discussion is to understand how pickling works in Python, not to find a workaround – d-d Oct 23 '18 at 12:39
@d-d your query is still answered here, the answer is NO! because it depends on a lot of things. Pickling is just a nicely formatted internal representation dump of objects, you cannot assume ordering of items in a hashed set being dumped in the same order on every machine. It is machine dependent, the issue is not so much pickle as much as it is python internal implementation of certain types like dict and set for example. For example you cannot assume str({'a': 1, 'b': 2}) == str({'a': 1, 'b': 2}) when each of the str calls are done on a different machine – Pykler Oct 25 '18 at 15:06
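The extension of the JSON serializer mentioned in the comments above could look roughly like this. It is a sketch under assumptions: the `__set__` and `__date__` tags and the helper names are made up for illustration, and only sets and dates are handled.

```python
import json
from datetime import date


def encode_extra(obj):
    """Fallback for types json cannot serialize on its own.
    The '__set__' and '__date__' tags are an arbitrary convention."""
    if isinstance(obj, (set, frozenset)):
        # Sort the members so the output does not depend on set order.
        return {"__set__": sorted(obj)}
    if isinstance(obj, date):
        return {"__date__": obj.isoformat()}
    raise TypeError("cannot serialize %r" % type(obj).__name__)


def canonical(obj):
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      default=encode_extra)


record = {"tags": {"b", "a"}, "day": date(2018, 10, 26)}
print(canonical(record))
```

A matching `object_hook` on the loading side would be needed to turn the tagged objects back into sets and dates.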

t.m.adam · Answer 3 · 2018-10-26T13:53:35.480

Python 2 dictionaries are unordered; the order depends on the hash values of keys as explained in this great answer by Martijn Pieters. I don't think you can use a dict here, but you could use an OrderedDict (requires Python 2.7 or higher) which maintains the order of the keys. For example,

from collections import OrderedDict

data = [('b', 0), ('a', 0)]
d = dict(data)
od = OrderedDict(data)

print(d)
print(od)

#{'a': 0, 'b': 0}
#OrderedDict([('b', 0), ('a', 0)])

You can pickle an OrderedDict like you would pickle a dict, but the order would be preserved, and the resulting string would be the same when pickling the same objects.

from collections import OrderedDict
import pickle

data = [('a', 1), ('b', 2)]
od = OrderedDict(data)
s = pickle.dumps(od)
print(s)

Note that you shouldn't pass a dict to OrderedDict's constructor, as the keys would already have been placed in the dict's own order. If you have a dictionary, you should first convert it to a list of tuples in the desired order. OrderedDict is a subclass of dict and has all the dict methods, so you could also create an empty object and assign new keys.
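Converting an existing dictionary to tuples in a fixed order before pickling could be sketched like this. It assumes a flat dict whose keys can be compared with each other, and that both pickles are produced by the same Python version; `pickle_sorted` is a hypothetical helper name.

```python
import pickle
from collections import OrderedDict


def pickle_sorted(mapping, protocol=2):
    """Pickle a flat dict with its items in sorted key order."""
    ordered = OrderedDict(sorted(mapping.items()))
    return pickle.dumps(ordered, protocol)


# Same items, different insertion history.
a = {}
a['b'] = 2
a['a'] = 1
b = {'a': 1, 'b': 2}

same = pickle_sorted(a) == pickle_sorted(b)
print(same)
```

Note that the result unpickles to an OrderedDict, not a plain dict, and that nested dictionaries would need the same treatment.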

Your test doesn't fail because you're using the same Python version and the same conditions - the order of the dictionary will not change randomly between loop iterations. But we can demonstrate how your code fails to produce different strings when we change the order of keys in the dictionary.

import pickle

initial = pickle.dumps({'a': 1, 'b': 2})
assert pickle.dumps({'b': 2, 'a': 1}) != initial

The resulting string should be different when we put key 'b' first (it would be different in Python >= 3.6), but in Python 2 it's the same because key 'a' is placed before key 'b'.

To answer your main question, Python 2 dictionaries are unordered, but a dictionary is likely to have the same order when using the same code and Python version. However, that order may not be the same as the order in which you placed the items in the dictionary. If the order is important, it's best to use an OrderedDict or update your Python version.


score 1 · Answer 4 · answered Oct 26 '18 at 04:19

As with a frustratingly large number of things in Python, the answer is "sort of". Straight from the docs,

The pickle serialization format is guaranteed to be backwards compatible across Python releases.

That's potentially ever so subtly different from what you're asking. If it's a valid pickled dictionary now, it'll always be a valid pickled dictionary, and it'll always deserialize to the correct dictionary. That leaves unspoken a few properties which you might expect and which don't have to hold:

- Pickling doesn't have to be deterministic, even for the same object in the same Python instance on the same platform. The same dictionary could have infinitely many possible pickled representations (not that we would expect the format to ever be inefficient enough to support arbitrarily large degrees of extra padding). As the other answers point out, dictionaries don't have a defined sort order, and this can give at least n! string representations of a dictionary with n elements.
- Going further with the last point, it isn't guaranteed that pickle is consistent even in a single Python instance. In practice those changes don't currently happen, but that behavior isn't guaranteed to remain in future versions of Python.
- Future versions of Python don't need to serialize dictionaries in a way which is compatible with current versions. The only promise we have is that they will be able to correctly deserialize our dictionaries. Currently dictionaries are supported the same in all Pickle formats, but that need not remain the case forever (not that I suspect it would ever change).
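That one dictionary already has several valid pickled representations can be seen by dumping it with each available protocol. This is a small sketch for Python 3; the variable names are illustrative.

```python
import pickle

data = {'first_string_key': 1, 'second_key_string': 2}

# One payload per protocol supported by the running interpreter.
payloads = {p: pickle.dumps(data, protocol=p)
            for p in range(pickle.HIGHEST_PROTOCOL + 1)}

# The byte strings are not all identical ...
distinct = len(set(payloads.values()))

# ... yet every one of them loads back to an equal dictionary.
all_round_trip = all(pickle.loads(blob) == data
                     for blob in payloads.values())

print(distinct, all_round_trip)
```

Only the round trip is promised by the documentation; the exact bytes depend on the protocol (and, with the default protocol, on the Python version that wrote them).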

Just because pickling is backwards compatible doesn't mean that the same dictionary order will be produced when you load a pickle file. — Martijn Pieters, Oct 26 '18 at 14:58
@MartijnPieters I think we're in agreement. Should I restructure/reformat my answer to make that more clear? — Hans Musgrave, Oct 26 '18 at 15:13

PM 2Ring · Answer 5 · 2018-10-23T12:40:23.760


If you don't modify the dict its string representation won't change during a given run of the program, and its .keys method will return the keys in the same order. However, the order can change from run to run (before Python 3.6).

Also, two different dict objects that have identical key-value pairs are not guaranteed to use the same order (pre Python 3.6).

BTW, it's not a good idea to shadow a module name with your own variables, like you do with that lambda. It makes the code harder to read, and will lead to confusing error messages if you forget that you shadowed the module & try to access some other name from it later in the program.

edited Oct 23 '18 at 12:40

answered Oct 23 '18 at 12:28


does it mean that the pickled representation also will stay the same (considering I don't have complex/nested values, let's say - integers)? – d-d Oct 23 '18 at 12:30
I believe that at some point (3.5?) key order was even _forced_ to be different on every run – Slam Oct 23 '18 at 12:31
@d-d `pickle` doesn't care about the string repr, it uses `.items`. But your code should not depend on pickle maintaining key order (before Python 3.6). – PM 2Ring Oct 23 '18 at 12:34
@Slam The exact behaviour depends on the setting of the PYTHONHASHSEED environment variable. IIRC, from 3.3 to 3.5, the default is to use a random hash seed, to prevent DoS attacks when dicts are used on servers. – PM 2Ring Oct 23 '18 at 12:35
when I say "string representation" in this case I mean the string, that is returned after `dumps` method called – d-d Oct 23 '18 at 12:38
@d-d Ah, ok. I did consider that possibility; I normally use the slightly more compact binary pickle protocols. – PM 2Ring Oct 23 '18 at 12:46
@d-d You can use protocol 2, see the docs for details. I mostly write Python 3 these days, which has a few more protocols. BTW, in Python 2, you can use cPickle, which runs faster than plain pickle, which is written in Python. – PM 2Ring Oct 23 '18 at 12:54
`Also, two different dict objects that have identical key-value pairs are not guaranteed to use the same order (pre Python 3.6).` Isn't this true also post-3.6? IIUC, this just says `d1 == d2` doesn't guarantee consistent internal ordering. – jpp Oct 25 '18 at 12:25
@jpp Indeed. Even with Python 3.6+, the orders of d1 & d2 will only be the same if the keys were added in the same order. – PM 2Ring Oct 25 '18 at 14:00
