11

I have a database (mysql) where I want to store pickled data.

The data can be for instance a dictionary, which may contain unicode, e.g.

data = {1 : u'é'}

and the database (mysql) is in utf-8.

When I pickle,

import pickle
pickled_data = pickle.dumps(data)
print type(pickled_data) # returns <type 'str'>

the resulting pickled_data is a string.

When I try to store this in a database (e.g. in a Textfield) this can causes problems. In particular, I'm getting at some point a

UnicodeDecodeError "'utf8' codec can't decode byte 0xe9 in position X"

when trying to save the pickled_data in the database. This makes sense because pickled_data can have non-utf-8 characters. My question is how do I store pickled_data on a utf-8 database?

I see two possible candidates:

  1. Encode the result of the pickle.dump to utf-8 and store it. When I want to pickle.load, I have to decode it.

  2. Store the pickled string in binary format (how?), which forces all characters to be within ascii.

My issue is that I'm not seeing what are the consequences of choosing one of this options in the long run. Since the change already requires some effort, I'm driven to ask for an opinion on this issue, asking for eventual better candidates.

(P.S. This is for instance useful in Django)

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
Jorge Leitao
  • 19,085
  • 19
  • 85
  • 121
  • Option 3: Store the unicode data as UTF-8 encoded strings. – Martijn Pieters Jun 25 '13 at 17:00
  • Option 4: Use a binary column type instead. – Martijn Pieters Jun 25 '13 at 17:01
  • 2
    Pickle data is *binary* data. You cannot encode that to UTF-8 (a text encoding). – Martijn Pieters Jun 25 '13 at 17:01
  • @MartijnPieters, from python's documentation on pickle: "By default, the pickle data format uses a printable ASCII representation. This is slightly more voluminous than a binary representation.". What do you mean it is binary data? Also, what do you mean with option 3? I always saw pickle has a way of storing data (and not just strings/unicode). – Jorge Leitao Jun 25 '13 at 20:34
  • Why exactly are you trying to pickle Unicode data, when the most natural representation would be to encode it to UTF-8? – Martijn Pieters Jun 25 '13 at 20:36
  • Django has a standard for using unicode and I follow it. The problem happens when I use pickle to store data. It transforms any unicode in ascii representation *while pickled*. I only use the unpickled data on django, but I use the pickled data to store it in the db. – Jorge Leitao Jun 25 '13 at 20:42
  • Are you saying you have a problem with the pickle data *itself* (so the pickle stream format) being transformed? This is *far* from clear. – Martijn Pieters Jun 25 '13 at 20:45
  • 1
    @MartijnPieters, I edited and improved the question. Thanks for the input. – Jorge Leitao Jun 25 '13 at 21:01
  • 1
    I'm not the only one caught out by the wording in the Python 2 documentation for pickle protocol 0 then. I see they have improved the wording in the Python 3 docs to make it clear that all protocols are binary. – foz Mar 19 '15 at 15:49

1 Answers1

19

Pickle data is opaque, binary data, even when you use protocol version 0:

>>> pickle.dumps(data, 0)
'(dp0\nI1\nV\xe9\np1\ns.'

When you try to store that in a TextField, Django will try to decode that data to UTF8 to store it; this is what fails because this is not UTF-8 encoded data; it is binary data instead:

>>> pickled_data.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 9: invalid continuation byte

The solution is to not try to store this in a TextField. Use a BinaryField instead:

A field to store raw binary data. It only supports bytes assignment. Be aware that this field has limited functionality. For example, it is not possible to filter a queryset on a BinaryField value.

You have a bytes value (Python 2 strings are byte strings, renamed to bytes in Python 3).

If you insist on storing the data in a text field, explicitly decode it as latin1; the Latin 1 codec maps bytes one-on-one to Unicode codepoints:

>>> pickled_data.decode('latin1')
u'(dp0\nI1\nV\xe9\np1\ns.'

and make sure you encode it again before unpickling again:

>>> encoded = pickled_data.decode('latin1')
>>> pickle.loads(encoded)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/Libraries/buildout.python/parts/opt/lib/python2.7/pickle.py", line 1381, in loads
    file = StringIO(str)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 9: ordinal not in range(128)
>>> pickle.loads(encoded.encode('latin1'))
{1: u'\xe9'}

Do note that if you let this value go to the browser and back again in a text field, the browser is likely to have replaced characters in that data. Internet Explorer will replace \n characters with \r\n, for example, because it assumes it is dealing with text.

Not that you ever should allow accepting pickle data from a network connection in any case, because that is a security hole waiting for exploitation.

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • I will wait for django's binaryField in stable. Meanwhile I will use TextField. Thanks for the answer. – Jorge Leitao Jun 26 '13 at 06:08
  • I do apologize; I had not appreciated that `BinaryField` has only just been added to a dev release. I am surprised Django did not have a binary-data field before. – Martijn Pieters Jun 26 '13 at 13:24
  • 1
    @J.C.Leitão: And in case these pickles are data you accepted from the network, please do read http://www.zopatista.com/plone/2007/11/09/one-cookie-please/; pickle data should **never** be accepted from untrusted sources. – Martijn Pieters Jun 26 '13 at 13:28