
After two questions regarding the distinction between the datatypes str and unicode, I'm still puzzled at the following.

In Block 1 we see that the type of the city is unicode, as we're expecting.

Yet in Block 2, after a round-trip through disk (redis), the type of the city is str (and the representation is different).

The dogma of storing utf-8 on disk, reading into unicode, and writing back in utf-8 is failing somewhere.

Why is the second instance of type(city) str rather than unicode?

Just as importantly, does it matter? Do you care whether your variables are unicode or str, or are you oblivious to the difference just so long as the code "does the right thing"?

# -*- coding: utf-8 -*-

# Block 1
city = u'Düsseldorf'
print city, type(city), repr(city)
# Düsseldorf <type 'unicode'> u'D\xfcsseldorf'

# Block 2
import redis
r_server = redis.Redis('localhost')
r_server.set('city', city)
city = r_server.get('city')
print city, type(city), repr(city)
# Düsseldorf <type 'str'> 'D\xc3\xbcsseldorf'
Calaf
  • Well, it means that the Redis API returns a binary `str` and not a Unicode object. There's nothing necessarily wrong, it's just what the API does. – deceze Mar 01 '16 at 15:15
  • Possible duplicate of http://stackoverflow.com/questions/10599147/save-unicode-in-redis-but-fetch-error – cdarke Mar 01 '16 at 15:18
  • You should watch the talk given by Ned Batchelder on ["pragmatic unicode"](http://nedbatchelder.com/text/unipain.html) – mgilson Mar 01 '16 at 16:07
  • @mgilson I did. That's the dogma I'm referring to. (I'm not using that word disparagingly, merely to signal that it's useful to distill his advice until one understands enough how "to unicode" one's code throughout.) – Calaf Mar 01 '16 at 17:28
  • @Calaf -- Ah, so you did. Sorry, I didn't follow the link. In this case, you'd probably want to decode into bytes when you server.set and then encode the result: `r_server.set('city', city.decode('utf-8')); city = r_server.get('city').encode('utf-8')` – mgilson Mar 01 '16 at 17:31
  • @mgilson What would you advise someone who naïvely already wrote substantial code while having the most marginal concept of what unicode entails, and then learned unicode afterwards. This answer http://stackoverflow.com/a/35712646/704972 makes it conceivable that retroactively fitting unicode on non-unicoded code may be doable without peppering the code with .decode(..)/.encode(..). I'm still vainly hoping it's true! – Calaf Mar 01 '16 at 17:44
  • @Calaf -- I don't deal with unicode a whole lot, so I'm probably not the best one to answer your question ... With that said, I've definitely felt the pain of not handling unicode well in my codebases so I think that the effort to go through and make the unicode sandwiches is worth it. Looking at that post, it appears that `redis`'s `decode_responses` might help make that easier (you don't need to do the decoding yourself). – mgilson Mar 01 '16 at 17:48
  • @Calaf I think you got the encode and decode swapped, you can encode `u'Düsseldorf'` to bytes, and decode when reading back. Note that Python 3 will raise an error if you make these mistakes, which makes debugging any problems a lot easier. – roeland Mar 01 '16 at 19:29

2 Answers


Dogma?

It's not dogmatic why character sets and encodings are used - it's a necessity. Hopefully, you will have read enough to understand why we have so many character sets in use. Unicode is obviously the way forward (having all characters mapped), but how do you transfer a Unicode character from one machine to another, or save it to disk?

We could use the Unicode code point value directly, but a fixed-width encoding needs 32 bits per code point (aka UTF-32), so every character would have to be saved/transferred as a whole 32 bits. The letter a would be encoded as 0x00000061 - that's a lot of wasted bits for one character. UTF-16 is less wasteful for mostly-ASCII text, but UTF-8 is the best compromise there, because it needs only a single byte for each ASCII character.
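
To make the size difference concrete, here is a small sketch that encodes the city name from the question under each encoding and counts the bytes. The variable names are only illustrative, and the "-le"/"-be" codec variants are chosen so that no byte-order mark is added to the output.

```python
# Encoded size of the same text under three Unicode encodings.
# The explicit-endianness codecs are used so that no BOM is prepended.
city = u'D\xfcsseldorf'  # u'Düsseldorf', 10 code points

sizes = {}
for name in ('utf-32-le', 'utf-16-le', 'utf-8'):
    sizes[name] = len(city.encode(name))

print(sizes)
# utf-32-le: 40 bytes, utf-16-le: 20 bytes, utf-8: 11 bytes

# The single letter "a" in big-endian UTF-32 is the 0x00000061 mentioned above.
encoded_a = u'a'.encode('utf-32-be')
print(repr(encoded_a))
```

Only the ü costs more than one byte in UTF-8, which is why the UTF-8 form is 11 bytes for 10 characters.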

Using decoded Unicode within code obviously frees developers from having to consider the intricacies of encoding, such as how many bytes equal a character.

Solutions

Redis Client

As suggested by @J.F.Sebastian, the redis-py driver includes a decode_responses option on the Redis and Connection classes. When set to True the client will decode the responses using the encoding option. By default encoding = utf-8.

E.g.

r_server = redis.Redis('localhost', decode_responses=True)
city = r_server.get('city')
# type(city) is now <type 'unicode'>

Wrapper Class

No longer required since discovery of decode_responses.

It would appear that the Redis driver is rather simplistic - it so happens that if you send a unicode object it'll encode it using the default encoding (UTF-8 in most cases). On response, the driver doesn't know the encoding, so it returns a str for you to decode as appropriate.

Therefore, it would be safer to encode your strings to UTF-8 before sending them to Redis and to decode as UTF-8 on response. Other DB drivers are more advanced, so they receive and return unicode objects.
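
As a rough illustration of that explicit round trip, without involving a Redis server at all, the following sketch encodes the city from the question and decodes it again. The names `stored` and `restored` are just placeholders for "what goes to Redis" and "what you decode after reading it back".

```python
# Explicit encode-on-write / decode-on-read, shown on plain values.
city = u'D\xfcsseldorf'             # u'Düsseldorf' as a unicode object

stored = city.encode('utf-8')       # the bytes that would be sent to Redis
print(repr(stored))                 # same bytes as the str seen in Block 2

restored = stored.decode('utf-8')   # what you would do with the get() result
print(restored == city)             # True
```

The bytes in `stored` are the `'D\xc3\xbcsseldorf'` from Block 2 of the question: the ü (U+00FC) becomes the two bytes 0xC3 0xBC in UTF-8.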

But of course, you shouldn't be peppering your code with .encode() and .decode(). The common approach is to form "Unicode sandwiches", so that external data is decoded to Unicode on input and encoded on output. So how does that work for you? Wrap the Redis driver so that it returns what you want, thereby pushing the decoding back into the periphery of your code.

For example, it should be as simple as:

import redis


class UnicodeRedis(redis.Redis):
    """redis.Redis subclass whose get() returns unicode instead of str."""

    def __init__(self, *args, **kwargs):
        # Remember the encoding for get(); it is still passed on to redis.Redis.
        self.encoding = kwargs.get("encoding", "utf-8")
        super(UnicodeRedis, self).__init__(*args, **kwargs)

    def get(self, *args, **kwargs):
        result = super(UnicodeRedis, self).get(*args, **kwargs)
        # A missing key comes back as None and is passed through unchanged.
        if isinstance(result, str):
            return result.decode(self.encoding)
        return result

You can then interact with it as normal except that you can pass an encoding argument that changes how strings are decoded. If you don't set encoding, this code will assume utf-8.

E.g.

r_server = UnicodeRedis('localhost')
city = r_server.get('city')
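
The wrapping idea can also be tried without a running Redis server. The sketch below swaps the driver for a hypothetical in-memory `BytesStore` (not part of redis-py) that, like the driver described above, encodes text on `set` and hands back raw bytes on `get`. `UnicodeStore` then applies the same override as `UnicodeRedis`. It checks against `bytes` rather than `str` so that it runs on both Python 2 and Python 3.

```python
class BytesStore(object):
    """Hypothetical stand-in for a client that only hands back bytes."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Mimic the driver: text is encoded to UTF-8 before being stored.
        if not isinstance(value, bytes):
            value = value.encode('utf-8')
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class UnicodeStore(BytesStore):
    """Same wrapping idea as UnicodeRedis: decode on the way out."""

    def __init__(self, encoding='utf-8'):
        super(UnicodeStore, self).__init__()
        self.encoding = encoding

    def get(self, key):
        result = super(UnicodeStore, self).get(key)
        if isinstance(result, bytes):
            return result.decode(self.encoding)
        return result  # None for a missing key


store = UnicodeStore()
store.set('city', u'D\xfcsseldorf')
city = store.get('city')          # text again, not bytes
missing = store.get('no-such-key')
print(repr(city))
```

The calling code only ever sees text, while the encoding and decoding stay at the edge, inside the wrapper.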

Alastair McCormack
  • 1- A byte is as much an abstraction as a Unicode codepoint (consider any network API on a system where a byte is not octet). OS can provide Unicode API for files, network, etc (Windows does it in many cases). And it is certainly a deficiency that Redis Python bindings return text as binary blobs. 2- `sys.maxunicode.bit_length()` is beside the point here e.g., Python 3.3+ uses a flexible internal representation. Or even simpler: a library can provide Unicode API while encoding/decoding internally to whatever representation is the most useful in a particular case (as your `UnicodeRedis` does) – jfs Mar 02 '16 at 11:37
  • Passing `decode_responses=True` could be used instead of encoding/decoding manually in UnicodeRedis. – jfs Mar 02 '16 at 11:58
  • Ah, thanks @J.F.Sebastian. I searched high and low for such a property. – Alastair McCormack Mar 02 '16 at 12:01
  • By "dogma" I meant "a tidbit of useful, easily remembered, knowledge", and not, as you seem to have interpreted, as less than very accurate or unnecessary one. :) – Calaf Mar 05 '16 at 17:20
  • Sorry @calaf, I understand "dogma" as defined here: http://dictionary.reference.com/browse/dogma . If only dogma was just a tidbit of useful :) – Alastair McCormack Mar 05 '16 at 18:29
  • where is "encoding" declared in the line "return result.decode(encoding)" – Yijin Dec 28 '16 at 11:48
  • @Yijin good spot - I neglected the `self` namespace. It should read: `self.encoding`. I've now fixed the code sample – Alastair McCormack Dec 28 '16 at 12:10
  • @Yijin I've now updated the answer to use J.F.Sebastian's suggestion – Alastair McCormack Dec 28 '16 at 12:27

As J.F. Sebastian stated, the redis-py API supports decoding responses to unicode by passing decode_responses=True to the constructor of the redis.Redis class.