Dogma?
Character sets and encodings aren't used out of dogma; they are a necessity.
Hopefully, you will have read enough to understand why we have so many character sets in use. Unicode is obviously the way forward (having all characters mapped), but how do you transfer a Unicode character from one machine to another, or save it to disk?
We could use the Unicode code point value directly, but as code points need up to 21 bits and are conventionally stored in 32, each character would have to be saved/transferred as a whole 32 bits (aka UTF-32). The character a would be encoded as 0x00000061, which is a lot of wasted bits for just one character. UTF-16 is a little less wasteful when dealing with mostly ASCII, but UTF-8 is the best compromise for such text, as it needs only a single byte for each ASCII character.
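To make the size difference concrete, here is a minimal sketch (written for Python 3) that counts the bytes each encoding needs for a single character. The helper name encoded_sizes is made up for illustration; the big-endian codec variants are used only so that no byte order mark inflates the counts.

```python
def encoded_sizes(text):
    """Return the number of bytes needed to store text in each encoding."""
    # The "-be" codecs are used so that no byte order mark is prepended.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
        "utf-32": len(text.encode("utf-32-be")),
    }


# An ASCII letter, an accented letter, the euro sign and an emoji.
for ch in (u"a", u"\u00e9", u"\u20ac", u"\U0001F600"):
    print(repr(ch), encoded_sizes(ch))
```

For a the three encodings need 1, 2 and 4 bytes respectively. Characters outside ASCII cost more in UTF-8 (the euro sign takes 3 bytes there but only 2 in UTF-16), which is why UTF-8 is a compromise rather than always the smallest.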
Using decoded Unicode within code obviously frees developers from having to consider the intricacies of encoding, such as how many bytes equal a character.
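As a rough illustration of that point (a sketch for Python 3, with the helper name first_chars invented for the example), compare working on decoded text with working on the encoded bytes:

```python
text = u"caf\u00e9"          # decoded text
raw = text.encode("utf-8")   # the same text as UTF-8 bytes

print(len(text))  # 4 characters
print(len(raw))   # 5 bytes, because the accented letter takes two


def first_chars(raw_bytes, n, encoding="utf-8"):
    """Decode first, then slice, so a character is never cut in half."""
    return raw_bytes.decode(encoding)[:n]


print(first_chars(raw, 4) == text)

# Slicing the bytes instead cuts the last character in the middle.
try:
    raw[:4].decode("utf-8")
except UnicodeDecodeError as exc:
    print("truncated bytes are not valid UTF-8:", exc.reason)
```

Once the data is decoded, len() and slicing operate on characters, so the byte-level details stay out of the rest of the code.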
Solutions
Redis Client
As suggested by @J.F.Sebastian, the redis-py driver includes a decode_responses option on the Redis and Connection classes. When set to True, the client will decode the responses using the encoding option. By default, encoding = utf-8.
E.g.
import redis

r_server = redis.Redis('localhost', decode_responses=True)
city = r_server.get('city')
# type(city) is unicode (str on Python 3)
Wrapper Class
No longer required since the discovery of decode_responses.
It would appear that the Redis driver is rather simplistic: if you send it a Unicode string, it will convert it to the default encoding (UTF-8 in most cases). On response, Redis doesn't know the encoding, so the driver returns a str for you to decode as appropriate.
Therefore, it would be safer to encode your strings to UTF-8 before sending them to Redis and to decode them as UTF-8 on response. Other DB drivers are more advanced, so they receive and return Unicode strings.
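A minimal sketch of that explicit round trip, written for Python 3. So that it runs without a server, a hypothetical FakeByteStore class stands in for Redis here; it is not part of redis-py and only mimics the "bytes in, bytes out" behaviour described above.

```python
class FakeByteStore(object):
    """Hypothetical stand-in for a store that only keeps raw bytes."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Like the wire protocol, this store only understands bytes.
        if not isinstance(value, bytes):
            raise TypeError("value must be bytes")
        self._data[key] = value

    def get(self, key):
        # Returns the stored bytes, or None if the key is missing.
        return self._data.get(key)


store = FakeByteStore()
city = u"Z\u00fcrich"

store.set("city", city.encode("utf-8"))  # encode on the way out
raw = store.get("city")                  # bytes come back
restored = raw.decode("utf-8")           # decode on the way in

print(restored == city)
```

Both ends must agree on the encoding: decoding with a different one than was used to encode will either raise an error or silently give the wrong characters.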
But of course, you shouldn't be peppering your code with .encode() and .decode(). The common approach is to form "Unicode sandwiches", so that external data is decoded to Unicode on input and encoded on output. So how does that work for you? Wrap the Redis driver so that it returns what you want, thereby pushing the decoding back into the periphery of your code.
For example, it should be as simple as:
import redis


class UnicodeRedis(redis.Redis):
    def __init__(self, *args, **kwargs):
        # Remember the encoding used to decode replies; default to UTF-8.
        self.encoding = kwargs.get("encoding", "utf-8")
        super(UnicodeRedis, self).__init__(*args, **kwargs)

    def get(self, *args, **kwargs):
        result = super(UnicodeRedis, self).get(*args, **kwargs)
        # bytes is an alias of str on Python 2, so this check works on both.
        if isinstance(result, bytes):
            return result.decode(self.encoding)
        # Missing keys come back as None and are passed through unchanged.
        return result
You can then interact with it as normal, except that you can pass an encoding argument that changes how strings are decoded. If you don't set encoding, this code will assume utf-8.
E.g.
r_server = UnicodeRedis('localhost')
city = r_server.get('city')