How to encode UTF-8 strings with only "A-Z","a-z","0-9", and "_" in Python

Question

I need to build a python encoder so that I can reformat strings like this:

import codecs
codecs.encode("Random 🐍 UTF-8 String ☑⚠⚡", 'name_of_my_encoder')

The reason I'm asking Stack Overflow at all is that the encoded strings need to pass this validation function. This is a hard constraint with no flexibility; it's due to how the strings have to be stored.

from string import ascii_letters
from string import digits

valid_characters = set(ascii_letters + digits + '_')

def validation_function(characters):
    for char in characters:
        if char not in valid_characters:
            raise Exception

Making an encoder seemed easy enough, but I'm not sure whether this encoder makes it harder to build a decoder. Here's the encoder I've written.

from codecs import encode
from string import ascii_letters
from string import digits

ALPHANUMERIC_SET = set(ascii_letters + digits)

def underscore_encode(chars_in):
    chars_out = list()
    for char in chars_in:
        if char not in ALPHANUMERIC_SET:
            chars_out.append('_{}_'.format(encode(char.encode(), 'hex').decode('ascii')))
        else:
            chars_out.append(char)
    return ''.join(chars_out)

I've only included it as an example; there's probably a better way to do this.

Edit 1 - Someone has wisely pointed out just using base32 on the entire string, which I can definitely use. However, it would be preferable to have something that is 'somewhat readable', so an escaping system like https://en.wikipedia.org/wiki/Quoted-printable or https://en.wikipedia.org/wiki/Percent-encoding would be preferred.

Edit 2 - Proposed solutions must work on Python 3.4 or newer. Working in Python 2.7 as well is nice, but not required. I've added the python-3.x tag to help clarify this a little.

`chars_out.append('_{}_'.format(encode(char.encode(), 'hex').decode('ascii')))` what does this do? — xrisk, Aug 16 '15 at 13:31
encode the whole binary string as [base 32](https://en.wikipedia.org/wiki/Base32) or [base 64](https://en.wikipedia.org/wiki/Base64) like in [MIME](https://en.wikipedia.org/wiki/MIME) — phuclv, Aug 16 '15 at 13:41
@RishavKundu It inserts a hex unicode representation of the character between underscores, which are the only character I can reasonably use for an escape sequence. `>>> '_{}_'.format(encode('π'.encode(), 'hex').decode('ascii'))` prints out `'_cf80_'` — Techdragon, Aug 16 '15 at 14:17
@Techdragon see my answer! Python will do all the work for you! — xrisk, Aug 16 '15 at 14:18
@RishavKundu You definitely gave me some new ideas for how to try building this, but your code is python 2.x only. I'm unable to use Python 2.x code, I've deprecated it in all of my projects, and any 2.x only code now fails my test suites. Using the b32encode/b32decode requires a bytes object, and the bytes object doesn't concatenate so nicely with strings. which is why I wrote `'_{}_'.format(encode(char.encode(), 'hex').decode('ascii'))` instead of something like `'_{}_'.format(base64.b16encode('π'.encode('utf-8')))` — Techdragon, Aug 16 '15 at 15:07
I've edited the question to clarify that I'm primarily looking for solutions that work under Python version 3.4 or higher. — Techdragon, Aug 16 '15 at 15:24

xrisk · Answer 1 · score 2 · 2015-08-17T10:04:03.190

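Use base32! Its alphabet is only the 26 uppercase letters A-Z and the digits 2-7. It does pad the output with = when the input length is not a multiple of 5 bytes, so strip the padding before storing and restore it before decoding. You can't use base64 because its alphabet also includes + and /, which won't pass your validator.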

>>> import base64
>>>
>>> print base64.b32encode('Random 🐍 UTF-8 String ☑⚠⚡"')
KJQW4ZDPNUQPBH4QRUQFKVCGFU4CAU3UOJUW4ZZA4KMJDYU2UDRJVIJC
>>>
>>> print base64.b32decode('KJQW4ZDPNUQPBH4QRUQFKVCGFU4CAU3UOJUW4ZZA4KMJDYU2UDRJVIJC')
Random 🐍 UTF-8 String ☑⚠⚡"
>>>
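
The session above is Python 2. A rough Python 3 equivalent might look like the sketch below; the function names `b32_alnum_encode` and `b32_alnum_decode` are made up for illustration, and stripping the `=` padding is an assumption about how one would satisfy the validator, not something stated in the answer.

```python
import base64

def b32_alnum_encode(text):
    # b32encode works on bytes, so go through UTF-8 first;
    # the '=' padding is dropped because the validator rejects it
    raw = base64.b32encode(text.encode('utf-8'))
    return raw.decode('ascii').rstrip('=')

def b32_alnum_decode(encoded):
    # b32decode needs the length to be a multiple of 8, so restore the padding
    padding = '=' * (-len(encoded) % 8)
    return base64.b32decode(encoded + padding).decode('utf-8')

encoded = b32_alnum_encode("Random UTF-8 String ☑⚠⚡")
print(encoded)
print(b32_alnum_decode(encoded))
```

The result contains only A-Z and 2-7, but unlike an escaping scheme it is not readable at a glance.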

edited Aug 17 '15 at 10:04

answered Aug 16 '15 at 14:16


This only behaves as expected in Python-2.x – Techdragon Aug 16 '15 at 15:11
@Techdragon: It should be trivial to adapt it for Python 3. If you don't know how; ask a separate question: include working Python 2 code and example input output. – jfs Aug 16 '15 at 23:24
the thing is his set of allowed characters has only 63 different values, not 64 – phuclv Aug 17 '15 at 09:59
yeah. I also thought of using base64 at first, but I've just had a look back on this and notice the set is not enough – phuclv Aug 17 '15 at 11:21
Isn't symbol = used in base32 too? – Chen Zhuo Jul 16 '22 at 02:54

score 2 · Answer 2 · answered Aug 16 '15 at 17:49

This seems to do the trick. Basically, alphanumeric characters are left alone. Any non-alphanumeric character in the ASCII set is encoded as a \xXX escape code. All other Unicode characters are encoded by the unicode-escape codec, which produces \xXX, \uXXXX, or \UXXXXXXXX depending on the code point. However, you've said you can't use \, but you can use _, so all escape sequences are translated to start with a _.

This makes decoding extremely simple: replace each _ with \ and then use the unicode-escape codec. Encoding is slightly more difficult because the unicode-escape codec leaves ASCII characters alone. So first you have to escape the relevant ASCII characters, then run the string through the unicode-escape codec, and finally translate every \ to _.

Code:

from string import ascii_letters, digits

# non-translating characters
ALPHANUMERIC_SET = set(ascii_letters + digits)    
# mapping all bytes to themselves, except '_' maps to '\'
ESCAPE_CHAR_DECODE_TABLE = bytes(bytearray(range(256)).replace(b"_", b"\\"))
# reverse mapping -- maps `\` back to `_`
ESCAPE_CHAR_ENCODE_TABLE = bytes(bytearray(range(256)).replace(b"\\", b"_"))
# encoding table for ASCII characters not in ALPHANUMERIC_SET
# (always two hex digits: "\x9" is not a valid escape, "\x09" is)
ASCII_ENCODE_TABLE = {i: u"_x{:02x}".format(i) for i in set(range(128)) ^ set(map(ord, ALPHANUMERIC_SET))}



def encode(s):
    s = s.translate(ASCII_ENCODE_TABLE) # translate ascii chars not in your set
    bytes_ = s.encode("unicode-escape")
    bytes_ = bytes_.translate(ESCAPE_CHAR_ENCODE_TABLE)
    return bytes_

def decode(s):
    s = s.translate(ESCAPE_CHAR_DECODE_TABLE)
    return s.decode("unicode-escape")

s = u"Random UTF-8 String ☑⚠⚡"
#s = '北亰'
print(s)
b = encode(s)
print(b)
new_s = decode(b)
print(new_s)

Which outputs:

Random UTF-8 String ☑⚠⚡
b'Random_x20UTF_x2d8_x20String_x20_u2611_u26a0_u26a1'
Random UTF-8 String ☑⚠⚡

This works on both Python 3.4 and Python 2.7, which is why the ESCAPE_CHAR_{DE,EN}CODE_TABLE construction is a bit messy: bytes on Python 2.7 is an alias for str, which works differently from bytes on Python 3.4. This is why the tables are constructed using a bytearray. On Python 2.7, the encode function expects a unicode object, not a str.
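
One thing to check on Python 3: `encode()` above returns a `bytes` object, and iterating over `bytes` yields integers, so a character-set check like the question's validator would reject it unless it is first decoded to `str`. A minimal sketch, assuming the allowed set is letters, digits, and `_`; the `passes` helper is hypothetical and only for illustration:

```python
from string import ascii_letters, digits

VALID_CHARACTERS = set(ascii_letters + digits + '_')

def passes(chars):
    # hypothetical helper: True if every element is an allowed character
    return all(ch in VALID_CHARACTERS for ch in chars)

encoded = b'Random_x20UTF_x2d8_x20String_x20_u2611_u26a0_u26a1'
print(passes(encoded))                  # False on Python 3: elements are ints
print(passes(encoded.decode('ascii')))  # True: elements are one-character strings
```

Decoding the result with `'ascii'` is safe here because the encoded form contains nothing outside ASCII.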

jfs · Answer 3 · 2015-09-04T19:59:14.803

You could abuse URL quoting to get a format that is readable, easy to decode in other languages, and passes your validation function:

#!/usr/bin/env python3
import urllib.parse

def alnum_encode(text):
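    # quote() leaves '-', '.', '_' (and, since Python 3.7, '~') unescaped,
    # so escape those by hand before turning every '%' into '_'
    return urllib.parse.quote(text, safe='')\
        .replace('-', '%2d').replace('.', '%2e').replace('_', '%5f')\
        .replace('~', '%7e')\
        .replace('%', '_')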

def alnum_decode(underscore_encoded):
    return urllib.parse.unquote(underscore_encoded.replace('_','%'), errors='strict')

s = alnum_encode("Random 🐍 UTF-8 String ☑⚠⚡")
print(s)
print(alnum_decode(s))

Output

Random_20_F0_9F_90_8D_20UTF_2d8_20String_20_E2_98_91_E2_9A_A0_E2_9A_A1
Random 🐍 UTF-8 String ☑⚠⚡

Here's an implementation using a bytearray() (to move it to C later if necessary):

#!/usr/bin/env python3.5
from string import ascii_letters, digits

def alnum_encode(text, alnum=bytearray(ascii_letters+digits, 'ascii')):
    result = bytearray()
    for byte in bytearray(text, 'utf-8'):
        if byte in alnum:
            result.append(byte)
        else:
            result += b'_%02x' % byte
    return result.decode('ascii')
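
The bytearray version above only encodes. A possible decoder for that `_xx` format is sketched below, under the assumption that every underscore in the encoded text is followed by exactly two hex digits (which holds for anything `alnum_encode()` produces). The encoder is restated in a form that also runs on Python 3.4, since `%`-formatting of bytes needs 3.5; `ALNUM` and this `alnum_decode` are illustrative names, not part of the answer.

```python
import re
from string import ascii_letters, digits

ALNUM = bytearray(ascii_letters + digits, 'ascii')

def alnum_encode(text):
    # same scheme as above: keep ASCII alphanumerics, escape every other byte as _xx
    result = bytearray()
    for byte in bytearray(text, 'utf-8'):
        if byte in ALNUM:
            result.append(byte)
        else:
            result += '_{:02x}'.format(byte).encode('ascii')
    return result.decode('ascii')

def alnum_decode(encoded):
    # turn each _xx escape back into the single byte it stands for
    raw = re.sub(br'_([0-9a-fA-F]{2})',
                 lambda m: bytes(bytearray([int(m.group(1), 16)])),
                 encoded.encode('ascii'))
    return raw.decode('utf-8')

s = "Random UTF-8 String ☑⚠⚡"
assert alnum_decode(alnum_encode(s)) == s
print(alnum_encode("a b_c"))  # a_20b_5fc
```

Because each escape has a fixed width, no closing delimiter is needed and alphanumerics that follow an escape are never misread as part of it.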

With the downside of requiring much much more space to store the encoded form. — xrisk, Aug 17 '15 at 08:55
It seems the space is not an issue: `len(alnum_encode("Random UTF-8 String ☑⚠⚡")) == len(underscore_encode("Random UTF-8 String ☑⚠⚡"))` where [`underscore_encode()` is from the accepted answer](http://stackoverflow.com/a/32335031/4279) — jfs, Sep 04 '15 at 19:57

score 1 · Accepted Answer · answered Sep 01 '15 at 15:05

Despite several good answers, I ended up with a solution that seems cleaner and more understandable, so I'm posting the code of my eventual solution to answer my own question.

from string import ascii_letters
from string import digits
from base64 import b16decode
from base64 import b16encode


ALPHANUMERIC_SET = set(ascii_letters + digits)


def utf8_string_to_hex_string(s):
    return ''.join(chr(i) for i in b16encode(s.encode('utf-8')))


def hex_string_to_utf8_string(s):
    return b16decode(bytes(list((ord(i) for i in s)))).decode('utf-8')


def underscore_encode(chars_in):
    chars_out = list()
    for char in chars_in:
        if char not in ALPHANUMERIC_SET:
            chars_out.append('_{}_'.format(utf8_string_to_hex_string(char)))
        else:
            chars_out.append(char)
    return ''.join(chars_out)


def underscore_decode(chars_in):
    chars_out = list()
    decoding = False
    for char in chars_in:
        if char == '_':
            if not decoding:
                hex_chars = list()
                decoding = True
            elif decoding:
                decoding = False
                chars_out.append(hex_string_to_utf8_string(hex_chars))
        else:
            if not decoding:
                chars_out.append(char)
            elif decoding:
                hex_chars.append(char)
    return ''.join(chars_out)
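
For comparison, the same `_HEX_` escaping can be written more compactly with `re.sub`. This is a sketch rather than the code above: it assumes uppercase hex as produced by `b16encode`, and relies on the fact that hex digits never contain `_`, so each escape is unambiguous. The inner helper names `escape` and `unescape` are illustrative.

```python
import re
from base64 import b16decode, b16encode

def underscore_encode(text):
    # replace each non-alphanumeric character with _<hex of its UTF-8 bytes>_
    def escape(match):
        hex_digits = b16encode(match.group(0).encode('utf-8')).decode('ascii')
        return '_{}_'.format(hex_digits)
    return re.sub(r'[^A-Za-z0-9]', escape, text)

def underscore_decode(encoded):
    # replace each _<hex>_ escape with the character it encodes
    def unescape(match):
        return b16decode(match.group(1).encode('ascii')).decode('utf-8')
    return re.sub(r'_([0-9A-F]+)_', unescape, encoded)

s = "Random UTF-8 String ☑⚠⚡"
encoded = underscore_encode(s)
print(encoded)
assert underscore_decode(encoded) == s
print(underscore_encode("a b_c"))  # a_20_b_5F_c
```

A literal underscore in the input is itself escaped (as `_5F_`), which is what keeps the decoder from confusing data with delimiters.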

score 0 · Answer 5 · edited May 23 '17 at 10:26

If you want a transliteration of Unicode to ASCII (e.g. ç --> c), then check out the Unidecode package. Here are their examples:

>>> from unidecode import unidecode
>>> unidecode(u'ko\u017eu\u0161\u010dek')
'kozuscek'
>>> unidecode(u'30 \U0001d5c4\U0001d5c6/\U0001d5c1')
'30 km/h'
>>> unidecode(u"\u5317\u4EB0")
'Bei Jing '

Here's my example:

# -*- coding: utf-8 -*- 
from unidecode import unidecode
print(unidecode(u'快樂星期天'))

Gives as an output*

Kuai Le Xing Qi Tian

*may be nonsense, but at least it's ASCII

To remove punctuation, see this answer.

answered Aug 16 '15 at 13:29

philshem


This encoding doesn't produce output that will always pass the validator function. – Techdragon Aug 16 '15 at 15:58
