2

I need to build a Python encoder so that I can reformat strings like this:

import codecs
codecs.encode("Random  UTF-8 String ☑⚠⚡", 'name_of_my_encoder')

The reason I'm asking Stack Overflow at all is that the encoded strings must pass this validation function. This is a hard constraint with no flexibility, because of how the strings have to be stored.

from string import ascii_letters
from string import digits

# '_' must be a str here: concatenating a str with a list raises TypeError
valid_characters = set(ascii_letters + digits + '_')

def validation_function(characters):
    for char in characters:
        if char not in valid_characters:
            raise Exception

Making an encoder seemed easy enough, but I'm not sure whether this encoder makes it harder to build a decoder. Here's the encoder I've written.

from codecs import encode
from string import ascii_letters
from string import digits

ALPHANUMERIC_SET = set(ascii_letters + digits)

def underscore_encode(chars_in):
    chars_out = list()
    for char in chars_in:
        if char not in ALPHANUMERIC_SET:
            chars_out.append('_{}_'.format(encode(char.encode(), 'hex').decode('ascii')))
        else:
            chars_out.append(char)
    return ''.join(chars_out)

I've only included it as an example; there's probably a better way to do this.

Edit 1 - Someone has wisely suggested just using base32 on the entire string, which I can definitely use. However, it would be preferable to have something that is 'somewhat readable', so an escaping system like https://en.wikipedia.org/wiki/Quoted-printable or https://en.wikipedia.org/wiki/Percent-encoding would be preferred.

Edit 2 - Proposed solutions must work on Python 3.4 or newer. Working in Python 2.7 as well is nice, but not required. I've added the python-3.x tag to help clarify this a little.

Techdragon
  • `chars_out.append('_{}_'.format(encode(char.encode(), 'hex').decode('ascii')))` what does this do? – xrisk Aug 16 '15 at 13:31
  • encode the whole binary string as [base 32](https://en.wikipedia.org/wiki/Base32) or [base 64](https://en.wikipedia.org/wiki/Base64) like in [MIME](https://en.wikipedia.org/wiki/MIME) – phuclv Aug 16 '15 at 13:41
  • @RishavKundu It inserts a hex unicode representation of the character between underscores, which are the only character I can reasonably use for an escape sequence. `>>> '_{}_'.format(encode('π'.encode(), 'hex').decode('ascii'))` prints out `'_cf80_'` – Techdragon Aug 16 '15 at 14:17
  • @Techdragon see my answer! Python will do all the work for you! – xrisk Aug 16 '15 at 14:18
  • @RishavKundu You definitely gave me some new ideas for how to try building this, but your code is python 2.x only. I'm unable to use Python 2.x code, I've deprecated it in all of my projects, and any 2.x only code now fails my test suites. Using the b32encode/b32decode requires a bytes object, and the bytes object doesn't concatenate so nicely with strings. which is why I wrote `'_{}_'.format(encode(char.encode(), 'hex').decode('ascii'))` instead of something like `'_{}_'.format(base64.b16encode('π'.encode('utf-8')))` – Techdragon Aug 16 '15 at 15:07
  • Which version of Python? – wwii Aug 16 '15 at 15:13
  • I've edited the question to clarify that I'm primarily looking for solutions that work under Python version 3.4 or higher. – Techdragon Aug 16 '15 at 15:24

5 Answers

2

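Use base32! Its alphabet is just the uppercase letters A-Z and the digits 2-7. You can’t use base64 because its alphabet includes +, / and the = padding character, which won’t pass your validator. Note that base32 also pads with = whenever the input length is not a multiple of 5 bytes, so strip the padding before storing and restore it before decoding; the example output below happens to need none.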

>>> import base64
>>>
>>> print base64.b32encode('Random 🐍 UTF-8 String ☑⚠⚡"')
KJQW4ZDPNUQPBH4QRUQFKVCGFU4CAU3UOJUW4ZZA4KMJDYU2UDRJVIJC
>>>
>>> print base64.b32decode('KJQW4ZDPNUQPBH4QRUQFKVCGFU4CAU3UOJUW4ZZA4KMJDYU2UDRJVIJC')
Random 🐍 UTF-8 String ☑⚠⚡"
>>> 
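For Python 3, where b32encode needs bytes, the same idea might look like the sketch below. The helper names b32_encode and b32_decode are made up for illustration, and the padding handling assumes the stored value is always the unpadded form.

```python
import base64


def b32_encode(text):
    """Encode text as base32 and drop the '=' padding (hypothetical helper)."""
    encoded = base64.b32encode(text.encode('utf-8')).decode('ascii')
    return encoded.rstrip('=')


def b32_decode(encoded):
    """Re-add the padding that b32_encode stripped, then decode."""
    # base32 output is always padded to a multiple of 8 characters
    padding = '=' * (-len(encoded) % 8)
    raw = base64.b32decode((encoded + padding).encode('ascii'))
    return raw.decode('utf-8')


if __name__ == '__main__':
    original = "Random UTF-8 String ☑⚠⚡"
    stored = b32_encode(original)
    print(stored)
    print(b32_decode(stored) == original)
```

The result contains only A-Z and 2-7, so it passes the validator, at the cost of not being readable at all.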
xrisk
2

This seems to do the trick. Basically, alphanumeric letters are left alone. Any non-alphanumeric character in the ASCII set is encoded as a \xXX escape code. All other unicode characters are encoded using the \uXXXX escape code. However, you've said you can't use \, but you can use _, thus all escape sequences are translated to start with a _. This makes decoding extremely simple. Just replace the _ with \ and then use the unicode-escape codec. Encoding is slightly more difficult as the unicode-escape codec leaves ASCII characters alone. So first you have to escape the relevant ASCII characters, then run the string through the unicode-escape codec, before finally translating all \ to _.

Code:

from string import ascii_letters, digits

# non-translating characters
ALPHANUMERIC_SET = set(ascii_letters + digits)    
# mapping all bytes to themselves, except '_' maps to '\'
ESCAPE_CHAR_DECODE_TABLE = bytes(bytearray(range(256)).replace(b"_", b"\\"))
# reverse mapping -- maps `\` back to `_`
ESCAPE_CHAR_ENCODE_TABLE = bytes(bytearray(range(256)).replace(b"\\", b"_"))
# encoding table for ASCII characters not in ALPHANUMERIC_SET
# (always two hex digits: the unicode-escape codec rejects a one-digit "\xa")
ASCII_ENCODE_TABLE = {i: u"_x{:02x}".format(i) for i in set(range(128)) ^ set(map(ord, ALPHANUMERIC_SET))}



def encode(s):
    s = s.translate(ASCII_ENCODE_TABLE) # translate ascii chars not in your set
    bytes_ = s.encode("unicode-escape")
    bytes_ = bytes_.translate(ESCAPE_CHAR_ENCODE_TABLE)
    return bytes_

def decode(s):
    s = s.translate(ESCAPE_CHAR_DECODE_TABLE)
    return s.decode("unicode-escape")

s = u"Random UTF-8 String ☑⚠⚡"
#s = '北亰'
print(s)
b = encode(s)
print(b)
new_s = decode(b)
print(new_s)

Which outputs:

Random UTF-8 String ☑⚠⚡
b'Random_x20UTF_x2d8_x20String_x20_u2611_u26a0_u26a1'
Random UTF-8 String ☑⚠⚡

This works on both Python 3.4 and Python 2.7, which is why the construction of ESCAPE_CHAR_{DE,EN}CODE_TABLE is a bit messy: bytes on Python 2.7 is an alias for str, which behaves differently from bytes on Python 3.4, so the tables are built from a bytearray. On Python 2.7, the encode function expects a unicode object, not a str.

Dunes
1

You could abuse URL quoting to get a format that passes your validation function, stays readable, and is easy to decode in other languages:

#!/usr/bin/env python3
import urllib.parse

def alnum_encode(text):
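    # quote() never escapes '-', '.', '_' (or '~' on Python 3.7+), so handle them by hand
    return urllib.parse.quote(text, safe='')\
        .replace('-', '%2d').replace('.', '%2e').replace('_', '%5f')\
        .replace('~', '%7e').replace('%', '_')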

def alnum_decode(underscore_encoded):
    return urllib.parse.unquote(underscore_encoded.replace('_','%'), errors='strict')

s = alnum_encode("Random 🐍 UTF-8 String ☑⚠⚡")
print(s)
print(alnum_decode(s))

Output

Random_20_F0_9F_90_8D_20UTF_2d8_20String_20_E2_98_91_E2_9A_A0_E2_9A_A1
Random 🐍 UTF-8 String ☑⚠⚡

Here's an implementation using a bytearray() (to move it to C later if necessary):

#!/usr/bin/env python3.5
from string import ascii_letters, digits

def alnum_encode(text, alnum=bytearray(ascii_letters+digits, 'ascii')):
    result = bytearray()
    for byte in bytearray(text, 'utf-8'):
        if byte in alnum:
            result.append(byte)
        else:
            result += b'_%02x' % byte
    return result.decode('ascii')
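The bytearray encoder above has no matching decoder in this answer. A minimal sketch of one, assuming every escape is exactly an underscore followed by two hex digits (the function name alnum_decode_bytes is hypothetical, not part of the answer):

```python
def alnum_decode_bytes(encoded):
    """Undo an encoding that writes each escaped UTF-8 byte as '_' + 2 hex digits."""
    data = bytearray(encoded, 'ascii')
    result = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ord('_'):
            hex_pair = data[i + 1:i + 3].decode('ascii')
            if len(hex_pair) != 2:
                raise ValueError('truncated escape at position %d' % i)
            # int() raises ValueError if the two characters are not hex digits
            result.append(int(hex_pair, 16))
            i += 3
        else:
            result.append(data[i])
            i += 1
    return result.decode('utf-8')


if __name__ == '__main__':
    print(alnum_decode_bytes('Random_20UTF_2d8_20String_20_e2_98_91'))
```

Because the escapes have a fixed width, no closing delimiter is needed and the decoder never has to look ahead more than two characters.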
jfs
  • With the downside of requiring much more space to store the encoded form. – xrisk Aug 17 '15 at 08:55
  • It seems the space is not an issue: `len(alnum_encode("Random UTF-8 String ☑⚠⚡")) == len(underscore_encode("Random UTF-8 String ☑⚠⚡"))` where [`underscore_encode()` is from the accepted answer](http://stackoverflow.com/a/32335031/4279) – jfs Sep 04 '15 at 19:57
1

Despite several good answers, I ended up with a solution that seems cleaner and more understandable, so I'm posting the code of my eventual solution to answer my own question.

from string import ascii_letters
from string import digits
from base64 import b16decode
from base64 import b16encode


ALPHANUMERIC_SET = set(ascii_letters + digits)


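def utf8_string_to_hex_string(s):
    # b16encode returns ASCII bytes, so decode them straight back to str
    return b16encode(s.encode('utf-8')).decode('ascii')


def hex_string_to_utf8_string(s):
    # s may be a str or a list of single hex characters
    return b16decode(''.join(s).encode('ascii')).decode('utf-8')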


def underscore_encode(chars_in):
    chars_out = list()
    for char in chars_in:
        if char not in ALPHANUMERIC_SET:
            chars_out.append('_{}_'.format(utf8_string_to_hex_string(char)))
        else:
            chars_out.append(char)
    return ''.join(chars_out)


def underscore_decode(chars_in):
    chars_out = list()
    decoding = False
    for char in chars_in:
        if char == '_':
            if not decoding:
                hex_chars = list()
                decoding = True
            elif decoding:
                decoding = False
                chars_out.append(hex_string_to_utf8_string(hex_chars))
        else:
            if not decoding:
                chars_out.append(char)
            elif decoding:
                hex_chars.append(char)
    return ''.join(chars_out)
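For comparison, the same `_HEX_` scheme can also be written with regular expressions. This is only a sketch of an alternative, not the code I actually use; the names underscore_encode_re and underscore_decode_re are made up here, and it assumes the input was produced by the matching encoder.

```python
import re
from binascii import hexlify, unhexlify

# Anything outside ASCII letters and digits gets escaped
_NON_ALNUM = re.compile(r'[^A-Za-z0-9]')
# An escape is a run of hex digits between two underscores
_ESCAPE = re.compile(r'_([0-9A-Fa-f]+)_')


def _escape_char(match):
    hex_text = hexlify(match.group().encode('utf-8')).decode('ascii').upper()
    return '_{}_'.format(hex_text)


def _unescape(match):
    return unhexlify(match.group(1).encode('ascii')).decode('utf-8')


def underscore_encode_re(text):
    return _NON_ALNUM.sub(_escape_char, text)


def underscore_decode_re(encoded):
    return _ESCAPE.sub(_unescape, encoded)


if __name__ == '__main__':
    encoded = underscore_encode_re("Random UTF-8 String ☑⚠⚡")
    print(encoded)
    print(underscore_decode_re(encoded))
```

Each escape carries its own closing underscore, so adjacent escapes such as `_20__2D_` still split unambiguously.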
Techdragon
0

If you want a transliteration of Unicode to ASCII (e.g. ç --> c), then check out the Unidecode package. Here are their examples:

>>> from unidecode import unidecode
>>> unidecode(u'ko\u017eu\u0161\u010dek')
'kozuscek'
>>> unidecode(u'30 \U0001d5c4\U0001d5c6/\U0001d5c1')
'30 km/h'
>>> unidecode(u"\u5317\u4EB0")
'Bei Jing '

Here's my example:

# -*- coding: utf-8 -*- 
from unidecode import unidecode
print(unidecode(u'快樂星期天'))

This gives as output*

Kuai Le Xing Qi Tian 

*may be nonsense, but at least it's ASCII


To remove punctuation, see this answer.

philshem