
Given a Unicode string and these requirements:

  • The string must be encoded into some byte-sequence format (e.g. UTF-8 or JSON Unicode escape)
  • The encoded string has a maximum length

For example, the iPhone push service requires JSON encoding with a maximum total packet size of 256 bytes.

What is the best way to truncate the string so that it still re-encodes to valid Unicode and displays reasonably correctly?

(Human language comprehension is not necessary; the truncated version can look odd, e.g. an orphaned combining character or a Thai vowel, as long as the software doesn't crash when handling the data.)

– JasonSmith

5 Answers

def unicode_truncate(s, length, encoding='utf-8'):
    # Cut the encoded bytes at the limit, then let the 'ignore' error
    # handler drop any partial trailing code point on decode.
    encoded = s.encode(encoding)[:length]
    return encoded.decode(encoding, 'ignore')

Here is an example with a Unicode string where each character is represented by 2 bytes in UTF-8; it would have raised a UnicodeDecodeError if the split code point had not been ignored:

>>> unicode_truncate(u'абвгд', 5)
u'\u0430\u0431'
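
For what it's worth, the same function runs unchanged on Python 3, where str is already Unicode; the same call there gives:

>>> unicode_truncate('абвгд', 5)
'аб'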
– Denis Otkidach
  • I really like this suggestion! Very few lines of code and it seems like it would work in most cases. Obviously it might screw up combining characters but I explicitly said that is okay in the question. – JasonSmith Dec 01 '09 at 04:20
  • Note that this answer relies on `'ignore'` to sometimes avoid `UnicodeDecodeError: 'utf-8' codec can't decode bytes in position [x-y]: unexpected end of data`. This use of `'ignore'` is not an issue because it is applied only to `decode()`, never to `encode()`. – Asclepius Jun 03 '19 at 14:18

One of UTF-8's properties is that it is easy to resynchronize, that is, to find Unicode character boundaries in the encoded byte stream. All you need to do is cut the encoded string at the maximum length, then walk backwards from the end removing any bytes that are > 127; those are part of, or the start of, a multibyte character.

As written, this is too simple: it will erase back to the last ASCII character, possibly the whole string. What we actually need to check is whether a two-byte (starting 110yyyxx), three-byte (1110yyyy) or four-byte (11110zzz) sequence has been truncated.

Here is a Python 2.6 implementation in clear code. Optimization should not be an issue: regardless of length, we only check the last 1-4 bytes.

# coding: UTF-8

def decodeok(bytestr):
    try:
        bytestr.decode("UTF-8")
    except UnicodeDecodeError:
        return False
    return True

def is_first_byte(byte):
    """return if the UTF-8 @byte is the first byte of an encoded character"""
    # Continuation bytes look like 0b10xxxxxx; anything else starts a character.
    return (ord(byte) & 0b11000000) != 0b10000000

def truncate_utf8(bytestr, maxlen):
    u"""

    >>> us = u"ウィキペディアにようこそ"
    >>> s = us.encode("UTF-8")

    >>> trunc20 = truncate_utf8(s, 20)
    >>> print trunc20.decode("UTF-8")
    ウィキペディ
    >>> len(trunc20)
    18

    >>> trunc21 = truncate_utf8(s, 21)
    >>> print trunc21.decode("UTF-8")
    ウィキペディア
    >>> len(trunc21)
    21
    """
    L = min(maxlen, len(bytestr))  # never index past the end of the data
    for x in xrange(1, 5):
        if x > L:
            break
        if is_first_byte(bytestr[L-x]) and not decodeok(bytestr[L-x:L]):
            return bytestr[:L-x]  # drop the split multibyte sequence
    return bytestr[:L]

if __name__ == '__main__':
    # unicode doctest hack
    import sys
    reload(sys)
    sys.setdefaultencoding("UTF-8")
    import doctest
    doctest.testmod()
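
For readers on Python 3, here is a condensed sketch of the same walk-back (not the author's code; indexing bytes yields ints there, so ord() is unnecessary):

def truncate_utf8_py3(data, maxlen):
    """Cut UTF-8 bytes at maxlen without splitting a code point (sketch)."""
    cut = data[:maxlen]
    # Only the last 1-4 bytes can belong to a sequence split by the cut.
    for x in range(1, min(5, len(cut) + 1)):
        if cut[-x] & 0b11000000 != 0b10000000:  # non-continuation byte
            try:
                cut[-x:].decode("utf-8")
            except UnicodeDecodeError:
                return cut[:-x]  # trailing sequence was split; drop it
            break
    return cut

>>> truncate_utf8_py3('ウィキ'.encode('utf-8'), 5).decode('utf-8')
'ウ'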
– u0b34a0f6ae
  • Thanks, kaizer.se. I implemented a very similar algorithm for JSON's backslash-escaping format but it's great to know the UTF-8 solution! – JasonSmith Nov 27 '09 at 16:13
  • Careful here if you pass the string to be serialized as JSON: if the string contains certain characters, these will get escaped, and the string's size will grow, so you cannot simply truncate the original UTF-8 to X bytes. (Say the string was r'\\\\\\\\\\\\\', i.e. X backslashes; when serialized to JSON it would double in size. This is illustrated after these comments.) – Thanatos Nov 28 '09 at 06:00
  • @Thanatos: I understood it as if there were two alternatives in the question: either serialize as a UTF-8 bytestream or as a JSON object, not a composition thereof. – u0b34a0f6ae Nov 28 '09 at 12:10
  • @u0b34a0f6ae For the Apple push service, there is a payload length limitation of 256 bytes, and the bytes to count are UTF-8-encoded JSON. The tricky part is that you need to truncate the field value(s) within the JSON payload enough to stay under the 256-byte limit where possible. The 1 to 4 bytes-per-code-point UTF-8 encoding option works with Apple's push service, but last time I checked I had no success with the \uXXXX\uYYYY surrogate-pair approach (part of the JSON spec) for encoding a code point. – Stefan L May 26 '14 at 15:58
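
To make Thanatos's point concrete, a small illustration (hypothetical input, using the standard json module):

>>> import json
>>> s = u'\\' * 10            # 10 backslashes: 10 bytes in UTF-8
>>> len(json.dumps(s)) - 2    # JSON-escaped size, excluding the quotes
20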

This will do for UTF-8, if you like to do it with a regex.

import re

partial="\xc2\x80\xc2\x80\xc2"

re.sub("([\xf6-\xf7][\x80-\xbf]{0,2}|[\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$","",partial)

"\xc2\x80\xc2\x80"

It covers UTF-8 strings from U+0080 (2 bytes) up to U+10FFFF (4 bytes).

It's really straightforward, just like the UTF-8 algorithm itself.

From U+0080 to U+07FF, 2 bytes are needed: 110yyyxx 10xxxxxx. That means if you see only one byte at the end like 110yyyxx (0b11000000 to 0b11011111), i.e. [\xc0-\xdf], it is a partial one.

From U+0800 to U+FFFF, 3 bytes are needed: 1110yyyy 10yyyyxx 10xxxxxx. If you see only 1 or 2 of those bytes at the end, it is a partial one, and it will match this pattern: [\xe0-\xef][\x80-\xbf]{0,1}.

From U+10000 to U+10FFFF, 4 bytes are needed: 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx. If you see only 1 to 3 of those bytes at the end, it is a partial one, and it will match this pattern: [\xf0-\xf7][\x80-\xbf]{0,2}.
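
For instance, a quick check with a split three-byte character (the euro sign U+20AC, \xe2\x82\xac in UTF-8, chosen purely as an example):

>>> re.sub("([\xf0-\xf7][\x80-\xbf]{0,2}|[\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$", "", "\xe2\x82\xac\xe2\x82")
'\xe2\x82\xac'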

Update:

If you only need the Basic Multilingual Plane, you can drop the last pattern. This will do:

re.sub("([\xe0-\xef][\x80-\xbf]{0,1}|[\xc0-\xdf])$","",partial)

Let me know if there is any problem with that regex.

– YOU

For JSON formatting (Unicode escape, e.g. \uabcd), I am using the following algorithm to achieve this:

  • Encode the Unicode string into the backslash-escape format which it would eventually be in the JSON version
  • Truncate 3 bytes more than my final limit
  • Use a regular expression to detect and chop off a partial encoding of a Unicode value

So (in Python 2.5), with some_string and a requirement to cut to around 100 bytes:

import re

# Given some_string is a long string with arbitrary Unicode data.
encoded_string = some_string.encode('unicode_escape')
# Chop off any partial trailing escape: a lone '\', or '\u' plus 0-3 hex digits.
partial_string = re.sub(r'([^\\])\\(u|$)[0-9a-f]{0,3}$', r'\1', encoded_string[:103])
final_string   = partial_string.decode('unicode_escape')

Now final_string is back in Unicode but guaranteed to fit within the JSON packet later. I truncated to 103 because a purely-Unicode message would be 102 bytes encoded.
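
A quick check of the chopping step (using U+0430 purely as a hypothetical payload):

>>> some_string = u'\u0430' * 20
>>> encoded_string = some_string.encode('unicode_escape')   # 120 bytes
>>> partial_string = re.sub(r'([^\\])\\(u|$)[0-9a-f]{0,3}$', r'\1', encoded_string[:103])
>>> len(partial_string)   # 17 complete \u0430 escapes survive
102
>>> partial_string.decode('unicode_escape') == u'\u0430' * 17
True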

Disclaimer: Only tested on the Basic Multilingual Plane. Yeah yeah, I know.

– JasonSmith

Check the bytes at the end of the truncated string: if the cut landed inside a multi-byte UTF-8 character (high bit set), back up over the trailing continuation bytes and drop the split lead byte, so the string ends on a complete character.

mxlen = 255
data = toolong.encode("utf8")[:mxlen]
end = len(data)
while end > 0 and ord(data[end-1]) & 0xc0 == 0x80:
    end -= 1  # back up over continuation bytes (0b10xxxxxx)
if end > 0 and ord(data[end-1]) >= 0xc0:  # the tail starts a multi-byte char
    need = 2 if ord(data[end-1]) < 0xe0 else 3 if ord(data[end-1]) < 0xf0 else 4
    end = len(data) if len(data) - end + 1 >= need else end - 1

truncated_string = data[:end].decode("utf8")
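
A quick sanity check (input chosen arbitrarily): 200 copies of U+0430 encode to 400 bytes, and the 255-byte cut splits the 128th character, so 127 characters survive.

>>> toolong = u'\u0430' * 200
>>> # ... run the snippet above ...
>>> len(truncated_string), len(truncated_string.encode("utf8"))
(127, 254)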
– Peter Silva