Python 2 maketrans() function doesn't work with Unicode: "the arguments are different lengths" when they actually are

Question

[Python 2] SUB = string.maketrans("0123456789","₀₁₂₃₄₅₆₇₈₉")

This code produces the error:

ValueError: maketrans arguments must have same length

I am unsure why this occurs, because the strings appear to be the same length. My only idea is that the subscript characters somehow have a different length than standard-size characters, but I don't know how to get around this.

Works fine in Python 3 (which does have much nicer unicode support anyway), is that an option for you? — Stefan Pochmann, May 07 '15 at 18:51
currently I'm running python 2.7 but I will be sure to take a look at Python 3 — Aaron, May 07 '15 at 19:40
That Python 3 code is from @ZeroPiraeus' neat answer to ["Printing subscript in python"](https://stackoverflow.com/questions/24391892/printing-subscript-in-python/24392215#24392215) — smci, Sep 10 '18 at 05:58

Martijn Pieters · Accepted Answer · 2015-05-07T18:28:01.487

12

No, the arguments are not the same length:

>>> len("0123456789")
10
>>> len("₀₁₂₃₄₅₆₇₈₉")
30

You are passing in encoded byte strings, not Unicode text. My terminal uses UTF-8, where each subscript digit is encoded as 3 bytes, so the second argument is 30 bytes long.
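As a quick check of that claim, here is a minimal sketch that counts the same ten subscript digits once as code points and once as UTF-8 bytes. It uses `\u` escapes instead of literal characters so the result does not depend on the terminal or source-file encoding; the variable names are only illustrative.

```python
# The ten subscript digits U+2080..U+2089, written as escapes so the
# source-file encoding does not matter.
subscripts = u"\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089"

# Length of the Unicode string: one unit per code point.
code_points = len(subscripts)

# Length of the UTF-8 encoded form: each of these characters takes 3 bytes.
utf8_bytes = len(subscripts.encode("utf-8"))

print(code_points)
print(utf8_bytes)
```

A Python 2 byte-string literal typed in a UTF-8 terminal holds the encoded form, which is why `len()` reports the byte count there.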

You cannot use str.translate() to map ASCII bytes to UTF-8 byte sequences. Decode your string to unicode and use the slightly different unicode.translate() method; it takes a dictionary instead:

nummap = {ord(c): ord(t) for c, t in zip(u"0123456789", u"₀₁₂₃₄₅₆₇₈₉")}

This creates a dictionary mapping Unicode codepoints (integers), which you can then use on a Unicode string:

>>> nummap = {ord(c): ord(t) for c, t in zip(u"0123456789", u"₀₁₂₃₄₅₆₇₈₉")}
>>> u'99 bottles of beer on the wall'.translate(nummap)
u'\u2089\u2089 bottles of beer on the wall'
>>> print u'99 bottles of beer on the wall'.translate(nummap)
₉₉ bottles of beer on the wall

You can then encode the output to UTF-8 again if you so wish.
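A sketch of that full round trip, assuming your input arrives as UTF-8 encoded bytes (the names `raw`, `text` and `encoded` are made up for illustration): decode first, translate the Unicode string, then encode the result.

```python
# Map each ASCII digit's code point to the matching subscript code point.
subscripts = u"\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089"
nummap = {ord(c): ord(t) for c, t in zip(u"0123456789", subscripts)}

# Assumption: the input is a UTF-8 encoded byte string.
raw = b"99 bottles of beer on the wall"

text = raw.decode("utf-8")          # bytes -> Unicode
result = text.translate(nummap)     # translate on the Unicode string
encoded = result.encode("utf-8")    # Unicode -> bytes again

# The two translated digits grew from 1 byte each to 3 bytes each.
print(len(raw), len(encoded))
```

Keeping the data as Unicode for as long as possible and encoding only at the output boundary avoids this kind of length mismatch.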

From the method documentation:

For Unicode objects, the translate() method does not accept the optional deletechars argument. Instead, it returns a copy of the s where all characters have been mapped through the given translation table which must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None. Unmapped characters are left untouched. Characters mapped to None are deleted.
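To illustrate the three kinds of mapping values the documentation describes, here is a small sketch with a made-up table: one entry maps to an ordinal, one to a Unicode string, and one to `None`. The table contents are purely an example, not something from the question.

```python
table = {
    ord(u"H"): ord(u"h"),   # ordinal -> ordinal: replace with another character
    ord(u"2"): u"\u2082",   # ordinal -> Unicode string: replace with that string
    ord(u" "): None,        # ordinal -> None: delete the character
}

# "O" has no entry in the table, so it is left untouched.
out = u"H 2 O".translate(table)
print(repr(out))
```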

edited May 07 '15 at 18:28

answered May 07 '15 at 18:25

Martijn Pieters


Is there any other way to get subscript characters in Python? Or even a way to overcome this length difference? – Aaron May 07 '15 at 18:27

Aaron: this is not a limitation of Python; it follows from the differences between ASCII and Unicode. There are no "subscript characters" in ASCII. The implication of using *Unicode* characters is that Python cannot treat them as if they were ASCII; any attempt to do so may work in some cases but will break in others. – Jim Dennis May 07 '15 at 18:51
@Martijn Where did you get 30? I get either 10 or "Unsupported characters in input", depending on where I try it. – Stefan Pochmann May 07 '15 at 18:55
@StefanPochmann: using the interactive interpreter in a terminal configured for UTF-8 use. – Martijn Pieters May 07 '15 at 19:48
Only in Python 2. The length is 30 in Python 2 and 10 in Python 3. OP's code works fine in Python 3. – smci Sep 10 '18 at 05:55
@smci exactly; you’ll only see this specific error in Python 2 because these are byte strings. That's why the question is tagged with the [tag:python-2.x] tag. – Martijn Pieters Sep 10 '18 at 08:34
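
For comparison, here is a sketch of the Python 3 behavior mentioned in the comments above. It is Python 3 only: there, string literals are Unicode, so both arguments have length 10 and `str.maketrans()` accepts them. The sample string `"H2O"` is just an illustration.

```python
# Python 3 only: str literals are Unicode, so both arguments are 10 characters long.
SUB = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")

# str.maketrans returns a dict of code point -> code point,
# the same shape as the hand-built nummap in the answer.
formula = "H2O".translate(SUB)
print(formula)
```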
