Problem with .decode('utf-8').upper() and special characters (but only inside the string)

Question

I would like to capitalise the letter at a given position in a string. I have a problem with special letters, Polish letters to be specific, for example "ą". Ideally the solution would also work for French, Spanish, etc. (ç, è and so on).

dobry="costąm"
print(dobry[4].decode('utf-8').upper())

I obtain:

  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)

UnicodeDecodeError: 'utf8' codec can't decode byte 0xc4 in position 0: unexpected end of data

while for this:

print("ą".decode('utf-8').upper())

I obtain Ą as desired.

Curiously, it works fine for the letters at positions 0-3, while for:

print(dobry[5].decode('utf-8').upper())

I get the same error.

Possible duplicate of [How can I convert Unicode to uppercase to print it?](https://stackoverflow.com/questions/727507/how-can-i-convert-unicode-to-uppercase-to-print-it) — tripleee, Feb 24 '19 at 11:18

score 3 · Answer 1 · answered Feb 24 '19 at 10:54

The string actually looks like this:

>>> list(dobry)
['c', 'o', 's', 't', '\xc4', '\x85', 'm']

So dobry[4] == '\xc4' and dobry[5] == '\x85': the letter ą is represented by two bytes, and neither byte is valid UTF-8 on its own. To solve this, simply use Python 3, where strings are Unicode text, instead of Python 2.
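
For illustration, here is a minimal Python 3 sketch of the same situation. The variable names `raw` and `capital` are only for this example. In Python 3 a `str` holds code points, so index 4 is the whole letter, while the encoded `bytes` object still contains the two-byte pair:

```python
# Python 3 sketch: str is a sequence of Unicode code points, not bytes.
dobry = "cost\u0105m"          # "costąm", with ą as the single code point U+0105

raw = dobry.encode("utf-8")    # the kind of byte string Python 2 was indexing
print(list(raw))               # [99, 111, 115, 116, 196, 133, 109]
print(len(raw), len(dobry))    # 7 6  (7 bytes versus 6 characters)

# Indexing the text yields the whole letter, so upper() works directly.
capital = dobry[4].upper()
assert capital == "\u0104"     # U+0104 LATIN CAPITAL LETTER A WITH OGONEK
```

The values 196 and 133 are 0xc4 and 0x85, the same two bytes shown in the Python 2 list above.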

ForceBru

+1 for the recommendation to switch to Python 3. There is a lot more that could be said about the proper way to solve this, though, even in Python 3. – tripleee Feb 24 '19 at 11:17

snakecharmerb · Answer 2 · 2019-02-24T12:26:09.393

UTF-8 may use more than one byte to encode a character, so iterating over a bytestring and manipulating individual bytes won't always work. It's better to decode to Python 2's unicode type. Perform your manipulations, then re-encode to UTF-8.

>>> dobry="costąm"
>>> udobry = unicode(dobry, 'utf-8')
>>> changed = udobry[:4] + udobry[4].upper() + udobry[5:]
>>> new_dobry = changed.encode('utf-8')
>>> print new_dobry
costĄm
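
The same decode, manipulate, re-encode pattern can be wrapped in a small helper. This is only a sketch in Python 3 syntax, and `upper_at` is a hypothetical name, not a library function:

```python
def upper_at(data: bytes, index: int, encoding: str = "utf-8") -> bytes:
    """Uppercase the character at `index`, counted in characters, not bytes."""
    text = data.decode(encoding)      # bytes -> text (code points)
    # Slice around the position; upper() may return more than one character.
    changed = text[:index] + text[index].upper() + text[index + 1:]
    return changed.encode(encoding)   # text -> bytes


raw = "cost\u0105m".encode("utf-8")   # UTF-8 bytes of "costąm"
result = upper_at(raw, 4)
assert result.decode("utf-8") == "cost\u0104m"   # "costĄm"
```

Slicing with `text[index + 1:]` rather than a single index keeps the rest of the string intact however long it is, and it still works when `upper()` returns more than one character (for example "ß" becomes "SS").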

As @tripleee commented, non-ASCII characters may not map to a single Unicode code point: "ą" could be the single code point U+0105 LATIN SMALL LETTER A WITH OGONEK, or it could be composed of "a" followed by U+0328 COMBINING OGONEK.

In the two-code-point (decomposed) form, the "a" character can be capitalised on its own, and "A" followed by COMBINING OGONEK renders as "Ą" (though it may look like two separate characters in the Python REPL or the terminal, depending on the terminal settings).

Note that you need to take the extra character into account when indexing.
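
A short Python 3 sketch of what the extra character does to the indices, assuming the input arrives in the two-code-point form (the variable names are only illustrative):

```python
import unicodedata

decomposed = "costa\u0328m"    # "a" followed by U+0328 COMBINING OGONEK
print(len(decomposed))         # 7 code points, although it displays as 6 letters

# The base letter "a" is at index 4 and its combining mark at index 5,
# so the final "m" has moved to index 6.
changed = decomposed[:4] + decomposed[4].upper() + decomposed[5:]
print(ascii(changed))          # 'costA\u0328m'

# After NFC normalisation this equals the precomposed capital letter.
assert unicodedata.normalize("NFC", changed) == "cost\u0104m"
```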

It's also possible to normalise the two-code-point string to the single-code-point (precomposed) form using the tools in the unicodedata module:

>>> import unicodedata
>>> unicodedata.normalize('NFC', u'costa\u0328m') == u"costąm"
True

but this may cause problems if, for example, you are returning the changed string to a system that expects the combining character to be preserved.

There is no guarantee that a `unicode` string uses a single code point for a single glyph, either. In fact, many characters can only be represented by combining sequences. You can try to use Unicode normalization (see also [Wikipedia](https://en.wikipedia.org/wiki/Unicode_equivalence)) to force the ones which can be represented as a single code point to that representation, but the better approach for a number of reasons is to do exactly the opposite. Now `"a"` + [U+0328 COMBINING OGONEK](http://www.fileformat.info/info/unicode/char/0328/) naturally uppercases to `"A"` + combining ogonek. — tripleee, Feb 24 '19 at 11:16
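
One way to read the "do the opposite" suggestion is to decompose with NFD first and then work on base characters together with their combining marks. The following Python 3 sketch does that; `upper_cluster_at` is a hypothetical helper, and grouping by `unicodedata.combining` is a simplification that does not implement full grapheme cluster segmentation:

```python
import unicodedata


def upper_cluster_at(text: str, index: int) -> str:
    """Uppercase the index-th base character, keeping its combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    clusters = []
    for ch in decomposed:
        # A nonzero combining class means a combining character; attach it
        # to the preceding base character if there is one.
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    clusters[index] = clusters[index].upper()
    return "".join(clusters)          # note: the result stays in decomposed form


# Both spellings of "costąm" give the same result for the same letter index.
assert upper_cluster_at("cost\u0105m", 4) == "costA\u0328m"
assert upper_cluster_at("costa\u0328m", 4) == "costA\u0328m"
```

With this approach the index counts visible letters rather than code points, so the caller does not need to know which form the input used. If the receiving system expects the precomposed form, the result can be passed through `unicodedata.normalize('NFC', ...)` afterwards.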

score 1 · Answer 3 · answered Feb 24 '19 at 10:58

What about decoding the whole string first and indexing afterwards:

print(dobry.decode('utf-8')[5].upper())
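
To make the difference in ordering concrete, here is a small Python 3 sketch using a `bytes` literal in place of the Python 2 `str` (the names `raw`, `failed` and `letter` are only for this example). Note that after decoding, "ą" sits at index 4 and "m" at index 5:

```python
raw = b"cost\xc4\x85m"         # UTF-8 bytes of "costąm"

# Index first, decode second: one byte of a two-byte letter cannot be decoded.
try:
    raw[4:5].decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
assert failed

# Decode first, index second: indices now refer to whole characters.
text = raw.decode("utf-8")
letter = text[4].upper()
assert letter == "\u0104"      # "Ą"
assert text[5].upper() == "M"
```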

Benoît P
