Same character, different length and bytes

Question

When I download files from Korean websites, the filenames are often wrongly encoded/decoded and end up jumbled. I found that encoding with 'iso-8859-1' and then decoding with 'euc-kr' fixes this. However, I now have a new problem: two characters that look the same are in fact different. Check out the Python shell session below:

>>> first_string = 'â'
>>> second_string = 'â'
>>> len(first_string)
1
>>> len(second_string)
2
>>> list(first_string)
['â']
>>> list(second_string)
['a', '̂']
>>>

Encoding the first string with 'iso-8859-1' works. Encoding the second does not. So my questions are:

1. What is the difference between these two strings?
2. Why would downloads from the same website contain the same character in different forms? (If that is what the difference is.)
3. How can I fix this? (e.g. convert second_string into the same form as first_string)

Thank you.

Those are not the same characters: the first one has a tilde. — Willem Van Onsem, Jan 31 '18 at 08:10

tremby · Accepted Answer · 2018-01-31T08:43:59.290


An easy way to find out exactly what a character is is to ask vim. Put the cursor over a character and type ga to get info on it.

The first one is:
```
<â> 226, Hex 00e2, Octal 342
```
And the second:
```
<a>  97,  Hex 61,  Octal 141 < ̂> 770, Hex 0302, Octal 1402
```
1. In other words, the first is a complete "a with circumflex" character, and the second is a regular a followed by a circumflex combining character.
2. Ask the website operators. How would we know?!
3. You need something which turns combining characters into regular characters. A Google search yielded this question, for example.
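
The same inspection can be done without leaving Python. This is only a minimal sketch: `describe` is a made-up helper name, and the two strings are written with escapes so that copy-paste cannot silently change their form.

```python
import unicodedata

first_string = "\u00e2"    # precomposed: one code point
second_string = "a\u0302"  # decomposed: base letter plus combining mark

def describe(text):
    """Return (code point, Unicode name) pairs for every code point in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in text]

print(describe(first_string))
print(describe(second_string))
```

The first call should report LATIN SMALL LETTER A WITH CIRCUMFLEX, and the second LATIN SMALL LETTER A followed by COMBINING CIRCUMFLEX ACCENT, matching what vim shows.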

As you pointed out in your comment, and as clemens pointed out in another answer, in Python you can use unicodedata.normalize with 'NFC' as the form.
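
A rough sketch of how that could slot into the filename repair described in the question. `repair_filename` is a hypothetical helper, and the sample filename is constructed here for illustration rather than taken from a real download.

```python
import unicodedata

def repair_filename(garbled):
    """Hypothetical helper: undo EUC-KR bytes that were misread as ISO-8859-1."""
    # Compose first, so 'a' + U+0302 becomes U+00E2, which ISO-8859-1 can encode.
    composed = unicodedata.normalize("NFC", garbled)
    return composed.encode("iso-8859-1").decode("euc-kr")

# Made-up sample: a Korean name misdecoded as ISO-8859-1, then decomposed
# (NFD) to imitate the two-code-point form seen in the question.
garbled = "한글.txt".encode("euc-kr").decode("iso-8859-1")
decomposed = unicodedata.normalize("NFD", garbled)

try:
    decomposed.encode("iso-8859-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False  # the combining marks have no ISO-8859-1 byte

print(encodable)                    # False
print(repair_filename(decomposed))  # 한글.txt
```

Normalizing before encoding is harmless for strings that are already composed, so the helper can be applied to every filename regardless of which form it arrived in.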

edited Jan 31 '18 at 08:43

answered Jan 31 '18 at 08:13


I used Python's [unicodedata.normalize](https://docs.python.org/3.1/library/unicodedata.html#unicodedata.normalize) with 'NFC' as the form to do the normalization and the encoding goes smoothly. – Syphon Jan 31 '18 at 08:41
Cool. I'll add it to the answer. – tremby Jan 31 '18 at 08:42

clemens · Answer 2 · 2018-01-31T08:26:53.747

Unicode has more than one representation for letters with accents and diaereses. 'â' can be the single precomposed code point U+00E2, or the letter 'a' followed by COMBINING CIRCUMFLEX ACCENT (U+0302), which is written u'a\u0302' in Python 2.7. The second form consists of two code points: the 'a' and the circumflex.
A possible reason for the different representations is that the creator of the website copied the text from different sources. For example, PDF documents often represent umlauts and accent marks with two code points (base letter plus combining mark), while typing these characters on a keyboard generally produces the single-character form.
You can use unicodedata.normalize to convert the combining sequences into single characters, e.g.:
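```
# Written for Python 2.7; the __future__ import lets the same lines run on Python 3.
from __future__ import print_function
from unicodedata import normalize

s = u'a\u0302'  # 'a' followed by COMBINING CIRCUMFLEX ACCENT
print(s, len(s), len(normalize("NFC", s)))
```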

will output â 2 1.
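
One caveat worth keeping in mind, shown here as a small sketch: NFC can only merge a sequence when Unicode defines a precomposed code point for it. The 'q' example is chosen for illustration because no precomposed "q with circumflex" exists.

```python
from unicodedata import normalize

composable = "a\u0302"    # has a precomposed equivalent, U+00E2
uncomposable = "q\u0302"  # no single code point for q with circumflex

composed_len = len(normalize("NFC", composable))
uncomposed_len = len(normalize("NFC", uncomposable))
print(composed_len, uncomposed_len)  # 1 2

# NFD is the reverse direction: it splits U+00E2 back into 'a' + U+0302.
round_trip = normalize("NFD", "\u00e2") == composable
print(round_trip)  # True
```

A sequence that stays decomposed after NFC still cannot be encoded as 'iso-8859-1', so the encode step may need its own error handling for such input.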
