
When I download files from Korean websites, the filenames are often wrongly encoded/decoded and end up jumbled. I found that encoding with 'iso-8859-1' and then decoding with 'euc-kr' fixes this. However, I have a new problem: two characters that look the same are in fact different. Check out the Python shell session below:

>>> first_string = 'â'
>>> second_string = 'â'
>>> len(first_string)
1
>>> len(second_string)
2
>>> list(first_string)
['â']
>>> list(second_string)
['a', '̂']
>>>

The first string can be encoded with 'iso-8859-1'; the second cannot. So my questions are:

  1. What is the difference between these two strings?
  2. Why would downloads from the same website have the same character in varying format? (If that's what the difference is.)
  3. And how can I fix this? (e.g. convert second_string to the likeness of first_string)

Thank you.

clemens
Syphon

2 Answers

  1. An easy way to find out exactly what a character is is to ask vim. Put the cursor over a character and type ga to get info on it.

    The first one is:

    <â> 226, Hex 00e2, Octal 342
    

    And the second:

    <a>  97,  Hex 61,  Octal 141 < ̂> 770, Hex 0302, Octal 1402
    

    In other words, the first is a complete "a with circumflex" character, and the second is a regular a followed by a circumflex combining character.

  2. Ask the website operators. How would we know?!

  3. You need something which turns combining characters into regular characters. A Google search yielded this question, for example.

    As you pointed out in your comment, and as clemens pointed out in another answer, in Python you can use unicodedata.normalize with 'NFC' as the form.
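If vim is not at hand, the same inspection can be done from Python 3 itself. The following is only a minimal sketch: the helper name `describe` is made up for illustration, and the two strings are written with escapes so that the difference survives copy and paste.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each code point in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


first_string = "\u00e2"    # precomposed "a with circumflex"
second_string = "a\u0302"  # plain 'a' followed by a combining circumflex

for label, text in (("first", first_string), ("second", second_string)):
    print(label, describe(text))

# NFC composes the base letter and the combining mark into one code point.
composed = unicodedata.normalize("NFC", second_string)
print(composed == first_string, len(composed))
```

The names reported for the second string show that it is really two code points, and after NFC normalization it compares equal to the first string.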

tremby
  • I used Python's [unicodedata.normalize](https://docs.python.org/3.1/library/unicodedata.html#unicodedata.normalize) with 'NFC' as the form to do the normalization, and the encoding now goes smoothly. – Syphon Jan 31 '18 at 08:41
  • Cool. I'll add it to the answer. – tremby Jan 31 '18 at 08:42
  1. Unicode has different representations for characters with accents and diaereses. One is the single precomposed character at code point U+00E2. The other is the letter 'a' followed by COMBINING CIRCUMFLEX ACCENT (U+0302), which you can create with u'a\u0302' in Python 2.7. That form consists of two characters: the 'a' and the circumflex.

  2. A possible reason for the different representations is that the creator of the website copied the text from different sources. For example, PDF documents often represent umlauts and accented letters as two characters (a base letter plus a combining mark), while typing these characters on a keyboard generally produces the single-character representation.

  3. You can use unicodedata.normalize to convert the combining characters into single characters, e.g.:

    from unicodedata import normalize
    
    s = u'a\u0302'
    print s, len(s), len(normalize("NFC", s))
    

will output â 2 1.
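The snippet above is written for Python 2.7. For Python 3, which the shell session in the question appears to use, a rough sketch of the whole filename repair might look like the following. The function name `repair_filename` and the sample text '한글' are assumptions made for illustration, not anything taken from the website in question.

```python
from unicodedata import normalize


def repair_filename(garbled):
    """Undo EUC-KR bytes that were wrongly decoded as ISO-8859-1.

    NFC normalization runs first so that base letter + combining mark
    pairs become single code points that ISO-8859-1 can encode.
    """
    composed = normalize("NFC", garbled)
    return composed.encode("iso-8859-1").decode("euc-kr")


# Simulate the problem: Korean text whose EUC-KR bytes were read as ISO-8859-1.
original = "한글"
mojibake = original.encode("euc-kr").decode("iso-8859-1")

# Simulate a source that delivers the same text in decomposed (NFD) form.
decomposed = normalize("NFD", mojibake)

try:
    decomposed.encode("iso-8859-1")
except UnicodeEncodeError:
    print("decomposed form cannot be encoded as iso-8859-1")

print(repair_filename(mojibake) == original)
print(repair_filename(decomposed) == original)
```

Normalizing before encoding means the same function handles both forms, so there is no need to detect which representation a given download used.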

clemens