Downloading files from Korean websites, often filenames are wrongly encoded/decoded and end up being all jumbled up. I found out that by encoding with 'iso-8859-1' and decoding with 'euc-kr', I can fix this problem. However, I have a new problem where the same-looking character is in fact, different. Check out the Python shell bellow:
>>> first_string = 'â'
>>> second_string = 'â'
>>> len(first_string)
1
>>> len(second_string)
2
>>> list(first_string)
['â']
>>> list(second_string)
['a', '̂']
>>>
Encoding the first string with 'iso-8859-1' is possible. The latter is not. So the question:
- What is the difference between these two strings?
- Why would downloads from the same website have the same character in varying format? (If that's what the difference is.)
- And how can I fix this? (e.g. convert
second_string
to the likeness offirst_string
)
Thank you.