
I have a text file whose contents should have been interpreted as UTF-8 but weren't (the file was given to me this way). Here is an example of a typical line of the file:

\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f

which should have been:

ロンドン在住

Now, I can do it manually in Python by typing the following at the interactive prompt:

>>> h1 = u'\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'    
>>> print h1
ロンドン在住

which gives me what I want. Is there a way that I can do this automatically? I've tried things like this:

>>> import codecs
>>> f = codecs.open('testfile.txt', encoding='utf-8')
>>> h = f.next()
>>> print h
\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f

I've also tried the 'encode' and 'decode' functions, without success. Any ideas?

Thanks!

  • 1
    There is no such thing as plain text, and there's really no such thing as UTF-8 text, either. Text is an abstraction. UTF-8 is an encoding of characters into bytes. Also, if the file actually contains backslashes, it's completely different from putting backslashes between double-quotes in a Python source file. That's a completely separate encoding step. If you want ロ in your file, then put ロ in your file. – Karl Knechtel Jun 18 '12 at 17:28

1 Answer


\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f is not UTF-8; it's using the Python unicode escape format. Use the unicode_escape codec instead:

>>> print '\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'.decode('unicode_escape')
ロンドン在住

Here is the UTF-8 encoding of the above phrase, for comparison:

>>> '\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'.decode('unicode_escape').encode('utf-8')
'\xe3\x83\xad\xe3\x83\xb3\xe3\x83\x89\xe3\x83\xb3\xe5\x9c\xa8\xe4\xbd\x8f'

Note that unicode_escape treats any byte that is not part of a recognised Python escape sequence as Latin-1.
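To apply the same decoding to a whole file rather than a single literal, something along these lines might work. It is only a sketch: read_escaped_lines is a hypothetical helper name, and it assumes every line of the file is ASCII text containing literal backslash-u escapes, as in the question. The file is opened in binary mode so that the same code should run on both Python 2 and Python 3.

```python
def read_escaped_lines(path):
    """Yield each line of the file with its backslash-u escapes decoded to text."""
    # Binary mode gives raw bytes, which is what the unicode_escape codec expects.
    with open(path, 'rb') as f:
        for raw in f:
            # Drop the line ending, then turn the escape sequences into characters.
            # Bytes outside any escape sequence are interpreted as Latin-1.
            yield raw.rstrip(b'\r\n').decode('unicode_escape')
```

Calling list(read_escaped_lines('testfile.txt')) on the file from the question should then give a list of unicode strings, one per line. If the file also contains raw UTF-8 bytes mixed in with the escapes, this approach would garble them because of the Latin-1 fallback.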

Be careful, however: you may really be looking at JSON-encoded data, which uses the same notation for character escapes. Use json.loads() to decode actual JSON data; JSON strings with such escapes are delimited with " quotes and are usually part of larger structures (such as JSON lists or objects).
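For comparison, here is a minimal sketch of the JSON case. The sample documents and the "location" key are made up for illustration; the point is only that json.loads() decodes the same escapes when they appear inside a quoted JSON string.

```python
import json

# A bare JSON string: note the double quotes that are part of the data itself.
# Raw string literals keep the backslashes exactly as they would appear in a file.
text = json.loads(r'"\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f"')

# More typically the string sits inside a larger structure, such as an object.
# The "location" key is a hypothetical example.
record = json.loads(r'{"location": "\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f"}')
location = record["location"]
```

Both text and location end up holding the decoded phrase. Passing the unquoted line from the question to json.loads() would raise an error instead, which is one way to tell the two formats apart.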

Martijn Pieters