How to read unicode file in python

Question

I have a tab-separated file written as follows:

col_name                    cnt
\u7834\u6653\u5fae\u660e     8
\u9ed8\u8ba4                12

I use pandas.read_excel to read it into Python, and it displays the same thing.

How can I read the data and get the following? Thanks!

col_name      cnt
破晓微明        8
默认           12

I am using Python 3.7.7 and pandas 1.0.4.

That's not Unicode, that's a sequence of US-ASCII characters that form escape sequences. `破` is Unicode, this entire page is Unicode. What are the *actual* contents of the file? Where did you see those escape sequences? There may not be *any* problem to solve - perhaps your terminal software and/or Python environment isn't configured to **display** all Unicode characters, so it displays the escape sequences for non US-ASCII characters — Panagiotis Kanavos, Jun 30 '20 at 07:37
@sammywemmy that's not Unicode, that's an escape sequence. Changing the encoding won't fix anything. It can only *mangle* the actual data — Panagiotis Kanavos, Jun 30 '20 at 07:38
@sammywemmy there are a *lot* of Python questions about Unicode, where the only problem is that someone mistook the escape sequences for Unicode itself. — Panagiotis Kanavos, Jun 30 '20 at 07:42
@PanagiotisKanavos Just to make it clear: On file level you will never have Unicode. Unicode is just the concept, the idea behind it all. What you will have are bytes in an encoding like UTF-8 or ISO-8859-1 or ASCII. — Matthias, Jun 30 '20 at 07:51
@Matthias I know - I've been using Unicode since 1995 at least and remember the codepage era. It's impractical to repeat Unicode history in comments though. Even the [Python docs](https://docs.python.org/3/howto/unicode.html#definitions) talk about `Unicode`: `Python’s string type uses the Unicode Standard for representing characters`. This is most likely *not* about the file though. It looks like the file was correctly loaded but the terminal/application displays escape sequences. — Panagiotis Kanavos, Jun 30 '20 at 08:04
I have a .txt file and its content is something like \u7834. If I type u'\u7834' in Python, I am able to print the right content. Please simply tell me how to add the prefix u before '\u7834' in Python. Thanks! — C. Luo, Jun 30 '20 at 08:05
Even if the file actually contains the escape sequences (I doubt it), changing the encoding *won't* change anything (unless it's UTF16/32 which will mangle the text). All the characters are in the 7-bit US-ASCII range so they'll be treated the same in all codepages/encodings — Panagiotis Kanavos, Jun 30 '20 at 08:06
@C.Luo post the *actual* contents in the question. Does it **really really really really** contain the characters `\ `,`u`,`7`,`8`,`3`,`4` ? I *really* doubt someone actually encoded text this way just to make everyone else's life miserable. You *don't* need to use any prefix with loaded data - the `u'...'` notation is just for string *literals*, ie strings typed directly in your program. You don't need escape sequences there either, you could just type the Chinese characters if you wanted. You can see the proof in your own question actually - this page is UTF8. 破晓微明 is just 4 characters — Panagiotis Kanavos, Jun 30 '20 at 08:07
@C.Luo in that case the easiest solution would be to ask whoever did this to produce a real Unicode file using UTF8 encoding, not US-ASCII with escape sequences. If they refuse you can use `.decode('unicode-escape')` while loading the text to convert the escape sequences back to Unicode. — Panagiotis Kanavos, Jun 30 '20 at 08:11

score 0 · Accepted Answer · answered Jun 30 '20 at 08:12

You need to decode the text with an appropriate decoder. For this case we can use unicode-escape. But to decode the text you have to make bytes out of it first.

col_name = r'\u7834\u6653\u5fae\u660e'
print(bytes(col_name, 'ascii').decode('unicode-escape'))

This will give you 破晓微明.
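
One caveat with this approach: `bytes(col_name, 'ascii')` raises `UnicodeEncodeError` if the string already contains non-ASCII characters, and the `unicode-escape` codec also interprets other backslash sequences such as `\n`. A minimal sketch of a narrower alternative follows, assuming only `\uXXXX` sequences need converting. The helper name `decode_u_escapes` is made up for illustration.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(text):
    """Replace each literal \\uXXXX sequence with the character it names.

    Hypothetical helper: everything else in the string, including
    characters that are already non-ASCII, is left untouched.
    """
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(decode_u_escapes(r"\u7834\u6653\u5fae\u660e"))  # 破晓微明
print(decode_u_escapes(r"默认 \u9ed8\u8ba4"))  # mixed input does not raise
```

This sketch does not combine surrogate pairs, so a character outside the Basic Multilingual Plane that was written as two `\uXXXX` escapes would need extra handling.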

I don't think this can be done during the call to pandas.read_excel but I'm no pandas expert. You might have to change the content of the column after reading the file.
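
As a rough illustration of changing the column after reading, here is a sketch that uses only the standard library's `csv` module rather than pandas. It assumes the file really is tab-separated text with a header row and really contains the literal escape sequences; the in-memory `sample` string stands in for the real file, and `unescape` is a made-up helper name.

```python
import csv
import io


def unescape(cell):
    # Same two-step conversion as above: str -> ASCII bytes -> decoded str.
    return cell.encode("ascii").decode("unicode-escape")


# Stand-in for the real file (assumption: tab-separated, one header row).
sample = "\n".join([
    "col_name\tcnt",
    r"\u7834\u6653\u5fae\u660e" + "\t8",
    r"\u9ed8\u8ba4" + "\t12",
]) + "\n"

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")

# Decode the text column and convert the count column to int, row by row.
rows = [
    {"col_name": unescape(row["col_name"]), "cnt": int(row["cnt"])}
    for row in reader
]

for row in rows:
    print(row["col_name"], row["cnt"])
```

With pandas, the same per-cell function could presumably be applied to the loaded column through `Series.map`, for example `df["col_name"].map(unescape)`.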

Matthias


It would be better to tell whoever produced that file to produce a real UTF8 file instead. – Panagiotis Kanavos Jun 30 '20 at 08:14
Right. This fixes the current situation, but the problem starts before. – Matthias Jun 30 '20 at 08:43
