4
with open('result.txt', 'r') as f:
data = f.read()

print 'What type is my data:'
print type(data)

for i in data:
    print "what is i:"
    print i
    print "what type is i"
    print type(i)


    print i.encode('utf-8')

I have a file containing a string, and I am trying to read the file, split the words by space, and save them into a list. My code is shown above.

Here is my error message (originally posted as a screenshot, not reproduced here).

Someone please help!

Update:

I am going to describe what I am trying to do in detail here, to give people more context. The goal is:

1. Take a Chinese text and break it down into sentences by detecting the basic ending punctuation marks.
2. Take each sentence and use the tool jieba to tokenize the characters into meaningful words. For instance, the two Chinese characters 學 and 生 will be grouped together to produce the token '學生' (meaning "student").
3. Save all the tokens from each sentence into a list. The final list will therefore contain multiple lists, one per sentence in the paragraph.

# coding: utf-8

import jieba

cutlist = "。!?".decode('utf-8')
test = "【明報專訊】「吉野家」and Peter from US因被誤傳採用日本福島米而要報警澄清,並自爆用內地黑龍江米,日本料理食材來源惹關注。本報以顧客身分向6間日式食店查詢白米產地,其中出售逾200元日式豬扒飯套餐的「勝博殿日式炸豬排」也選用中國大連米,誤以為該店用日本米的食客稱「要諗吓會否再幫襯」,亦有食客稱「好食就得」;壽司店「板長」店員稱採用香港米,公關其後澄清來源地是澳洲,即與平價壽司店「爭鮮」一樣。有飲食界人士稱,雖然日本米較貴、品質較佳,但內地米品質亦有保證。"

#FindToken checks whether the character is an ending punctuation mark
def FindToken(cutlist, char):
    if char in cutlist:
        return True
    else:
        return False

'''
cut checks each item in a string list: if the item is not an ending punctuation mark, it is appended to a temporary list called line. When an ending punctuation mark is encountered, the complete sentence collected in line is appended to the list l. '''

def cut(cutlist,test):
    l = []
    line = []
    final = []

    for i in test:
        if i == ' ':
            line.append(i)

        elif FindToken(cutlist,i):
            line.append(i)
            l.append(''.join(line))
            line = []
        else:
            line.append(i)

    temp = []
    #This part iterates over each complete sentence and groups its characters according to context.
    for i in l:
        #jieba is the function that breaks a sentence of characters down and groups them into phrases
        process = list(jieba.cut(i, cut_all=False))

        #Put all the tokenized character phrases of a sentence into a list. Each sentence
        #belongs to one list.
        for j in process:
            temp.append(j.encode('utf-8')) 
            #temp.append(j) 
        print temp 

        final.append(temp)
        temp = [] 
    return final 


cut(list(cutlist),list(test.decode('utf-8')))

Here is my problem: when I output my final list, it gives me the following result:

[u'\u3010', u'\u660e\u5831', u'\u5c08\u8a0a', u'\u3011', u'\u300c', u'\u5409\u91ce\u5bb6', u'\u300d', u'and', u' ', u'Peter', u' ', u'from', u' ', u'US', u'\u56e0', u'\u88ab', u'\u8aa4\u50b3', u'\u63a1\u7528', u'\u65e5\u672c', u'\u798f\u5cf6', u'\u7c73', u'\u800c', u'\u8981', u'\u5831\u8b66', u'\u6f84\u6e05', u'\uff0c', u'\u4e26', u'\u81ea\u7206', u'\u7528\u5167', u'\u5730', u'\u9ed1\u9f8d', u'\u6c5f\u7c73', u'\uff0c', u'\u65e5\u672c\u6599\u7406', u'\u98df\u6750', u'\u4f86\u6e90', u'\u60f9', u'\u95dc\u6ce8', u'\u3002']

How can I turn a list of unicode into normal string?

YAL
  • unrelated: include the error message as text instead of the picture. It may help other people with the same error, to find the question – jfs Oct 23 '15 at 17:44
  • Limit your questions to a *single* issue per question e.g., you should ask *«how to convert `[u'\u3010', u'\u660e\u5831']` to a "normal" string»* as a separate question. [I've already provided a hint for the solution](http://stackoverflow.com/questions/33294213/how-to-decode-unicode-in-a-chinese-text/33294804#comment54415565_33306456) -- learn the difference between `print [u'\u3010', u'\u660e\u5831']` and `print " ".join([u'\u3010', u'\u660e\u5831'])` – jfs Oct 24 '15 at 11:05
  • Here's a very good tutorial which eliminates all your confusion: http://www.pgbovine.net/unicode-python.htm – MartianMartian Feb 04 '17 at 06:05

4 Answers

7

Let me give you some hints:

  • You'll need to decode the bytes you read from UTF-8 into Unicode before you try to iterate over the words.
  • When you read a file, you won't get Unicode back. You'll just get plain bytes. (I think you knew that, since you're already using decode().)
  • There is a standard function to "split by space" called split().
  • When you say for i in data, you're saying you want to iterate over every byte of the file you just read. Each iteration of your loop will be a single byte. I'm not sure that's what you want, because that would mean you'd have to do UTF-8 decoding by hand (rather than using decode(), which must operate on the entire UTF-8 string).

In other words, here's one line of code that would do it:

open('file.txt').read().decode('utf-8').split()

If this is homework, please don't turn that in. Your teacher will be onto you. ;-)


Edit: Here's an example of how to encode and decode Unicode characters in Python:

>>> data = u"わかりません"
>>> data
u'\u308f\u304b\u308a\u307e\u305b\u3093'
>>> data_you_would_see_in_a_file = data.encode('utf-8')
>>> data_you_would_see_in_a_file
'\xe3\x82\x8f\xe3\x81\x8b\xe3\x82\x8a\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93'
>>> for each_unicode_character in data_you_would_see_in_a_file.decode('utf-8'):
...     print each_unicode_character
... 
わ
か
り
ま
せ
ん

The first thing to note is that Python (well, at least Python 2) uses the u"" notation (note the u prefix) on string constants to show that they are Unicode. In Python 3, strings are Unicode by default, but you can use b"" if you want a byte string.

As you can see, the Unicode string is composed of two-byte characters. When you read the file, you get a string of one-byte characters (which is equivalent to what you get when you call .encode()). So if you have bytes from a file, you must call .decode() to convert them back into Unicode. Then you can iterate over each character.

Splitting "by space" is something unique to every language, since many languages (for example, Chinese and Japanese) do not uses the ' ' character, like most European languages would. I don't know how to do that in Python off the top of my head, but I'm sure there is a way.

mpontillo
  • Hi Mike, thanks for the answer! My last problem is that, after reading each characters, I tried to save them into a list. However, when I print the list, the result are all in unicode. Is there anyway I can turn that back into a readable list of characters? (This is for my work and not for school, haha. I am very new to do SWE stuff, so I really appreciate your help!) – YAL Oct 23 '15 at 17:14
  • 1
    Those characters saved into a list turned into stuff like \xe4\xba\xa4 – YAL Oct 23 '15 at 17:21
  • @YAL, if you have evolved your code, can you update the question with the new code? (and paste in the terminal output you see when you test it out, rather than adding the picture - as someone else suggested.) I'll update my answer to try to give an example about how this works. – mpontillo Oct 23 '15 at 20:20
  • @YAL, actually, I also noticed you're using `encode()` when you should be using `decode()`. (and, as I said, you should be using it on the entire text from the file, not each individual character. decoding UTF-8 into 2-byte Unicode strings cannot be done with individual bytes.) – mpontillo Oct 23 '15 at 20:29
  • Thanks for the reply! So I edited my question with more details above. My problem right now is that I don't quite know how to convert a list of unicode into a list of string. Thank you for reminding me that I cannot encode individual items; I would like to know how I could do it otherwise. Thanks again!! :D – YAL Oct 23 '15 at 20:45
  • @YAL well, you could do something like `' '.join(items)`, which will create a space-separated list of all your items. If you want a list of UTF-8 byte strings, you could write something like`[s.encode('utf-8') for s in items]` (which is called a list comprehension). This all assumes your list is in a variable called `items`. – mpontillo Oct 23 '15 at 21:00
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/93216/discussion-between-yal-and-mike). – YAL Oct 23 '15 at 21:05
  • So I have tried both of the method and the first method returned me with the same result. When ' '.join(items) is added into a list, the final list is still not in string. The second method would return me something like this ['\xe3\x80\x90', '\xe6\x98\x8e\xe5\xa0\xb1', '\xe5\xb0\x88\xe8\xa8\x8a' – YAL Oct 23 '15 at 21:09
3

When you call encode on a str with most (all?) codecs (for which encode really makes no sense; str is a byte oriented type, not a true text type like unicode that would require encoding), Python is implicitly decodeing it as ASCII first, then encoding with your specified encoding. If you want the str to be interpreted as something other than ASCII, you need to decode from bytes-like str to true text unicode yourself.

When you do i.encode('utf-8') when i is a str, you're implicitly saying i is logically text (represented by bytes in the locale default encoding), not binary data. So in order to encode it, Python first needs to decode it to determine what the "logical" text is. Your input is probably encoded in some ASCII superset (e.g. latin-1, or even utf-8) and contains non-ASCII bytes; Python tries to decode them using the ascii codec (to figure out the true Unicode ordinals it needs to encode as utf-8), and fails.

You need to do one of:

  1. Explicitly decode the str you read using the correct codec (to get a unicode object), then encode that back to utf-8.
  2. Let Python do the work from #1 for you implicitly. Instead of using open, import io and use io.open (Python 2.7+ only; on Python 3+, io.open and open are the same function), which gets you an open that works like Python 3's open. You can pass this open an encoding argument (e.g. io.open('/path/to/file', 'r', encoding='latin-1')) and reading from the resulting file object will get you already decode-ed unicode objects (which can then be encode-ed to whatever encoding you like).
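As a minimal sketch of option 1 (the sample bytes are an assumption; they are the UTF-8 encoding of 學生, the "student" example from the question):

```python
# Assumed sample data: the UTF-8 bytes for u'學生' ("student").
raw = b'\xe5\xad\xb8\xe7\x94\x9f'
# Step 1: decode the byte string with the correct codec to get unicode text.
text = raw.decode('utf-8')
# Step 2: encode that text back to utf-8 (or any other codec) as needed.
utf8_again = text.encode('utf-8')
assert text == u'\u5b78\u751f'
assert utf8_again == raw
```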

Note: #1 will not work if the real encoding is something like utf-8 and you defer the work until you're iterating character by character. For non-ASCII characters, utf-8 is multibyte, so if you only have one byte, you can't decode (because the following bytes are needed to calculate a single ordinal). This is a reason to prefer using io.open to read as unicode natively so you're not worrying about stuff like this.
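To see why deferring the decode until you iterate byte by byte cannot work, here is a small sketch (the sample character is an assumption):

```python
# u'\u5b78' (學) takes three bytes in UTF-8.
data = u'\u5b78'.encode('utf-8')
assert data == b'\xe5\xad\xb8'
# Decoding just the first byte fails, because the remaining bytes of the
# multibyte sequence are needed to compute the single ordinal.
failed = False
try:
    data[:1].decode('utf-8')
except UnicodeDecodeError:
    failed = True
assert failed
```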

ShadowRanger
  • Ah, found the actual implementation. [`utf_8_encode` forces conversion to `unicode`](https://hg.python.org/cpython/file/2.7/Modules/_codecsmodule.c#l685) using [`PyUnicode_FromObject`](https://docs.python.org/2/c-api/unicode.html#c.PyUnicode_FromObject), which is equivalent in Py2 to calling `decode` with `NULL`, which apparently always means `ascii` (scroll down on the second link for that), not the locale default. Fun. Solution #2 and #3 remain workable, but #1 does not, because locale is ignored. – ShadowRanger Oct 23 '15 at 21:13
  • Removed #1 and renumbered 2 and 3 to be 1 and 2, with corrected explanation. – ShadowRanger Oct 23 '15 at 21:19
  • Python 2 uses `sys.getdefaultencoding()` to convert a bytestring (`str`) to unicode implicitly (as it has to do while calling `.encode('utf-8')`. It is ASCII by default but some misguided environments (e.g., an IDE) may configure it to something else. – jfs Oct 23 '15 at 21:58
2

data is a bytestring (str type on Python 2). Your loop looks at one byte at a time (non-ascii characters may be represented using more than one byte in utf-8).

Don't call .encode() on bytes:

$ python2
>>> '\xe3'.encode('utf-8') #XXX don't do it
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)

I am trying to read the file and split the words by space and save them into a list.

To work with Unicode text, use the unicode type in Python 2. You can use io.open() to read Unicode text from a file (here's code that collects all space-separated words into a list):

#!/usr/bin/env python
import io

with io.open('result.txt', encoding='utf-8') as file:
    words = [word for line in file for word in line.split()]
print "\n".join(words)
jfs
  • Hello, thank you for the answer! My last problem here is that, after I save the words into the list, those words become unicode instead of string. How could I turn that back into string again? Thank you so much! – YAL Oct 23 '15 at 17:11
  • Those characters saved into a list turned into stuff like \xe4\xba\xa4 – YAL Oct 23 '15 at 17:21
  • @YAL: Unicode strings are strings. You don't need to convert them e.g., to print the words (one per line): `print "\n".join(words)`. If it fails in your environment, read e.g., [this](http://stackoverflow.com/a/29577565/4279), [this](http://stackoverflow.com/a/22552581/4279) and [this](http://stackoverflow.com/a/33060935/4279). If something is unclear; ask a separate question. – jfs Oct 23 '15 at 17:23
  • Hi Sebastian, thanks for the reply. What I am trying to achieve here is to save each individual Chinese phrase token into a list. I am able to print words with the command you suggested, but I am not sure how to print or save the individual output into a list as readable strings. – YAL Oct 23 '15 at 17:29
  • @YAL: if you see `"\xe4\xba\xa4"` instead of `u'\u4ea4'` (`交`) then it means that the variable is not Unicode (you should drop `.encode()`, `str()`, etc from your code). – jfs Oct 23 '15 at 17:29
  • I actually do see u'\u4ea4' but I would like to save it as 交. >_< Could you give me some guidance? Sorry I just started working recently and I realized I am very inexperienced in this. – YAL Oct 23 '15 at 17:32
  • `u'\u4ea4'` and `u'交'` is the same string. Do you understand the difference between `print words` and `print " ".join(words)`? (`repr()` is called on the items in the former case, see [why does Python console display u'\u20ac' instead of €](http://stackoverflow.com/a/32748657/4279)) [To save Unicode to a file, use `io.open()`](http://stackoverflow.com/a/31168730/4279) – jfs Oct 23 '15 at 17:40
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/93198/discussion-between-yal-and-j-f-sebastian). – YAL Oct 23 '15 at 17:59
2

Encoding:

$ python
Python 3.7.4 (default, Aug 13 2019, 15:17:50)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import base64
>>> base64.b64encode("我们尊重原创。".encode('utf-8'))
b'5oiR5Lus5bCK6YeN5Y6f5Yib44CC'

Decoding:

>>> import base64
>>> str='5oiR5Lus5bCK6YeN5Y6f5Yib44CC'
>>> base64.b64decode(str)
b'\xe6\x88\x91\xe4\xbb\xac\xe5\xb0\x8a\xe9\x87\x8d\xe5\x8e\x9f\xe5\x88\x9b\xe3\x80\x82'
>>> base64.b64decode(str).decode('utf-8')
'我们尊重原创。'
>>>
Nianliang