How do I split a multilingual line in Python and get the Unicode hex value?

Question

I am trying to split lines like this one in Python:

aiburenshi 爱不忍释 "לא מסוגל להינתק, לא יכול להיפרד מדבר מרוב חיבתו אליו"

This line contains Hebrew, simplified Chinese and English.

For example, I would like to get a tuple T = (Hebrew string, English string, Chinese string).

The problem is that I can't figure out how to get the Unicode value of the Chinese or the Hebrew letters. Neither of these lines works:

print ((unicode("释","utf-8")).encode("utf-8"))
print ((unicode("א","utf-8")).encode("utf-8"))

And I get this error:

SyntaxError: Non-ASCII character '\xe9' in file split_or.py on line 9, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

It might be worth indicating the version of Python you are using (2.x or 3.x) either in the question, tags, or both. — Samuel Harmer, Jan 06 '12 at 09:44
Did you declare any encoding at the beginning of your file, such as #coding:utf-8? — Thomas Orozco, Jan 06 '12 at 10:46
The problem you state is a very clear error that even contains the link to the text that tells you how to solve it. Why didn't you read the link? As a result, this is a duplicate of [working with utf-8 encoding in python source](http://stackoverflow.com/questions/6289474/working-with-utf-8-encoding-in-python-source) — Lennart Regebro, May 04 '13 at 16:25
The error includes the text "see http://www.python.org/peps/pep-0263.html for details". That link tells you exactly how to fix the error you got. — Lennart Regebro, May 04 '13 at 19:44

score 2 · Answer 1 · answered Jan 06 '12 at 09:07

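In Python 2, Unicode string constants need to be prefixed with the "u" character. A u"..." literal is already a unicode object, so it should not be passed through unicode() again (that raises a TypeError). The source file also needs an encoding declaration, which is what the SyntaxError is asking for:

# -*- coding: utf-8 -*-
print (u"释".encode("utf-8"))
print (u"א".encode("utf-8"))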

In Python 3, string constants are Unicode by default.
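
As a small sketch of the Python 3 case (assuming Python 3 and a source file saved as UTF-8; the variable names are only illustrative), a plain string literal is already Unicode, so the code point and its UTF-8 bytes can be read directly:

```python
# Python 3: str literals are Unicode, so no u"" prefix or unicode() call is needed.
ch_chinese = "释"
ch_hebrew = "א"

# ord() gives the code point as an int; format() renders it as hex.
chinese_hex = format(ord(ch_chinese), "04X")
hebrew_hex = format(ord(ch_hebrew), "04X")

# encode() gives the UTF-8 byte sequence for the same character.
chinese_utf8 = ch_chinese.encode("utf-8")

print("U+" + chinese_hex, "U+" + hebrew_hex, chinese_utf8)
```

The first UTF-8 byte of the Chinese character is 0xE9, which is the '\xe9' that the SyntaxError in the question complains about.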

Avi


Giacomo Lacava · Accepted Answer · 2012-01-06T13:07:59.750

In Python 2, you need to open the file specifying an encoding like this:

import codecs
f = codecs.open("myfile.txt", "r", encoding="utf-8")

In Python 3, you can just pass the encoding argument to the built-in open().

This guarantees that the file is decoded correctly. Note that it does not mean your print calls will work properly; that depends on many things, such as the encoding of your terminal (see for example http://www.pycs.net/users/0000323/stories/14.html, and that's just a start). It's better either to use a proper debugger or to write the output to a file (again opened with codecs.open()).
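
A minimal Python 3 sketch of that round trip, assuming a throwaway temporary file stands in for "myfile.txt" (the shortened sample line and the paths are made up for illustration):

```python
import os
import tempfile

sample = 'aiburenshi 爱不忍释 "לא מסוגל להינתק"\n'

# Write a throwaway UTF-8 file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Reading with an explicit encoding yields decoded str objects.
with open(path, "r", encoding="utf-8") as f:
    lines = f.readlines()

# Write results to another UTF-8 file instead of relying on print().
out_path = path + ".out"
with open(out_path, "w", encoding="utf-8") as out:
    for line in lines:
        out.write(line)
```

Writing to a file with an explicit encoding sidesteps whatever encoding the console happens to use.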

To get the actual codepoint (i.e. integer "value"), you can use the built-in ord():

>>> ord(u"£")
163

If you know the code point ranges for the different languages, that's all you need. See this page or this page for the ranges.
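
As a rough sketch of the range-based approach in Python 3, the following assumes the Hebrew block is U+0590 to U+05FF and the main CJK Unified Ideographs block is U+4E00 to U+9FFF. The helper names script_of and split_by_script are hypothetical, and each word is assumed to belong to a single script:

```python
# Assumed ranges: Hebrew block and the main CJK Unified Ideographs block.
HEBREW = (0x0590, 0x05FF)
CJK = (0x4E00, 0x9FFF)

def script_of(ch):
    """Classify one character by code point range; None for punctuation etc."""
    cp = ord(ch)
    if HEBREW[0] <= cp <= HEBREW[1]:
        return "hebrew"
    if CJK[0] <= cp <= CJK[1]:
        return "chinese"
    if cp < 128 and ch.isalpha():
        return "english"
    return None

def split_by_script(line):
    """Group whitespace-separated words by their first classifiable character."""
    groups = {"hebrew": [], "english": [], "chinese": []}
    for word in line.split():
        scripts = [s for s in map(script_of, word) if s is not None]
        if scripts:
            groups[scripts[0]].append(word.strip('"'))
    return tuple(" ".join(groups[k]) for k in ("hebrew", "english", "chinese"))

line = 'aiburenshi 爱不忍释 "לא מסוגל להינתק, לא יכול להיפרד מדבר מרוב חיבתו אליו"'
T = split_by_script(line)

# Hex code points of the Chinese part.
hex_values = [format(ord(ch), "04X") for ch in T[2]]
print(T)
print(hex_values)
```

Text outside these two blocks (for example rarer CJK extension characters) would need additional ranges.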

Otherwise, you might want to use unicodedata to look up stuff, like the bidirectional category:

>>> import unicodedata
>>> unicodedata.bidirectional(u"£")
'ET'  # 'E'uropean 'T'erminator
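
A small Python 3 sketch of such lookups, assuming a hypothetical helper named describe; it pairs the hex code point with the character's Unicode name and bidirectional class, either of which can hint at the script without hard-coded ranges:

```python
import unicodedata

def describe(ch):
    """Return (hex code point, Unicode name, bidi class) for one character."""
    return (
        format(ord(ch), "04X"),
        unicodedata.name(ch, "UNKNOWN"),  # second argument is the fallback name
        unicodedata.bidirectional(ch),
    )

for ch in "a释א£":
    print(describe(ch))
```

Hebrew letters report the right-to-left class 'R' and a name starting with "HEBREW", while Chinese ideographs have names starting with "CJK UNIFIED IDEOGRAPH".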
