
I am trying to split lines like this one in Python:

aiburenshi 爱不忍释 "לא מסוגל להינתק, לא יכול להיפרד מדבר מרוב חיבתו אליו"

This line contains Hebrew, simplified Chinese and English.

For example, I would like to end up with a tuple T = (Hebrew string, English string, Chinese string).

The problem is that I can't figure out how to get the Unicode value of the Chinese or the Hebrew letters. Neither of these lines works:

print ((unicode("释","utf-8")).encode("utf-8"))
print ((unicode("א","utf-8")).encode("utf-8"))

And I get this error:

SyntaxError: Non-ASCII character '\xe9' in file split_or.py on line 9, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Peter Mortensen
0x90
  • It might be worth indicating the version of Python you are using (2.x or 3.x) either in the question, tags, or both. – Samuel Harmer Jan 06 '12 at 09:44
  • Did you declare any encoding at the beginning of your file, such as #coding:utf-8? – Thomas Orozco Jan 06 '12 at 10:46
  • The problem you state is a very clear error that even contains the link to the text that tells you how to solve it. Why didn't you read the link? As a result, this is a duplicate of [working with utf-8 encoding in python source](http://stackoverflow.com/questions/6289474/working-with-utf-8-encoding-in-python-source) – Lennart Regebro May 04 '13 at 16:25
  • @LennartRegebro what are you talking about? – 0x90 May 04 '13 at 19:38
  • The error includes the text "see http://www.python.org/peps/pep-0263.html for details". That link tells you exactly how to fix the error you got. – Lennart Regebro May 04 '13 at 19:44

2 Answers

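In Python 2, Unicode string literals need the "u" prefix, and a source file that contains non-ASCII characters needs an encoding declaration on its first or second line (this is what the SyntaxError and PEP 263 refer to). A u"..." literal is already a unicode object, so it must not be passed through unicode(..., "utf-8") again, which raises a TypeError. Encode it directly instead:

# -*- coding: utf-8 -*-
print (u"释".encode("utf-8"))
print (u"א".encode("utf-8"))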

In Python 3, string literals are Unicode by default, and source files are read as UTF-8 unless they declare another encoding.
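As a small illustration of the Python 3 behaviour, the sketch below encodes the two characters from the question and decodes them again. The variable names are made up for this example. The first byte of the Chinese character is the same '\xe9' that the SyntaxError complains about.

```python
# Python 3: a plain string literal is already a Unicode str.
chinese = "释"
hebrew = "א"

# Encoding turns the str into UTF-8 bytes.
chinese_bytes = chinese.encode("utf-8")
hebrew_bytes = hebrew.encode("utf-8")

print(type(chinese).__name__)  # str
print(chinese_bytes)           # b'\xe9\x87\x8a'
print(hebrew_bytes)            # b'\xd7\x90'

# Decoding the bytes gives back the original strings.
round_trip_ok = (chinese_bytes.decode("utf-8") == chinese
                 and hebrew_bytes.decode("utf-8") == hebrew)
print(round_trip_ok)           # True
```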

Avi

In Python 2, you need to open the file specifying an encoding like this:

import codecs
f = codecs.open("myfile.txt","r",encoding="utf-8") 

In Python 3, you can pass the encoding argument directly to the built-in open().

This guarantees that the file is decoded correctly. It does not guarantee that your print calls will work, because that depends on many things, such as the console's encoding (see for example http://www.pycs.net/users/0000323/stories/14.html, and that's just a start). It is better to either use a proper debugger or write the output to a file (again opened with codecs.open()).
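A rough Python 3 sketch of that round trip follows: read a UTF-8 file with an explicit encoding, then write the result to another file instead of printing it. The file names and the shortened sample line are placeholders for this example, and a temporary directory is used so the snippet cleans up after itself.

```python
import os
import tempfile

# Shortened sample line, used only for this illustration.
sample = 'aiburenshi 爱不忍释 "לא מסוגל להינתק"\n'

with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "myfile.txt")   # hypothetical input file
    out_path = os.path.join(tmp, "out.txt")     # hypothetical output file

    # Create the input file as UTF-8 so there is something to read.
    with open(in_path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Read it back, decoding explicitly as UTF-8.
    with open(in_path, "r", encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]

    # Write the result to a file rather than relying on print().
    with open(out_path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")

    # Inspect the raw bytes that ended up on disk.
    with open(out_path, "rb") as f:
        raw = f.read()
```

Opening the output file in a UTF-8-aware editor is then enough to check the text, independent of how the console is configured.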

To get the actual codepoint (i.e. integer "value"), you can use the built-in ord():

>>> ord(u"£")
163

If you know the code point ranges for the different languages, that's all you need. See this page or this page for the ranges.
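As one possible way to apply the range idea to the line from the question, here is a Python 3 sketch. It assumes the Hebrew block (U+0590 to U+05FF) and the main CJK Unified Ideographs block (U+4E00 to U+9FFF) are the only ranges that matter for this data. The helper names script_of and split_by_script are invented for this example.

```python
import string

def script_of(ch):
    """Return a coarse script label for one character, or None if it is neutral."""
    cp = ord(ch)
    if 0x0590 <= cp <= 0x05FF:        # Hebrew block
        return "hebrew"
    if 0x4E00 <= cp <= 0x9FFF:        # CJK Unified Ideographs block
        return "chinese"
    if ch in string.ascii_letters:    # plain ASCII letters (the romanized part)
        return "latin"
    return None                       # spaces, quotes, punctuation, digits

def split_by_script(line):
    """Group characters by script; neutral characters join the current run."""
    parts = {"hebrew": "", "latin": "", "chinese": ""}
    current = None
    for ch in line:
        label = script_of(ch)
        if label is not None:
            current = label
        if current is not None:
            parts[current] += ch
    # Trim the spaces and double quotes picked up at the edges of each run.
    hebrew, latin, chinese = (
        parts[key].strip(' "') for key in ("hebrew", "latin", "chinese")
    )
    return (hebrew, latin, chinese)

line = 'aiburenshi 爱不忍释 "לא מסוגל להינתק, לא יכול להיפרד מדבר מרוב חיבתו אליו"'
T = split_by_script(line)
print(T[1])  # the Latin-letter part
print(T[2])  # the Chinese part
print(T[0])  # the Hebrew part
```

This sketch concatenates all runs of the same script, so a line with two separate Hebrew phrases would come back as one string. That is fine for lines shaped like the example, but other data may need a different grouping.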

Otherwise, you might want to use the unicodedata module to look things up, such as the bidirectional category:

>>> import unicodedata
>>> unicodedata.bidirectional(u"£")
'ET'  # 'E'uropean 'T'erminator
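When the ranges are not known in advance, a lookup along these lines may help. This is a Python 3 sketch; describe and script_from_name are hypothetical helpers, and taking the first word of the Unicode name is only a rough heuristic for the script.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, bidirectional class) for one character."""
    return (ord(ch), unicodedata.name(ch, "UNKNOWN"), unicodedata.bidirectional(ch))

def script_from_name(ch):
    """Guess the script from the first word of the character's Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

for ch in ("a", "释", "א"):
    print(describe(ch), script_from_name(ch))
# (97, 'LATIN SMALL LETTER A', 'L') LATIN
# (37322, 'CJK UNIFIED IDEOGRAPH-91CA', 'L') CJK
# (1488, 'HEBREW LETTER ALEF', 'R') HEBREW
```

The bidirectional class alone separates Hebrew ('R') from the other two, but not Chinese from Latin (both 'L'), which is why the name is consulted as well.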
Giacomo Lacava