I want to search for Hebrew text with re:

page = urlopen(url)
page_content = page.read()
founds = re.findall("מילים בעברית", page_content)

The error is: SyntaxError: Non-ASCII character '\xec' in file C:/Users/User/untitled/milimBeIvrit.py on line 12, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

0x90
Aluma Gelbard

2 Answers

Yes, re can handle UTF-8 strings.

You can change your default encoding if you want, but you don't have to.

>>> import sys
>>> import re
>>> sys.getdefaultencoding()
'ascii'

My default encoding is ascii and the following still works:

>>> a='אבא בא'
>>> results = re.findall("א", a)
>>> results
['\xd7\x90', '\xd7\x90', '\xd7\x90']

To print the results in human-readable form, use print:

>>> for r in results:
...     print r

א
א
א
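One caveat with matching on byte strings, as a rough sketch rather than a full treatment: a single literal letter matches fine, but a character class is applied byte by byte, so it returns fragments of letters. The variable names below are made up for illustration, and the file is assumed to be saved as UTF-8.

```python
# -*- coding: utf-8 -*-
import re

text = u"אבא בא"              # Unicode text: five Hebrew letters and one space
raw = text.encode("utf-8")     # the same text as UTF-8 bytes, two bytes per letter

# On Unicode text the class matches whole letters.
letters = re.findall(u"[אב]", text)

# On bytes the same class is a set of single bytes, so every match is
# one byte of a letter rather than a letter.
fragments = re.findall(u"[אב]".encode("utf-8"), raw)

print(len(letters))    # 5
print(len(fragments))  # 10
```

So for anything beyond a fixed literal, decoding to Unicode before matching is the safer route.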

Note that IDLE has had some issues with UTF-8 handling, so consider using an IDE such as PyCharm.
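As for the SyntaxError itself: it is raised before any regex runs, because Python 2 rejects non-ASCII characters in a source file that has no PEP 263 encoding declaration. A minimal sketch of the fix, assuming the file is saved as UTF-8; the sample string is made up and only stands in for the downloaded page.

```python
# -*- coding: utf-8 -*-
# PEP 263 declaration: it must sit on the first or second line of the file
# and must match the encoding the editor really saves the file in
# (UTF-8 is assumed here).
import re

# Hypothetical sample standing in for the page text.
sample = u"כמה מילים בעברית בתוך דף"

# The u prefix makes the pattern a Unicode string in Python 2 as well.
founds = re.findall(u"מילים בעברית", sample)
print(len(founds))  # 1
```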

0x90
  • That doesn't work with the ascii encoding. How can I change it? – Aluma Gelbard Jan 01 '16 at 20:02
  • [sys.setdefaultencoding('utf-8')](http://stackoverflow.com/questions/3828723/why-we-need-sys-setdefaultencodingutf-8-in-a-py-script) – 0x90 Jan 01 '16 at 20:03
  • So you have a different problem. Update your question accordingly or ask another question. As you can see, this works perfectly for me. – 0x90 Jan 01 '16 at 20:10

You do not say whether this is Python 2 or 3. In Python 2 you will have to juggle encode and decode, because plain strings there are byte strings rather than Unicode.
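The encode/decode juggling could look roughly like this. It is only a sketch: the byte string stands in for page.read() from the question, the HTML snippet is made up, and the page is assumed to be UTF-8 (a real page may declare a different encoding, which you would need to check).

```python
# -*- coding: utf-8 -*-
import re

# Stand-in for page.read(): urlopen hands back raw bytes, not text.
# UTF-8 is an assumption; older Hebrew pages sometimes use windows-1255.
page_content = u"<p>מילים בעברית</p>".encode("utf-8")

text = page_content.decode("utf-8")          # bytes -> Unicode text
founds = re.findall(u"מילים בעברית", text)   # match on text, not on bytes

# Encode again only at the boundary, e.g. before writing to a file.
encoded = [f.encode("utf-8") for f in founds]
print(len(founds))  # 1
```

Written with u"" literals, the same lines run on Python 2 and on Python 3.3 or later.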

In Python 3, however, this is how I would do it. Sorry, I am not good with Hebrew, so here is a small Arabic example instead; the principle is the same.

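import re

# Python 3: str is Unicode, so the pattern and the text can be used as-is
sentence = 'المتساقطة، تحت. من كردة مسارح قُدُماً ضرب, لان بشكل أكثر'
fs = re.search('لان', sentence)  # a match object, or None if nothing matched
if fs:
    print("Found it")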

I have no idea what the Arabic text means; I pulled it from http://generator.lorem-ipsum.info/_arabic.

I must stress: Unicode text is easy in Python 3 but far more painful in Python 2.

Exactly the same as my Arabic example, this time using Hebrew lorem ipsum (which I never knew existed until 30 seconds ago).

import re
sen2="רביעי ביולוגיה את אתה. מתן של מיזם המלצת ליצירתה, גם שכל חשמל אדריכלות למתחילים. צילום הבאים בעברית אחד בה. בדף או ריקוד מונחים לחשבון, ב הקהילה רב־לשוני זכר, וספציפיים האנציקלופדיה אל חפש. מתן אל נפלו עזרה אנתרופולוגיה."
fs = re.search('בדף', sen2)
if fs:
    print("Found it")

Looks OK to me.

0x90
Tim Seed