1

Hebrew characters occupy the Unicode code points 1424 to 1514 (hex 0590 to 05EA).

I'm looking for the right, most efficient and most Pythonic way to check whether a string contains any Hebrew characters.

First I came up with this:

for c in s:
    if ord(c) >= 1424 and ord(c) <= 1514:
        return True
return False

Then I came up with a more elegant implementation:

return any(map(lambda c: (ord(c) >= 1424 and ord(c) <= 1514), s))

And maybe:

return any([(ord(c) >= 1424 and ord(c) <= 1514) for c in s])

Which of these is the best? Or should I do it differently?

Elise van Looij
iTayb
  • 2
    Try using a regular expression for the range of characters you want to look for. [See this question for the details.](http://stackoverflow.com/questions/1694350/how-can-i-detect-hebrew-characters-both-iso8859-8-and-utf8-in-a-string-using-php) – Li-aung Yip May 19 '12 at 10:17

3 Answers

18

You could do:

# Python 3.
return any("\u0590" <= c <= "\u05EA" for c in s)
# Python 2.
return any(u"\u0590" <= c <= u"\u05EA" for c in s)
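
To try this out, the expression can be wrapped in a function. This is only a minimal sketch for Python 3: the name `contains_hebrew` is made up for illustration, and the range is the one given in the question (U+0590 to U+05EA).

```python
def contains_hebrew(s):
    """Return True if any character of s lies in U+0590..U+05EA."""
    # Comparing one-character strings compares their code points,
    # so no ord() call is needed.
    return any("\u0590" <= c <= "\u05EA" for c in s)


print(contains_hebrew("hello"))                          # False
print(contains_hebrew("hello \u05e9\u05dc\u05d5\u05dd"))  # True
```

Because `any()` consumes a generator here, it stops at the first Hebrew character instead of scanning the whole string.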
MRAB
2

It's simple to check the first character with unicodedata:

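import unicodedata

def is_greek(term):
    # True if the first non-whitespace character is Greek.
    # The '' default avoids a ValueError for characters without a name;
    # the bool() check avoids an IndexError on empty input.
    stripped = term.strip()
    return bool(stripped) and 'GREEK' in unicodedata.name(stripped[0], '')


def is_hebrew(term):
    # Same check for Hebrew.
    stripped = term.strip()
    return bool(stripped) and 'HEBREW' in unicodedata.name(stripped[0], '')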
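
The question asks about the whole string rather than only the first character, so the same idea could be extended along these lines. This is a sketch, not part of the original answer; `contains_script` and its `script_word` parameter are hypothetical names.

```python
import unicodedata


def contains_script(s, script_word):
    """Return True if any character's Unicode name contains script_word."""
    # unicodedata.name() raises ValueError for unnamed characters
    # (e.g. control characters) unless a default is supplied.
    return any(script_word in unicodedata.name(c, "") for c in s)


print(contains_script("abc \u05d0", "HEBREW"))  # True
print(contains_script("abc", "HEBREW"))         # False
```

Note that a name-based check is not identical to the range check in the question: it also matches characters outside U+0590 to U+05EA whose names contain "HEBREW", such as the Hebrew presentation forms starting at U+FB1D.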
yekta
1

Your basic options are:

  1. Match against a regex containing the range of characters; or
  2. Iterate over the string, testing for membership of the character in a string or set containing all of your target characters, and break if you find a match.

Only actual testing can show which is going to be faster.
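
The two options could look roughly like this in Python 3. This is a sketch under assumptions: the function and constant names are invented for illustration, and the character range is the one from the question (U+0590 to U+05EA).

```python
import re

# Assumed range, taken from the question: U+0590 to U+05EA inclusive.
HEBREW_RE = re.compile("[\u0590-\u05EA]")
HEBREW_CHARS = frozenset(chr(cp) for cp in range(0x0590, 0x05EA + 1))


def contains_hebrew_regex(s):
    # Option 1: search() stops at the first matching character.
    return HEBREW_RE.search(s) is not None


def contains_hebrew_set(s):
    # Option 2: any() short-circuits on the first member found,
    # which plays the role of the "break".
    return any(c in HEBREW_CHARS for c in s)
```

Both the compiled pattern and the set are built once at import time, so their setup cost is paid only once. To compare them on your own data, the standard `timeit` module can time each function on representative strings.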

Marcin
  • Both are **much** slower than what he already has. Testing a character against a defined range is definitely faster than checking membership in a ~100-character string or matching against a regex. – lenik May 19 '12 at 10:22
  • 1
    @lenik That's the weakest response I've ever seen. I hope for your sake you don't pull that in the office. – Marcin May 19 '12 at 11:29
  • 2
    @lenik: In fact, you're wrong. In my tests, a regex is easily fastest. The next best (and more Pythonic) is to reverse Marcin's suggestion 2, so you iterate over the hebrew characters and test for membership in the string. The numbers: https://gist.github.com/2730521 – Thomas K May 19 '12 at 11:34
  • (obviously it depends a bit on the conditions - I'm assuming hebrew characters are relatively rare in the input, and that the program processes enough strings that setup costs can be ignored) – Thomas K May 19 '12 at 11:41
  • @ThomasK `chars = {chr(x) for x in range(0x590, 0x5EA + 1)}` requires python 3, while original versions support python 2, so your results prove nothing unless you provide another python 2 compatible solution – lenik May 19 '12 at 15:07
  • @lenik: change `chr` to `unichr`. Set literals are in Python 2.7, but for older versions you can use an explicit `set()`. And the performance is, unsurprisingly, pretty much the same, with regexes still easily fastest. Here you are: https://gist.github.com/2731446 – Thomas K May 19 '12 at 16:45
  • @ThomasK Thank you, I've checked your tests and my results are similar to yours: 1) membership in a string is slower; 2) membership in a set() is almost the same as the original range check; 3) regex is surprisingly fast. Unfortunately, regex aside, membership testing (in a string or set) becomes inferior as the data set grows large (think Chinese or Japanese), while range-check performance stays constant no matter how large the range is. – lenik May 19 '12 at 17:19
  • @lenik: set membership tests shouldn't degrade with size. I'd be interested to see tests of the different methods for CJK characters. – Thomas K May 19 '12 at 17:26
  • @ThomasK while set() membership test might be only O(1), large sets take more time to create and more memory to keep. range tests require neither. – lenik May 19 '12 at 17:49
  • 1
    @lenik: I'm assuming the program tests enough strings that the cost of creating the set can be ignored. Also, Chinese characters are not in one contiguous range, so you'd need a more complex range check for each character. – Thomas K May 19 '12 at 18:20