Given a Unicode character, what would be the simplest way to return its script (as "Latin", "Hangul" etc.)? `unicodedata` doesn't seem to provide this kind of feature.
-
What do you mean by "script value"? – Daniel Roseman Mar 26 '12 at 08:29
-
See http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file . The link says that correctly detecting the encoding is not always possible. – amit kumar Mar 26 '12 at 08:34
-
@DanielRoseman: http://en.wikipedia.org/wiki/Script_%28Unicode%29 – georg Mar 26 '12 at 08:41
-
phaedrus, they're not trying to detect how the codepoint is encoded, just which writing system it's from (hence "Latin", "Hangul"), which is do-able so long as you're happy to accept either a vague answer or none for some codepoints. – tialaramex Mar 26 '12 at 08:41
-
@phaedrus: I understand that that question is not about detecting encodings, but about what is called the "standardized subsets" of Unicode. Reference: http://en.wikipedia.org/wiki/Unicode#Standardized_subsets. – Eric O. Lebigot Mar 26 '12 at 08:47
5 Answers
I was hoping someone had done this before, but apparently not, so here's what I've ended up with. The module below (I call it `unicodedata2`) extends `unicodedata` and provides `script_cat(chr)`, which returns a tuple (Script name, Category) for a unicode char. Example:
# coding=utf8
import unicodedata2
print unicodedata2.script_cat(u'Ф')  # ('Cyrillic', 'L')
print unicodedata2.script_cat(u'の')  # ('Hiragana', 'Lo')
print unicodedata2.script_cat(u'★')  # ('Common', 'So')
The module: https://gist.github.com/2204527
-
@EOL: just out of curiosity, what's the point of your edit? I'm no emacs user, so I'm not sure what these `-*-` are good for. – georg Mar 26 '12 at 14:14
-
@thg435: Good question: I thought that the `-*-` syntax was a general Python convention, but then I checked PEP 263 and discovered it was not. :) I mostly reverted the change (the new version reflects PEP 263 better, though). I added a space before the comment mark `#`, so as to follow the "at least two spaces" PEP 8 convention (reference: http://www.python.org/dev/peps/pep-0008/#inline-comments). – Eric O. Lebigot Mar 27 '12 at 03:01
-
@thg, would you consider modifying your gist to add a liberal open source license like BSD or MIT? I'd like to include it in my project, but it's enough code that I don't feel comfortable doing that unlicensed. – Reid Jan 11 '13 at 17:00
-
@Reid: this one is not referred to elsewhere and is an integral part of my post and thus, as all contributions to SO, is [licensed under CC](http://stackoverflow.com/faq#editing) by default. – georg Jan 11 '13 at 19:26
-
Great work! And thanks for providing that under CC (i.e. CC-by-sa). But that is problematic for code, and CC says to use either MIT, GPL or LGPL. See e.g. a lot of headaches discussed here: http://wiki.creativecommons.org/GPL_compatibility_use_cases So please pick one of those programming licenses and let us use that. Thanks again! – nealmcb Feb 04 '13 at 06:34
-
Thanks a lot! How do you know that all those numbers in the code are correct? – Evgeny May 04 '15 at 06:20
-
Have you had a chance to consider licensing your code under the MIT license or similar please? Please see @nealmcb's comment above as to why plain CC-by-sa is problematic for code. – Robie Basak May 24 '18 at 23:28
It seems to me that the Python unicodedata module contains tools for accessing the main file in the Unicode database but nothing for the other files: “The data in this database is based on the UnicodeData.txt file”
The script information is in the Scripts.txt file. It has a relatively simple format (described in UAX #44) and is not horribly large (131 kilobytes), so you might consider parsing it in your program, as sketched below. Note that in the Unicode classification there's the "Common" script, which contains characters used across different scripts, like punctuation marks.
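For illustration, here's a minimal sketch of such a parser, assuming a local copy of Scripts.txt (the function names are my own, not part of any library):

def parse_scripts(path):
    """Parse Scripts.txt (UAX #44 format) into a sorted list of
    (first, last, script) code point ranges."""
    ranges = []
    for line in open(path):
        line = line.split('#', 1)[0].strip()  # drop trailing comments
        if not line:
            continue  # skip blank and comment-only lines
        codes, script = [field.strip() for field in line.split(';')]
        if '..' in codes:  # a range such as "0400..0484"
            first, last = [int(cp, 16) for cp in codes.split('..')]
        else:  # a single code point such as "00AA"
            first = last = int(codes, 16)
        ranges.append((first, last, script))
    ranges.sort()
    return ranges

def script_of(char, ranges, default='Unknown'):
    """Return the script of a single character (simple linear scan)."""
    cp = ord(char)
    for first, last, script in ranges:
        if first <= cp <= last:
            return script
    return default

Reusing the examples from the accepted answer:

>>> ranges = parse_scripts('Scripts.txt')
>>> script_of(u'Ф', ranges)
'Cyrillic'
>>> script_of(u'★', ranges)
'Common'

A binary search over the sorted ranges would be faster for bulk lookups.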

The only way I know of is unfortunately to get the Unicode code point with `ord()` and then use your own table (built from http://en.wikipedia.org/wiki/Unicode#Standardized_subsets and more). A preliminary conversion to some normal form may be in order, so as to handle the fact that a single "written" character can be expressed with different sequences of code points (the `unicodedata` module helps here); a sketch of this approach follows.
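A minimal sketch of that approach, with a deliberately tiny hand-made table (a real table would be built from the range data mentioned above):

import unicodedata

# Deliberately tiny, illustrative table; a real one would cover many
# more ranges than these four blocks.
SCRIPT_RANGES = [
    (0x0041, 0x024F, 'Latin'),
    (0x0370, 0x03FF, 'Greek'),
    (0x0400, 0x04FF, 'Cyrillic'),
    (0xAC00, 0xD7AF, 'Hangul'),
]

def script_of(char):
    # Compose first (NFC), so that e.g. a decomposed "e" followed by a
    # combining acute accent is looked up as the single code point U+00E9.
    char = unicodedata.normalize('NFC', char)[0]
    cp = ord(char)
    for start, end, name in SCRIPT_RANGES:
        if start <= cp <= end:
            return name
    return None

>>> script_of(u'e\u0301')  # "é" written as "e" + combining acute
'Latin'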

You can use `ord` to retrieve the numeric value of a character (it works on both unicode and byte strings of length 1). The next step, unfortunately, will involve testing that value against the ranges yourself; a bisect-based sketch follows. Possibly the data here will be of assistance: http://cldr.unicode.org/index/downloads
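For example, once the ranges are in a sorted list, the standard bisect module keeps lookups fast; the sample ranges below are hand-picked for illustration, not a complete table:

import bisect

# Sorted, non-overlapping (start, end, name) ranges; in practice these
# would be generated from the Unicode or CLDR data files.
RANGES = [
    (0x0000, 0x007F, 'Basic Latin'),
    (0x0400, 0x04FF, 'Cyrillic'),
    (0x3040, 0x309F, 'Hiragana'),
    (0xAC00, 0xD7AF, 'Hangul Syllables'),
]
STARTS = [start for start, end, name in RANGES]

def lookup(char):
    cp = ord(char)
    i = bisect.bisect_right(STARTS, cp) - 1  # last range starting at or before cp
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return None  # not covered by the table

>>> lookup(u'の')
'Hiragana'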

Oftentimes it is enough to detect whether a certain script is used, and then you can use `unicodedata.name` with prefix matching. For example, to find out whether a letter is Cyrillic, you can use:
import unicodedata

class CharacterNamePrefixTester(dict):
    def __init__(self, prefix):
        self.prefix = prefix
    def __missing__(self, key):
        # Look up the character's Unicode name lazily; memoize the result.
        self[key] = unicodedata.name(key, '').startswith(self.prefix)
        return self[key]
>>> cyrillic = CharacterNamePrefixTester('CYRILLIC ')
>>> cyrillic['й']
True
>>> cyrillic['a']
False
The dictionary is built lazily but the truth values are memoized so that future lookups of the same letter will be faster.

-
yes, good idea, however there are names like 'COMBINING CYRILLIC whatever' – georg Dec 07 '19 at 11:07