Given a Unicode character, what would be the simplest way to return its script (as "Latin", "Hangul" etc.)? `unicodedata` doesn't seem to provide this kind of feature.
-
What do you mean by "script value"? – Daniel Roseman Mar 26 '12 at 08:29
-
See http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file . The link says that correctly detecting the encoding is not always possible. – amit kumar Mar 26 '12 at 08:34
-
@DanielRoseman: http://en.wikipedia.org/wiki/Script_%28Unicode%29 – georg Mar 26 '12 at 08:41
-
phaedrus, they're not trying to detect how the codepoint is encoded, just which writing system it's from (hence "Latin", "Hangul"), which is do-able so long as you're happy to accept either a vague answer or none for some codepoints. – tialaramex Mar 26 '12 at 08:41
-
@phaedrus: I understand that that question is not about detecting encodings, but about what is called the "standardized subsets" of Unicode. Reference: http://en.wikipedia.org/wiki/Unicode#Standardized_subsets. – Eric O. Lebigot Mar 26 '12 at 08:47
5 Answers
I was hoping someone had done this before, but apparently not, so here's what I've ended up with. The module below (I call it `unicodedata2`) extends `unicodedata` and provides `script_cat(chr)`, which returns a tuple (Script name, Category) for a unicode char. Example:
# coding=utf8
import unicodedata2
print unicodedata2.script_cat(u'Ф')  # ('Cyrillic', 'L')
print unicodedata2.script_cat(u'の')  # ('Hiragana', 'Lo')
print unicodedata2.script_cat(u'★')  # ('Common', 'So')
The module: https://gist.github.com/2204527
-
@EOL: just out of curiosity, what's the point of your edit? I'm no emacs user, so I'm not sure what these `-*-` are good for. – georg Mar 26 '12 at 14:14
-
@thg435: Good question: I thought that the `-*-` syntax was a general Python convention, but then I checked PEP 263 and discovered it was not. :) I mostly reverted the change (the new version reflects PEP 263 better, though). I added a space before the comment mark `#`, so as to follow the "at least two spaces" PEP 8 convention (reference: http://www.python.org/dev/peps/pep-0008/#inline-comments). – Eric O. Lebigot Mar 27 '12 at 03:01
-
@thg, would you consider modifying your gist to add a liberal open source license like BSD or MIT? I'd like to include it in my project, but it's enough code that I don't feel comfortable doing that unlicensed. – Reid Jan 11 '13 at 17:00
-
@Reid: this one is not referred to elsewhere and is an integral part of my post and thus, as all contributions to SO, is [licensed under CC](http://stackoverflow.com/faq#editing) by default. – georg Jan 11 '13 at 19:26
-
Great work! And thanks for providing that under CC (i.e. CC-by-sa). But that is problematic for code, and CC says to use either MIT, GPL or LGPL. See e.g. a lot of headaches discussed here: http://wiki.creativecommons.org/GPL_compatibility_use_cases So please pick one of those programming licenses and let us use that. Thanks again! – nealmcb Feb 04 '13 at 06:34
-
Thanks a lot! How do you know that all those numbers in the code are correct? – Evgeny May 04 '15 at 06:20
-
Have you had a chance to consider licensing your code under the MIT license or similar please? Please see @nealmcb's comment above as to why plain CC-by-sa is problematic for code. – Robie Basak May 24 '18 at 23:28
It seems to me that the Python unicodedata module contains tools for accessing the main file in the Unicode database but nothing for the other files: “The data in this database is based on the UnicodeData.txt file”
The script information is in the Scripts.txt file. It has a relatively simple format (described in UAX #44) and is not horribly large (131 kilobytes), so you might consider parsing it in your program, as sketched below. Note that in the Unicode classification there's the "Common" script, which contains characters used across different scripts, like punctuation marks.
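For illustration, here's a minimal sketch of such a parser, assuming a local copy of Scripts.txt (the function names are my own, not part of any library):

def parse_scripts(path):
    """Parse Scripts.txt (UAX #44 format) into a sorted list of
    (first, last, script) code point ranges."""
    ranges = []
    for line in open(path):
        line = line.split('#', 1)[0].strip()  # drop trailing comments
        if not line:
            continue  # skip blank and comment-only lines
        codes, script = [field.strip() for field in line.split(';')]
        if '..' in codes:  # a range such as "0400..0484"
            first, last = [int(cp, 16) for cp in codes.split('..')]
        else:  # a single code point such as "00AA"
            first = last = int(codes, 16)
        ranges.append((first, last, script))
    ranges.sort()
    return ranges

def script_of(char, ranges, default='Unknown'):
    """Return the script of a single character (simple linear scan)."""
    cp = ord(char)
    for first, last, script in ranges:
        if first <= cp <= last:
            return script
    return default

Reusing the examples from the accepted answer:

>>> ranges = parse_scripts('Scripts.txt')
>>> script_of(u'Ф', ranges)
'Cyrillic'
>>> script_of(u'★', ranges)
'Common'

A binary search over the sorted ranges would be faster for bulk lookups.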

The only way I know of is unfortunately to get the Unicode code point with `ord()` and then use your own table (built from http://en.wikipedia.org/wiki/Unicode#Standardized_subsets and more). A preliminary conversion to some normal form may be in order, so as to handle the fact that a single "written" character can be expressed with different sequences of code points (the `unicodedata` module helps here); a sketch of this approach follows.
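A minimal sketch of that approach, with a deliberately tiny hand-made table (a real table would be built from the range data mentioned above):

import unicodedata

# Deliberately tiny, illustrative table; a real one would cover many
# more ranges than these four blocks.
SCRIPT_RANGES = [
    (0x0041, 0x024F, 'Latin'),
    (0x0370, 0x03FF, 'Greek'),
    (0x0400, 0x04FF, 'Cyrillic'),
    (0xAC00, 0xD7AF, 'Hangul'),
]

def script_of(char):
    # Compose first (NFC), so that e.g. a decomposed "e" followed by a
    # combining acute accent is looked up as the single code point U+00E9.
    char = unicodedata.normalize('NFC', char)[0]
    cp = ord(char)
    for start, end, name in SCRIPT_RANGES:
        if start <= cp <= end:
            return name
    return None

>>> script_of(u'e\u0301')  # "é" written as "e" + combining acute
'Latin'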

You can use `ord` to retrieve the numeric value of a character (it works on both unicode and byte strings of length 1). The next step, unfortunately, will involve testing that value against the ranges yourself; a bisect-based sketch follows. Possibly the data here will be of assistance: http://cldr.unicode.org/index/downloads
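For example, once the ranges are in a sorted list, the standard bisect module keeps lookups fast; the sample ranges below are hand-picked for illustration, not a complete table:

import bisect

# Sorted, non-overlapping (start, end, name) ranges; in practice these
# would be generated from the Unicode or CLDR data files.
RANGES = [
    (0x0000, 0x007F, 'Basic Latin'),
    (0x0400, 0x04FF, 'Cyrillic'),
    (0x3040, 0x309F, 'Hiragana'),
    (0xAC00, 0xD7AF, 'Hangul Syllables'),
]
STARTS = [start for start, end, name in RANGES]

def lookup(char):
    cp = ord(char)
    i = bisect.bisect_right(STARTS, cp) - 1  # last range starting at or before cp
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return None  # not covered by the table

>>> lookup(u'の')
'Hiragana'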

Oftentimes it is enough to detect whether a certain script is used, and then you can use `unicodedata.name` with prefix matching. For example, to find out whether a letter is Cyrillic, you can use:
import unicodedata

class CharacterNamePrefixTester(dict):
    def __init__(self, prefix):
        self.prefix = prefix
    def __missing__(self, key):
        # Look up the character's Unicode name lazily; memoize the result.
        self[key] = unicodedata.name(key, '').startswith(self.prefix)
        return self[key]
>>> cyrillic = CharacterNamePrefixTester('CYRILLIC ')
>>> cyrillic['й']
True
>>> cyrillic['a']
False
The dictionary is built lazily but the truth values are memoized so that future lookups of the same letter will be faster.

-
yes, good idea, however there are names like 'COMBINING CYRILLIC whatever' – georg Dec 07 '19 at 11:07