Rough Unicode -> Language Code without CLDR?

Question

I am writing a dictionary app. If a user types a Unicode character, I want to check which languages the character could belong to.

e.g.

字 - returns ['zh', 'ja', 'ko'] 
العربية - returns ['ar']
a - returns ['en', 'fr', 'de'] //and many more
й - returns ['ru', 'be', 'bg', 'uk']

I searched and found that it could be done with CLDR https://stackoverflow.com/a/6445024/41948

Or with the Google API, as in the question "Python - can I detect unicode string language code?"

But in my case:

Looking up a large charmap database seems to cost a lot of storage and memory.
Calling an API is too slow, and it requires a network connection.
It doesn't need to be very accurate; roughly 80% correct is acceptable.
Simple and fast is the main requirement.
It's OK to cover only UCS-2 (BMP) characters.

Any tips?

I need to use this in Python and Javascript. Thanks!

It might help, in evaluating possible approaches, to know why you would do this. What would you do with the information that letter “a” is used in some large list of languages? — Jukka K. Korpela, Feb 01 '13 at 09:48
Maybe "a" is just a bad example. As I mentioned, I am writing a dictionary app, which means I can provide additional information (or ads) based on the language the user is trying to look up. — est, Feb 01 '13 at 10:05
I think “a” is a good example: there would be hundreds of possible languages, so it would be rather difficult to guess *the* language. — Jukka K. Korpela, Feb 01 '13 at 11:14

m.brindley · Accepted Answer · 2013-02-01T09:41:46.360

Would it be sufficient to narrow the glyph down to language families? If so, you could create a set of ranges (language -> code range) based on the mapping of BMP like the one shown at http://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane or the Scripts section of the Unicode charts page - http://www.unicode.org/charts/

Reliably determining parent language for glyphs is definitely more complicated because of the number of shared symbols. If you only need 80% accuracy, you could potentially adjust your ranges for certain languages to intentionally include/leave out certain characters if it simplifies your ranges.
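A minimal sketch of the range idea in Python, assuming a small hand-picked table. The ranges and the language lists attached to them are illustrative only (they are deliberately coarse and incomplete), and the names SCRIPT_RANGES and guess_languages are made up for this example:

```python
from bisect import bisect_right

# (first code point, last code point, candidate languages).
# Hand-picked and deliberately coarse; must stay sorted by first code point.
SCRIPT_RANGES = [
    (0x0041, 0x024F, ["en", "fr", "de"]),        # roughly Latin (also catches some punctuation)
    (0x0400, 0x04FF, ["ru", "be", "bg", "uk"]),  # Cyrillic block
    (0x0600, 0x06FF, ["ar"]),                    # Arabic block
    (0x3040, 0x30FF, ["ja"]),                    # Hiragana and Katakana
    (0x4E00, 0x9FFF, ["zh", "ja", "ko"]),        # CJK Unified Ideographs
    (0xAC00, 0xD7AF, ["ko"]),                    # Hangul Syllables
]
_STARTS = [start for start, _, _ in SCRIPT_RANGES]


def guess_languages(char):
    """Return the candidate languages for a single character, or [] if unknown."""
    cp = ord(char)
    # Find the last range whose start is <= cp, then check its upper bound.
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0:
        _, end, langs = SCRIPT_RANGES[i]
        if cp <= end:
            return list(langs)
    return []
```

The whole table is a few dozen bytes and a lookup is a binary search, so the same structure ports directly to JavaScript. For a whole word you could take the union (or the most common result) over its characters.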

Edit: I re-read through the question you referenced CLDR from and the first answer regarding code -> language mapping. I think that's definitely out of the question but the reverse seems feasible if a bit computationally expensive. With clever data structuring, you could identify language families and then drill down to the actual language ranges from there, reducing traversals through irrelevant language -> range pairs.
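One way the "family first, then drill down" idea might look, as a rough sketch. The family table and the per-character overrides are hypothetical sample data covering only a handful of letters, and FAMILIES, OVERRIDES and guess_languages are names invented for this example:

```python
# Stage 1: coarse families as (name, first, last, default languages).
FAMILIES = [
    ("cyrillic", 0x0400, 0x04FF, ["ru", "be", "bg", "uk"]),
    ("arabic", 0x0600, 0x06FF, ["ar"]),
]

# Stage 2: per-family overrides for letters that only some languages use.
# Only a few sample letters are listed; anything else falls back to the
# family default.
OVERRIDES = {
    "cyrillic": {
        "ї": ["uk"],
        "ў": ["be"],
        "ы": ["ru", "be"],
        "ё": ["ru", "be"],
    },
}


def guess_languages(char):
    cp = ord(char)
    for name, first, last, default_langs in FAMILIES:
        if first <= cp <= last:
            # Only this family's overrides are consulted, so unrelated
            # language data is never touched.
            return list(OVERRIDES.get(name, {}).get(char, default_langs))
    return []
```

The point of the split is that the second stage only has to store the exceptions, which keeps the data small while still separating, say, Ukrainian-only letters from the rest of Cyrillic.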

babbageclunk · Answer 2 · 2013-02-01T15:09:11.880

If the number of languages is relatively small (or the number you care about is fairly small), you could use a Bloom filter for each language. Bloom filters let you do very cheap membership tests (which can have false positives) without having to store all of the members (in this case the code points) in memory. Then you build your result set by checking the code point against each language's preconstructed filter. It's tuneable - if you get too many false positives, you can use a larger size filter, at the cost of memory.
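To make that concrete, here is a toy Bloom filter in pure Python, not any particular library's API. The class, the hashing scheme (SHA-256 with a seed prefix), the default sizes and the two sample alphabets are all assumptions chosen for illustration; in practice you would populate each filter from real text in that language:

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over integer code points (illustrative only)."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, code_point):
        # Derive num_hashes bit positions by hashing "seed:code_point".
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{code_point}".encode("ascii")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, code_point):
        for pos in self._positions(code_point):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, code_point):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(code_point))


# Sample data: lowercase alphabets only.
ALPHABETS = {
    "ru": "абвгдеёжзийклмнопрстуфхцчшщъыьэюя",
    "uk": "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя",
}

FILTERS = {}
for lang, letters in ALPHABETS.items():
    bloom = BloomFilter()
    for ch in letters:
        bloom.add(ord(ch))
    FILTERS[lang] = bloom


def languages_for(char):
    """Languages whose filter (possibly falsely) reports the character."""
    return sorted(lang for lang, bloom in FILTERS.items() if ord(char) in bloom)
```

A character that was added is always reported, so the result can only err by including extra languages. Raising size_bits lowers that false-positive rate at the cost of memory.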

There are Bloom filter implementations for Python and Javascript. (Hey - I've met the guy who did this one! http://www.jasondavies.com/bloomfilter/)

Bloom filters: http://en.m.wikipedia.org/wiki/Bloom_filter

Doing a bit more reading, if you only need the BMP (65,536 code points), you could just store a straight bit set for each language (8 KB per language). Or a 2D bit array indexed by language × code point.
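The plain bit set version is even simpler, and it is exact. A sketch, again with hypothetical names and the same two sample alphabets standing in for real per-language character data:

```python
BMP_SIZE = 0x10000  # 65,536 code points, so 8,192 bytes per language


def make_bitset(chars):
    """Build a BMP-sized bit set with one bit set per character in chars."""
    bits = bytearray(BMP_SIZE // 8)
    for ch in chars:
        cp = ord(ch)
        if cp < BMP_SIZE:  # ignore anything outside the BMP
            bits[cp >> 3] |= 1 << (cp & 7)
    return bits


def in_bitset(bits, char):
    cp = ord(char)
    return cp < BMP_SIZE and bool(bits[cp >> 3] & (1 << (cp & 7)))


# Sample data: lowercase alphabets only.
BITSETS = {
    "ru": make_bitset("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
    "uk": make_bitset("абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"),
}


def languages_for(char):
    """Languages whose bit set contains the character (no false positives)."""
    return sorted(lang for lang, bits in BITSETS.items() if in_bitset(bits, char))
```

At 8 KB per language, 100 languages cost about 800 KB uncompressed, which may or may not be acceptable for a client-side app. Most of each array is zeros, so it should compress well for transfer.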

How many languages do you want to consider?

I actually really like the idea of using a bloom filter and pre-populating with a bunch of Wikipedia articles in plaintext from various language wikis but international characters is such a huge set; I get the feeling that k and m would have to be obnoxiously large. — m.brindley, Feb 01 '13 at 09:59
Well, that depends how sensitive est is to false positives, I guess - I think it would need a bit of experimenting to find the sweet spot (if it exists). I like that it's simple and fast, though. — babbageclunk, Feb 01 '13 at 10:24
