16

I am trying to find a resource that connects languages (or, more probably, scripts) to blocks of Unicode characters. Such a resource would be used to look up answers to questions such as "What Unicode blocks are used in French?" or "What languages use the block from 0A80-0AFF (http://unicodinator.com/#Block-Gujarati)?" Do you know of such a resource?

I would have expected to be able to find this information easily at unicode.org. I was quickly able to find a great table that relates country codes to languages (http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/territory_language_information.html), but I've spent quite a bit of time poking around with no luck finding something that relates Unicode blocks to languages. It's possible I've got a terminology issue blocking me from connecting the dots here...

I am not picky about exactly what is meant by "language" (Java Locale code or ISO 639 code or whatever) in this case. I also understand that there may not be exact answers because, for instance, an Arabic document can contain Latin and other text in addition to characters from the Arabic blocks (http://unicodinator.com/#Block-Arabic, http://unicodinator.com/#Block-Arabic_Supplement). But surely there must be some table that says "these languages go with these blocks"... I'm also not picky about the format (XML, CSV, whatever), I can easily transform this into data I can use for my application. And again, I do realize the reference would probably connect Scripts to Blocks, not Languages (though Scripts can be mapped to Languages).

I do realize this will be a many-to-many table (since many languages use characters from multiple blocks, and many blocks are used by multiple languages). I also realize this cannot be answered precisely, since Unicode codepoints are not language-specific -- but neither can the question "what languages are there in this country?" (the answer is probably "most of them" for most countries), yet a table like this (http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/territory_language_information.html) is still possible to create, meaningful, and useful.

As to why I'd want such a thing: I would like to enhance http://unicodinator.com with global heat-maps for the code blocks, and lists of languages; I also have a game concept I am tinkering with. Beyond that, there are probably many other uses other people could have for this (font creation? heuristic, quick, best-guess language detection now that the Google Translate API is going away? research projects?).

jwl
  • What about blocks that can belong to multiple languages? – Ignacio Vazquez-Abrams Jun 21 '11 at 22:50
  • yes @Ignacio, there will definitely be a many-to-many relationship. – jwl Jun 21 '11 at 22:52
  • I don't think this is answerable. Consider words borrowed from other languages. English doesn't normally have accents, but you'll find "résumé" in any English dictionary. – Joe White Jun 21 '11 at 23:38
  • @Joe - yes, this is a wrinkle, but it should still be possible to map these generally for the core of the language, without imported words or unusual forms. A statistically close answer should be possible even if an exact one isn't. – jwl Jun 21 '11 at 23:42
  • I really don’t think you can do this. What is the real purpose? Consider how the OED alone uses many, *many* post-ASCII characters. I bet there are tens of thousands of entries that have non-ASCII in them. There’s nothing I can imagine even starting to begin to maybe work that doesn’t involve N-gram analysis of **extremely** extensive corpora to develop a set of statistical models of varying confidence levels. Why are you trying to do this? What is the real goal? – tchrist Jun 22 '11 at 17:19
  • @tchrist - see my last paragraph regarding the real purpose. Yes, the OED contains non-ASCII characters, but I am sure that 90+% of the characters are ASCII; a typical Russian newspaper probably contains mostly Cyrillic, etc. Everyone seems to be latching onto the fact that it would be impossible to do this *exactly*; however, I am only asking for a summarization based on the core language. Nobody could really claim that it's impossible to identify that the CJK blocks are mostly used by Chinese, Japanese, or Korean, or that Arabic-language text mostly uses codepoints from the Arabic blocks... – jwl Jun 22 '11 at 17:40
  • Thank you for asking this question, I was also looking for such a thing. (I have the less noble cause that I'm tweaking XPenguins and want them to talk to each other with little speech bubbles, where each penguin speaks some random characters but from the same "language" :P) – mathematical.coffee Nov 03 '12 at 00:29

4 Answers

13

I got an answer from Unicode.org themselves! In the CLDR subproject there are documents, one for each language id, in which you can search for "exemplarCharacters":

<exemplarCharacters>[\u064B \u064C \u064D \u064E \u064F \u0650 \u0651 \u0652 ء آ أ ؤ إ ئ ا ب ت ة ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ى]</exemplarCharacters>
<exemplarCharacters type="auxiliary">[\u200C\u200D\u200E\u200F]</exemplarCharacters>
<exemplarCharacters type="currencySymbol" draft="contributed">[a b c d e f g h i j k l m n o p q r s t u v w x y z]</exemplarCharacters>
<exemplarCharacters type="index" draft="contributed">[ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي]</exemplarCharacters>

Or, there is this page: http://unicode.org/repos/cldr-tmp/trunk/diff/by_type/misc.exemplarCharacters.html with what looks like all of them. I will work on reshuffling this data into a langid -> blockid map of some kind, at which point I will probably award @borrible the "Answer" (rather than marking mine as the answer).
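To make the reshuffling concrete, here is a minimal Python sketch of the idea. It is only a sketch: it assumes the CLDR locale XML files (e.g. common/main/ar.xml) are downloaded locally, and the block table is abbreviated to a few illustrative rows (the real ranges come from the UCD's Blocks.txt):

```python
# Sketch: turn CLDR exemplarCharacters into a langid -> Unicode block map.
import re
import xml.etree.ElementTree as ET
from bisect import bisect_right

# Abbreviated (start, end, name) rows; the full table is the UCD's Blocks.txt.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x2000, 0x206F, "General Punctuation"),
]
STARTS = [b[0] for b in BLOCKS]

def block_of(ch):
    """Range lookup: which block does this character fall in?"""
    i = bisect_right(STARTS, ord(ch)) - 1
    if i >= 0 and BLOCKS[i][0] <= ord(ch) <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None

def exemplar_chars(path):
    """Collect the characters from every <exemplarCharacters> element."""
    chars = set()
    for elem in ET.parse(path).iter("exemplarCharacters"):
        text = elem.text or ""
        # Expand literal \uXXXX escapes, then drop the UnicodeSet brackets
        # and whitespace. (Simplification: ignores a-z ranges and {multi-
        # character} sequences that some locales use.)
        text = re.sub(r"\\u([0-9A-Fa-f]{4})",
                      lambda m: chr(int(m.group(1), 16)), text)
        chars.update(c for c in text if not c.isspace() and c not in "[]{}")
    return chars

def blocks_for_locale(path):
    blocks = set()
    for c in exemplar_chars(path):
        b = block_of(c)
        if b:
            blocks.add(b)
    return blocks

print(blocks_for_locale("common/main/ar.xml"))
# e.g. {'Arabic', 'Basic Latin', 'General Punctuation'}
```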

jwl
3

How about generating (approximate) data yourself? One example could be to use the different language Wikipedias: download enough data in each language, generate a list of the characters used in the documents with counts, and put in a threshold to get rid of small instances of borrowed text from other languages. It would be approximate, but possibly a good starting point.
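A rough Python sketch of that counting-and-thresholding step, assuming the plain text has already been extracted per language into files like samples/ru.txt (hypothetical paths), with an arbitrary 1% threshold and an abbreviated block table (the full ranges are in the UCD's Blocks.txt):

```python
# Count characters per Unicode block in a language sample, then keep only
# the blocks above a frequency threshold to filter out borrowed text.
from bisect import bisect_right
from collections import Counter

# Abbreviated (start, end, name) rows; the full table is the UCD's Blocks.txt.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0600, 0x06FF, "Arabic"),
]
STARTS = [b[0] for b in BLOCKS]

def block_of(ch):
    i = bisect_right(STARTS, ord(ch)) - 1
    if i >= 0 and BLOCKS[i][0] <= ord(ch) <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None

def dominant_blocks(path, threshold=0.01):
    """Return {block: share of characters} for blocks above the threshold."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            for ch in line:
                if not ch.isspace():
                    b = block_of(ch)
                    if b:
                        counts[b] += 1
    total = sum(counts.values()) or 1
    return {b: n / total for b, n in counts.items()
            if n / total >= threshold}

print(dominant_blocks("samples/ru.txt"))
# e.g. {'Cyrillic': 0.93, 'Basic Latin': 0.07}
```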

borrible
  • I am actually going to do almost exactly that unless someone can point me to something that exists already. – jwl Jun 22 '11 at 16:55
2

I don't think that CLDR's exemplarCharacters will give accurate results. You can find each character's Script property in the UCD's Scripts.txt and ScriptExtensions.txt files. For more, read about the Unicode Script property.

After you have the script, you can relate it to languages in CLDR using the languageData section of supplementalData.xml.
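Here is a rough Python sketch of that join, assuming Scripts.txt and supplementalData.xml are downloaded locally. One wrinkle: Scripts.txt uses long script names ("Arabic") while languageData uses ISO 15924 codes ("Arab"), so the abbreviated alias dict below stands in for the full mapping found in the UCD's PropertyValueAliases.txt:

```python
# Join UCD script ranges to CLDR's script -> language data.
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abbreviated long-name -> ISO 15924 aliases; the full mapping is in the
# UCD's PropertyValueAliases.txt.
ALIASES = {"Latin": "Latn", "Cyrillic": "Cyrl", "Arabic": "Arab"}

def script_ranges(path):
    """Parse Scripts.txt into {script code: [(start, end), ...]}."""
    ranges = defaultdict(list)
    row = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = row.match(line)  # skips comment and blank lines
            if m:
                start = int(m.group(1), 16)
                end = int(m.group(2), 16) if m.group(2) else start
                code = ALIASES.get(m.group(3), m.group(3))
                ranges[code].append((start, end))
    return ranges

def languages_by_script(path):
    """Parse supplementalData.xml's languageData into {script: {lang, ...}}."""
    langs = defaultdict(set)
    language_data = ET.parse(path).getroot().find("languageData")
    for lang in language_data.iter("language"):
        for script in (lang.get("scripts") or "").split():
            langs[script].add(lang.get("type"))
    return langs

ranges = script_ranges("Scripts.txt")
langs = languages_by_script("supplementalData.xml")
print(sorted(langs["Arab"])[:5])  # languages CLDR lists for the Arabic script
print(ranges["Arab"][:3])         # first few Arabic-script code point ranges
```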

Panos Theof
0

There is no such resource, and for a simple reason: Unicode code point assignments are language-independent. Each code point can therefore be used by multiple languages.

Of course, there are certain characters that map directly to one language, but in general each code point is meant to be shared. Therefore it does not make much sense to create code-point-to-language tables.

If you are looking for ways to detect a language, this is definitely not the way to go.

Paweł Dyda
  • Again, I realize all this, however it is obvious that some blocks tie to some specific languages or sets of languages (Arabic, Cyrillic, CJK...). Not for all blocks or code points, but at least some. So it seems reasonable to believe this should be documented somewhere – jwl Jun 22 '11 at 12:49
  • Cyrillic isn't a language, it's a script. Arabic is both a language and a script, but the script is used for many languages other than Arabic. I think your general enterprise will at best work for looking up scripts, not languages... – Kerrek SB Jun 23 '11 at 00:37
  • @larson4: That said, Unicode is divided into subranges. Some of them are contiguous, others (like Latin) are heavily fragmented, but you could in principle build a lookup table (like [this website](http://www.fileformat.info/info/unicode/char/search.htm) does) and associate the name of the subrange to each codepoint. – Kerrek SB Jun 23 '11 at 10:39
  • @Pawel Dyda you're technically wrong and right. People say "languages" but really mean scripts/writing systems when referencing the web. And as such, scripts are representable by Unicode ranges by definition. – Overflow2341313 Feb 25 '21 at 05:57