Romanization of Unicode text

Question

I am looking for a way to transliterate Unicode letter characters from any language into accented Latin letters. The intent is to allow foreigners to gain insight into the pronunciation of names and words written in any non-Latin script.

Examples:

Greek: Romanize("Αλφαβητικός") returns "Alphabētikós" (or "Alfavi̱tikós")

Japanese: Romanize("しんばし") returns "shimbashi" (or "sinbasi")

Russian: Romanize("яйца Фаберже") returns "yaytsa Faberzhe" (or "jajca Faberže")

It should ideally support characters in the following scripts: CJK, Indic, Cyrillic, Semitic, and Greek. It should be data-driven and extensible, using data from the Unicode Consortium, the USA, the EU, or the UN. The code should be open source and written in .NET or Java.

Does such a library exist?

I'm looking for something like Google Maps transliteration of place names, which uses ICU transforms. Wish Google would open-source that code. (http://research.google.com/pubs/pub36450.html and http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/papers/36450.pdf) — Anthony Faull, Mar 23 '12 at 20:34
I would think this operation is also locale-specific. Welsh and Pinyin use the same characters but probably Romanize differently :-) — wberry, Mar 24 '12 at 00:57
@wberry: Welsh uses the Latin script natively, and Pinyin is already romanized Chinese. — Michael Borgwardt, Mar 24 '12 at 10:51
Yes, but when you see the Hanzi for 'George Bush' you'd like to get 'George Bush' back. — bmargulies, Mar 24 '12 at 19:31
No... If you're looking for 乔治·布什 to give "George Bush", you're not looking at transliteration anymore but translation. The accepted transliteration for this would be "qiáozhì bùshí". As the original poster mentioned "gain insights into the pronunciation of names", I don't think returning "George Bush" helps at all, the Chinese pronunciation is "qiáozhì bùshí". — Sprachprofi, Mar 24 '12 at 20:18
Yes, and the meaning of `q` for pronunciation is quite different in Pinyin than in American English. Similarly `w` means something totally different in Welsh than in British English. That's what I meant by Romanization being locale-specific. They are very different uses of the same Latin characters. — wberry, Mar 26 '12 at 17:31
The examples are from http://cldr.unicode.org/index/cldr-spec/transliteration-guidelines. — Kirill Bulygin, Aug 18 '18 at 15:12

Sprachprofi · Answer 1 · 2012-03-24T11:00:23.033

19

The problem is a lot more complex than you think.

Greek, Cyrillic, Indic scripts, Georgian -> trivial; you could program that in an hour.
Thai, Japanese Kana -> doable with a bit more effort.
Japanese Kanji, Chinese -> these are not alphabets or syllabaries, so you're not in fact transliterating; you're looking up the pronunciation of each symbol in a hopefully large dictionary (EDICT and CCDICT should work), and a lot of the time you'll get it wrong unless you also consider the context, especially in Japanese.
Korean -> technically an alphabet, but computers can only handle the composed characters, so you need another large database; I'm not aware of any.
Arabic, Hebrew -> these languages don't write down short vowels, so a lot of the time your transliteration will be something unreadable like "bytlhm" (Bethlehem). I'm not aware of any large databases that map Arabic or Hebrew words to their pronunciation.
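
A side note on the Korean item, offered as a sketch rather than a full solution: the Unicode Standard defines precomposed Hangul syllables arithmetically (19 leading consonants x 21 vowels x 28 trailing slots, starting at U+AC00), so a syllable can be split into its letters without a dictionary. The Ruby below uses the Unicode jamo short names, lowercased, as the Latin output. The names `spell_hangul`, `LEADS`, `VOWELS` and `TAILS` are made up for this example. It spells letters only; it applies no pronunciation rules and follows no official romanization scheme.

```ruby
# Unicode jamo short names in code-chart order, lowercased:
# 19 leads, 21 vowels, 28 tails (the first tail means "no final consonant").
LEADS  = %w[g gg n d dd r m b bb s ss] + [""] + %w[j jj c k t p h]
VOWELS = %w[a ae ya yae eo e yeo ye o wa wae oe yo u weo we wi yu eu yi i]
TAILS  = [""] + %w[g gg gs n nj nh d l lg lm lb ls lt lp lh m b bs s ss ng j c k t p h]

S_BASE    = 0xAC00                                   # first precomposed syllable
SYLLABLES = LEADS.size * VOWELS.size * TAILS.size    # 11,172 syllables in the block

# Spell each precomposed Hangul syllable letter by letter; pass other characters through.
def spell_hangul(text)
  text.each_char.map do |ch|
    index = ch.ord - S_BASE
    next ch unless index.between?(0, SYLLABLES - 1)

    lead, rest  = index.divmod(VOWELS.size * TAILS.size)
    vowel, tail = rest.divmod(TAILS.size)
    LEADS[lead] + VOWELS[vowel] + TAILS[tail]
  end.join
end

puts spell_hangul("한글")  # prints "hangeul"
```

Turning this letter-level spelling into something that reflects actual pronunciation would still need context rules on top.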

edited Mar 24 '12 at 11:00

answered Mar 24 '12 at 10:44

Sprachprofi


He didn't ask for Arabic or Hebrew. – bmargulies Mar 24 '12 at 19:21
8

Actually he did. "It should ideally support characters in the following scripts: CJK, Indic, Cyrillic, Semitic, and Greek" --> Arabic and Hebrew are the most widely-spoken Semitic languages. – Sprachprofi Mar 24 '12 at 19:54
3

@Sprachprofi: Referring to your "Greek, Cyrillic, Indic scripts, Georgian -> trivial": you're naïve to think you could do even one in an hour. Yes, you can just map every Cyrillic/Greek (and whatever Indic is supposed to be) character to a corresponding (set of) Latin character(s). But that is transliteration, NOT ROMANIZATION. Depending on the preceding and following characters, you'll have to implement rules for how each one is romanized. This is an order of magnitude more difficult than simply transliterating letters. Also, you're gonna need longer than 1 hour for the transliteration if done properly. – Stefan Steiger Dec 16 '16 at 10:50
@Sprachprofi: Also, note that there are multiple ISO-norms for romanization/transliteration of cyrillic. German cyrillic-latin transliteration differs from English cyrillic-latin transliteration (e.g. Volga vs. Wolga). Doing a "library" that "transliterates" cyrillic on multiple different ISO-standards alone is much work. Doing proper romanization (using multiple different algorithms for multiple transliterations) is probably going to blow-up on costs in a commercial setting in a free-market economy. In other words, this is anything but trivial, if done properly. CrapIsAlwaysFree andAbundant. – Stefan Steiger Dec 16 '16 at 10:59
@Sprachprofi: In programming, pretty much nothing is trivial. Even a properly working "hello world" program is anything but trivial, because text can be written in one of 6'500 spoken languages, some of which don't have writing (or use runes), and some of which have strange glyphs arranged right-to-left, bottom-to-top in a console program. However, even in RTL languages, Arabic numbers are arranged left-to-right. And that still leaves any sci-fi author space to invent their own language that writes text in a diagonal, right-to-left, bottom-top, for a hollywood movie. Trivial ? Doesn't exist. – Stefan Steiger Dec 16 '16 at 11:04
@StefanSteiger I have an MA in computational linguistics, I've coded this kind of function for a variety of languages before, and yes I'm talking about romanization, not letter-to-letter transliteration. Letter-to-letter transliteration can be coded in 5 minutes in Ruby using tr(), so that gives you 55 minutes for the adjustments to turn it into proper romanization. – Sprachprofi Nov 02 '21 at 09:59
@Sprachprofi: Can I see that code ? Can you post it (any programming language welcome). I'm really interested, because I tried romanizing cyrillic first-names, and that caused me a lot of headache prior to giving up and just using a hard-coded key-value lookup (fortunately, there is not that much of variety in the selection of first-names). – Stefan Steiger Nov 03 '21 at 15:00
@StefanSteiger This is the Ruby code for the official romanisation of names in Russian passports:

def romanise(russian_name)
  ru_single = "АБВГДЕЁЗИЙКЛМНОПРСТУФЫЭабвгдеёзийклмнопрстуфыэ"
  la_single = "ABVGDEEZIIKLMNOPRSTUFYEabvgdeeziiklmnoprstufye"
  ru_double = %w(Ж ж Х х Ц Ч Ш Щ Ъ ц ч ш щ ъ Ю Я ю я)
  la_double = %w(Zh zh Kh kh Ts Ch Sh Shch Ie ts ch sh shch ie Iu Ia iu ia)
  s = russian_name.tr(ru_single, la_single)
  ru_double.each_with_index do |letter, i|
    s.gsub!(letter, la_double[i])
  end
  s.gsub!(/[Ьь]/, '')
  s
end

– Sprachprofi Nov 04 '21 at 15:31
Couldn't get it to display the code correctly while staying within the character limit, so try it here: https://replit.com/join/lcxvyxxsrq-judithmeyer – Sprachprofi Nov 04 '21 at 15:37
@Sprachprofi: My first example: Мария => Mariia - that is wrong. Also Aleksandr is wrong, that should be Alexander – Stefan Steiger Nov 05 '21 at 08:30
@StefanSteiger You are not talking about the official post-2013 system of romanisation (for Russian passports and other ID documents) then but about European equivalent names. The equivalent names can not be automatised, e.g. in Greek the name Ιωάννης is regularly rendered as Ioannis, Giannis, Yannis, or John - you cannot presume, you have to ask people how they want to be called in English. The official romanisation in the passport will always say Ioannis though. – Sprachprofi Nov 05 '21 at 10:27
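
The context-dependence argued over in these comments can be illustrated with the question's own kana example. The Ruby below is only a sketch: the two tables cover just the three kana in しんばし, and the names `romanize_kana`, `HEPBURN`, `KUNREI` and the `assimilate_n` flag are hypothetical, not taken from any library. It shows one scheme-specific table lookup plus one context rule (traditional Hepburn writes ん as "m" before b, m or p).

```ruby
# Tiny illustrative tables: only the kana needed for the example word.
HEPBURN = { "し" => "shi", "ん" => "n", "ば" => "ba" }.freeze
KUNREI  = { "し" => "si",  "ん" => "n", "ば" => "ba" }.freeze

# Look up each kana in the given table, optionally applying the n -> m context rule.
def romanize_kana(text, table, assimilate_n: false)
  syllables = text.each_char.map { |ch| [ch, table.fetch(ch, ch)] }

  syllables.each_with_index.map do |(kana, latin), i|
    following = syllables[i + 1]&.last.to_s   # romanized next syllable, "" at the end
    if assimilate_n && kana == "ん" && following.start_with?("b", "m", "p")
      "m"
    else
      latin
    end
  end.join
end

puts romanize_kana("しんばし", HEPBURN, assimilate_n: true)  # prints "shimbashi"
puts romanize_kana("しんばし", KUNREI)                       # prints "sinbasi"
```

The same input yields different Latin spellings depending on the table and the rule set chosen, which is why a library has to let the caller pick the scheme.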

score 11 · Accepted Answer · edited May 14 '19 at 15:20

11

You can use Unidecode Sharp:

[a C#] port of Python Unidecode, which is itself a port of Perl Unidecode (PHP and Ruby implementations are also available).

Usage:

// Unidecode() is an extension method on string; it is only visible with this using directive.
using BinaryAnalysis.UnidecodeSharp;

// ...

string _Greek = "Αλφαβητικός";
MessageBox.Show(_Greek.Unidecode());

string _Japan = "しんばし";
MessageBox.Show(_Japan.Unidecode());

string _Russian = "яйца Фаберже";
MessageBox.Show(_Russian.Unidecode());

I hope this helps.

edited May 14 '19 at 15:20

Skippy le Grand Gourou


answered Mar 01 '13 at 17:04

Kerberos


1

+1, and I just want to note that there are ports of the library to Python and Perl – Igor Chubin Apr 03 '14 at 08:12
Thanks, I downloaded the dll but Unidecode() was still not being recognized on any string. I did not know I had to add this BinaryAnalysis using... – Veverke Sep 21 '15 at 17:29

bmargulies · Answer 3 · 2012-03-24T19:22:05.203

6

I am unaware of any open source solution here beyond ICU. If ICU works for you, great. If not, note that I am the CTO of a company that sells a commercial product for this purpose that can deal with the icky cases like Chinese words, Japanese multiple readings, and Arabic incomplete orthography.

edited Mar 24 '12 at 19:22

answered Mar 23 '12 at 16:26

bmargulies


@bmargulies What exactly is that product? And does it offer a .NET API? – 41686d6564 stands w. Palestine May 14 '19 at 03:51
The place you want to look is www.basistech.com; yes they support .net. – bmargulies May 14 '19 at 17:38

score 5 · Answer 4 · answered Mar 23 '12 at 19:38

5

The Unicode Common Locale Data Repository has some transliteration mappings you could use.
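
As a rough illustration of what "data-driven" could mean here, the Ruby sketch below keeps the mapping rules as plain text and loads them at run time, so a new script is supported by adding data rather than code. The rule format, the handful of Greek rules and the names `GREEK_RULES`, `load_rules` and `apply_rules` are invented for this sketch; this is not CLDR's transform rule syntax and not real CLDR data.

```ruby
# Hypothetical rule data: one "source => target" pair per line.
# Only the letters of the question's Greek example are covered.
GREEK_RULES = <<~RULES
  Α => A
  α => a
  β => b
  η => ē
  ι => i
  κ => k
  λ => l
  ό => ó
  ς => s
  τ => t
  φ => ph
RULES

# Parse the rule text into a lookup table (normalized to NFC so that
# precomposed and decomposed accents compare equal).
def load_rules(text)
  text.unicode_normalize(:nfc).each_line.map { |line| line.strip.split(" => ", 2) }.to_h
end

# Replace every character that has a rule; leave everything else unchanged.
def apply_rules(table, input)
  input.unicode_normalize(:nfc).each_char.map { |ch| table.fetch(ch, ch) }.join
end

table = load_rules(GREEK_RULES)
puts apply_rules(table, "Αλφαβητικός")  # prints "Alphabētikós"
```

A one-character lookup like this is the simplest case; real transform rules also need multi-character matches and context conditions.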

answered Mar 23 '12 at 19:38

dan04


score 1 · Answer 5 · answered Aug 08 '23 at 06:01

AnyAscii

AnyAscii could also be helpful in your case, as it performs conventional romanization.
It has a web demo as well, and a published mapping table.

AnyAscii gives better results [compared to Unidecode], supports more than twice as many characters, and often has a smaller file size.

Usage

using AnyAscii;

Console.WriteLine("Αλφαβητικός".Transliterate());
// Alfavitikos

Console.WriteLine("しんばし".Transliterate());
// shinbashi

Console.WriteLine("яйца Фаберже".Transliterate());
// yaytsa Faberzhe
