Is it possible to convert characters from one language to another language's characters using Unicode matching?

Question

I want to translate English text into language X. My plan is to first convert the English characters to their equivalent English Unicode values, then convert the English Unicode values to X Unicode values, and finally convert the X Unicode values to X characters. So I want to convert one language's Unicode values to the equivalent Unicode values of another language, using C or any other programming language.

For example, I want to convert the word "Linux" from English to Tamil: "லினக்ஸ்"

Unicode code points for "Linux": 004c 0069 006e 0075 0078

Is it possible to convert these English Unicode values to the equivalent Tamil Unicode values?
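To make the example concrete, here is a minimal Python sketch that lists the code points of both spellings. The helper name `code_points` is made up for illustration, and the Tamil string is written with escapes on the assumption that it matches the word shown in the question.

```python
def code_points(text):
    """Return each character's Unicode code point as a 4-digit hex string."""
    return [format(ord(ch), "04x") for ch in text]


english = "Linux"
# Assumed to be the Tamil spelling from the question, written as escapes.
tamil = "\u0bb2\u0bbf\u0ba9\u0b95\u0bcd\u0bb8\u0bcd"

print(code_points(english))  # ['004c', '0069', '006e', '0075', '0078']
print(code_points(tamil))    # ['0bb2', '0bbf', '0ba9', '0b95', '0bcd', '0bb8', '0bcd']
```

The English word has five code points and the Tamil one has seven, so the two sequences cannot be related by a fixed per-character offset.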

Unicode is Unicode. The standard has evolved over the years, but all languages share the same Unicode, that's the whole point of it. In the olden days IBM (for example) had different character sets for different languages. Unicode has replaced all that. — cdarke, Jan 23 '16 at 08:41
Maybe you are thinking of replacing single-byte characters (e.g. ASCII or ISO Latin 1) with multi-byte ones? In Python, see the `codecs` module; in C, see http://stackoverflow.com/questions/11576846/convert-ascii-string-to-unicode-windows-pure-c. If you use Python 3, Java, or C#, native strings are Unicode anyway. — cdarke, Jan 23 '16 at 08:49
It's unclear what you _really_ want to do. Can you give some example inputs and outputs? — PM 2Ring, Jan 23 '16 at 10:00
You seem to be looking for transliteration, but there is no single well-defined mapping from the features of one script to those of another. I'm not familiar with Tamil, but even languages using the same script often use incompatible conventions. For example, the English word *tape* has been loaned into Finnish as *teippi.* — tripleee, Jan 24 '16 at 10:57
The fact the question displays both English and Tamil *Linux* is what **Unicode** is about. — artless noise, Feb 06 '16 at 22:13
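As the comments suggest, what is being asked for is transliteration, and that needs a mapping someone has written down, not a calculation on code points. The following is only a toy sketch under that assumption: `WORD_TABLE` and `transliterate` are hypothetical names, and the single entry is the word pair from the question.

```python
# Hypothetical, hand-written table that maps whole words, not single letters.
WORD_TABLE = {
    # Assumed to be the Tamil spelling given in the question.
    "linux": "\u0bb2\u0bbf\u0ba9\u0b95\u0bcd\u0bb8\u0bcd",
}


def transliterate(word, table=WORD_TABLE):
    """Return the table entry for word, or the word unchanged if it is unknown."""
    return table.get(word.lower(), word)


result = transliterate("Linux")
print(result)
print(len("Linux"), len(result))  # 5 7: the code point counts differ
print(transliterate("Unix"))      # not in the table, so it comes back unchanged
```

A table like this only covers the words it contains. Anything more general needs language-specific rules, which is the point tripleee makes above.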

score 3 · Accepted Answer · answered Jan 23 '16 at 08:44

You can't do the step "convert English Unicode to x language Unicode". Unicode is a single standard in which every character from every language has its own unique code point, so there is no such thing as "English Unicode" or "x language Unicode". For example, if the letter "i" had the code point 0x2A (not its real code point, just for illustration), then 0x2A would always mean "i" in Unicode, regardless of language.
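A short Python sketch, using only the standard `unicodedata` module, may help illustrate this. It prints the code point and official name of a few characters, and shows that the byte encoding can vary while the code point does not. The helper name `describe` is made up for the example.

```python
import unicodedata


def describe(ch):
    """Return a 'U+XXXX NAME' description of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("i"))       # U+0069 LATIN SMALL LETTER I
print(describe("*"))       # U+002A ASTERISK
print(describe("\u0bb2"))  # U+0BB2 TAMIL LETTER LA

# The same code point can be stored as different bytes depending on the encoding.
print("\u0bb2".encode("utf-8"))      # b'\xe0\xae\xb2'
print("\u0bb2".encode("utf-16-be"))  # b'\x0b\xb2'
```

Each character keeps one code point no matter which language the surrounding text is in. Only the byte representation depends on the chosen encoding.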

Answered by Nikita

Forgive me, but 0x002A is `*`, or was this an unconscious use of the answer to life, the universe, and everything? – cdarke Jan 23 '16 at 08:54
As stated in brackets, 0x2A is "not a real code point" in my example with "i". Of course, since Unicode is compatible with ASCII, any number from 0 to 128 is a legal Unicode code point. But I got your point about "42", and yes, that was unconscious. :) – Nikita Jan 23 '16 at 08:58
I think it is correct to say that 0 to 255 in Unicode is ISO Latin 1 http://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT. 0 to **127** is ASCII, and yes, that is incredibly pedantic. – cdarke Jan 23 '16 at 09:02
Thank you, but can't correct that now, so consider "from 0 to 128(*non-inclusive*)" implied. ;) – Nikita Jan 23 '16 at 09:08
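The range claims in these comments can be checked with a small Python sketch that relies only on the built-in `latin-1` and `ascii` codecs. The function names are hypothetical.

```python
def latin1_matches_unicode():
    """Check that every byte 0..255 decodes via Latin-1 to the same code point."""
    all_bytes = bytes(range(256))
    text = all_bytes.decode("latin-1")
    return all(ord(ch) == b for ch, b in zip(text, all_bytes))


def is_ascii_byte(value):
    """Return True if the single byte value decodes as ASCII."""
    try:
        bytes([value]).decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


print(latin1_matches_unicode())  # True
print(is_ascii_byte(127))        # True
print(is_ascii_byte(128))        # False
```

This agrees with the correction above: ASCII covers 0 to 127 inclusive, and ISO Latin 1 lines up with Unicode code points 0 to 255.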
