I need to convert Unicode strings written in Latin-based scripts to a reduced character set. Some loss of information is expected; the goal is to keep the result as human-readable as possible.
The reduced encoding is prescribed as the "Level A character set" for EDIFACT messages. It uses only the capital letters A to Z, digits, and some non-alphanumeric characters. To be more explicit, consider the following parts of postal addresses. The left column contains the original text; the right column shows the expected result:
Karaağaç Mahallesi ... KARAAGAC MAHALLESI
Çerkezköy/Tekirkag ... CERKEZKOY/TEKIRKAG
Mělník ... MELNIK
Środa Śląska ... SRODA SLASKA
Strada Henri Coandă ... STRADA HENRI COANDA
Villalonquéjar ... VILLALONQUEJAR
Any character that cannot be converted (or that is not yet in the translation table because I forgot it) should be replaced by a question mark.
I am aware that some accented or special characters can be transcribed rather than simply stripped of their accents (like Straße to STRASSE). That is not my goal right now (it may be in the future).
Calling the string's .ToUpper() method solves half of the problem. Then I can use a translation table to map each accented character to the similar character without the accent.
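To make the idea concrete, here is a minimal sketch of the upper-case-then-table approach. It is written in Python only for illustration. The names TABLE, ALLOWED and to_level_a are hypothetical, the table covers just the letters from the examples above, and ALLOWED is an assumed subset of the permitted characters that would have to be checked against the EDIFACT specification.

```python
import unicodedata

# Hypothetical, deliberately incomplete table: accented capital -> plain capital.
TABLE = {
    "Ğ": "G", "Ç": "C", "Ö": "O", "Ě": "E", "Í": "I",
    "Ś": "S", "Ą": "A", "Ă": "A", "É": "E",
}

# Assumed subset of the characters that may pass through unchanged;
# the real list must be taken from the EDIFACT specification.
ALLOWED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,-()/=")


def to_level_a(text):
    # Upper-case first, then compose to NFC so that "letter + combining mark"
    # sequences match the precomposed keys in TABLE.
    upper = unicodedata.normalize("NFC", text.upper())
    result = []
    for ch in upper:
        ch = TABLE.get(ch, ch)  # strip the accent if the letter is known
        result.append(ch if ch in ALLOWED else "?")  # unknown -> question mark
    return "".join(result)


print(to_level_a("Środa Śląska"))
print(to_level_a("Łódź"))
```

The second call shows the weakness: any letter missing from the table turns into a question mark, which is why I need a complete list.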
The problem is that the texts (postal addresses) may come from many countries that use some kind of accented or compound Latin characters, and I do not know all such characters. Is there an information source that lists the Latin letters outside the ASCII set?
How would you do that?