How to translate a text to the reduced encoding?

Question

I need to convert a Unicode string written in a Latin-based script to a reduced encoding. Some loss of information is expected; the goal is to keep the result as human-readable as possible.

The reduced encoding is prescribed as the "Level A character set" for EDIFACT messages. It uses only the capital letters A to Z, digits, and some non-alphanumeric characters. To be more explicit, consider the following parts of postal addresses. The left column contains the original text; the right column shows the desired result:

Karaağaç Mahallesi   ... KARAAGAC MAHALLESI 
Çerkezköy/Tekirkag   ... CERKEZKOY/TEKIRKAG
Mělník               ... MELNIK
Środa Śląska         ... SRODA SLASKA
Strada Henri Coandă  ... STRADA HENRI COANDA
Villalonquéjar       ... VILLALONQUEJAR

Any character that cannot be mapped (or that is not yet part of the translation table because it was forgotten) would be replaced by a question mark.

I am aware that some accented or special characters can be transcribed (for example, Straße to STRASSE). That is not my goal right now, although it may be in the future.

Calling the string's .ToUpper() method solves half of the problem. I can then use a translation table to map each accented character to the similar character without the accent.

The problem is that the texts (postal addresses) may come from many countries that use various accented or compound Latin characters, and I do not know all of those characters. Is there an information source that lists the Latin letters outside the ASCII set?

How would you do that?

PowerShell? C#? C++? Java? - please add appropriate tag. Please [edit] your question to provide a [mcve]. See **1,742 results** at https://stackoverflow.com/search?q=remove+accents — JosefZ, Mar 30 '23 at 13:59
@JosefZ: Thanks for the hints. I am new to the problem; so, I had to find the starting point. Please, formulate the answer around `NormalizationForm.FormD`, and I will happily accept it. — pepr, Apr 02 '23 at 21:43

score 1 · Accepted Answer · answered Apr 03 '23 at 17:00

Built as a C# console app under Visual Studio 2019:

using System;
using System.Text;
using System.Text.RegularExpressions;

namespace remove_accents
{
    class Program
    {
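        /* answered Jun 21, 2022 at 18:48 by Joshua Barker
           https://stackoverflow.com/a/72705782/3439404 */
        public static class StringExtension
        {
            // Compiled once and reused; \p{Mn} matches nonspacing (combining) marks.
            private static readonly Regex CombiningMarks =
                new Regex(@"\p{Mn}", RegexOptions.Compiled);

            // Decomposes the text (NFD) so accents become separate combining
            // marks, then removes those marks. Letters without a canonical
            // decomposition (ß, ł, ø, ...) are left unchanged.
            public static string RemoveDiacritics(string Text)
            {
                return CombiningMarks.Replace(
                    Text.Normalize(NormalizationForm.FormD),
                    string.Empty);
            }
        }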

        static void Main(string[] args)
        {
            Console.OutputEncoding = Encoding.UTF8;
            Console.WriteLine("({0} line arguments)", args.Length);
            int ii = 0;
            foreach (string arg in args)
            {
                Console.WriteLine("Arg{0} = {1} => {2}",
                   ii,
                   arg,
                   StringExtension.RemoveDiacritics(arg));
                ii++;
            }
            Console.WriteLine();

        }
    }
}

Output of `remove_accents.exe Karaağaç Šárčin Mělník Straße`:

(4 line arguments)
Arg0 = Karaağaç => Karaagac
Arg1 = Šárčin => Sarcin
Arg2 = Mělník => Melnik
Arg3 = Straße => Straße
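
Note that the output above keeps the original letter case and leaves `ß` untouched, so it is not yet restricted to the reduced character set from the question. The following is a minimal sketch of one way to go the rest of the way, not a definitive implementation. The class `LevelAText`, the method `ToLevelA`, and the contents of both tables are hypothetical names and values chosen for illustration. In particular, the list of allowed punctuation is an assumption and should be checked against the EDIFACT specification. The idea is to uppercase, decompose with `NormalizationForm.FormD`, drop the combining marks, look up the few letters that have no decomposition in a small table, and replace everything else with a question mark.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class LevelAText
{
    // ASSUMPTION: characters allowed besides A-Z and 0-9.
    // Verify this list against the EDIFACT specification before relying on it.
    private const string AllowedOther = " .,-()/='+:?!\"%&*;<>";

    // Illustrative (incomplete) table for letters that have no canonical
    // decomposition and therefore survive FormD normalization unchanged.
    private static readonly Dictionary<char, string> Special =
        new Dictionary<char, string>
        {
            { 'ß', "SS" }, { 'Ł', "L" }, { 'Ø', "O" }, { 'Đ', "D" }, { 'Æ', "AE" }
        };

    public static string ToLevelA(string text)
    {
        // Culture-independent uppercasing, then decomposition so that each
        // accent becomes a separate combining mark after its base letter.
        string decomposed = text.ToUpperInvariant()
                                .Normalize(NormalizationForm.FormD);
        var result = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Drop the combining marks (the accents themselves).
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            bool allowed = (c >= 'A' && c <= 'Z')
                        || (c >= '0' && c <= '9')
                        || AllowedOther.IndexOf(c) >= 0;

            string replacement;
            if (allowed)
                result.Append(c);
            else if (Special.TryGetValue(c, out replacement))
                result.Append(replacement);
            else
                result.Append('?'); // unknown character, as the question requires
        }
        return result.ToString();
    }
}
```

With this sketch, `LevelAText.ToLevelA("Środa Śląska")` gives `SRODA SLASKA` and `LevelAText.ToLevelA("Straße")` gives `STRASSE`. A character that is neither allowed nor in the table, such as `€`, comes out as `?`. The table is where country-specific letters would be added as they are discovered.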
