I don't think there is a simple way to do this. There is probably no universal normalisation (even if you limit it to a group of European languages).
All solutions involve some manual work:
- RegEx: it should be possible, but a regular expression that does the whole job would be incredibly complicated.
- There is (or at least was) a plug-in for Total Commander for transliteration. But the plug-in is/was buggy/unstable and you need to write the transliteration table manually.
- "Manual transliteration".
I have a similar problem with file names, but in my case the file names contain Japanese characters, which makes the transliteration a little harder.
To simplify your solution you can use the code page conversions in Windows.
It would be nice if converting straight to ASCII (7-bit) did the job, but it does not: every non-ASCII character just becomes '?'.
The following example handles at least some of the characters by going through intermediate code pages first:
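    // Requires: using System; using System.Text;
    // On .NET Core / .NET 5+, code pages 1250 and 1252 are only available after
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    Encoding encoding;
    string data = "Čeština, øðrum";
    // Round-trip through the Central European code page (1250).
    encoding = Encoding.GetEncoding(1250);
    data = encoding.GetString(encoding.GetBytes(data)); // "Čeština, o?rum"
    // Round-trip through the Western European code page (1252).
    encoding = Encoding.GetEncoding(1252);
    data = encoding.GetString(encoding.GetBytes(data)); // "Ceština, o?rum"
    // Finally 7-bit ASCII: whatever is still non-ASCII becomes '?'.
    encoding = Encoding.ASCII;
    data = encoding.GetString(encoding.GetBytes(data));
    Console.WriteLine(data); // "Ce?tina, o?rum"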
It is not perfect, but at least it clears some of the unwanted characters without needing a substitution dictionary.
You can try adding other code pages (perhaps the Greek code page would fix the "μ" problem, but it would probably remove all the other characters).
After these initial conversions you can search the transformed text for '?' characters and check whether the original text also has a '?' there. If it does not, the character was lost in the conversion, and you can use a substitution dictionary for that character.
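The check for lost characters could look roughly like this. It is only a sketch: `LostCharacters.Find` is a name I made up, and it assumes the converted string has the same length as the original (one character in, one character out), which may not hold for characters outside the Basic Multilingual Plane.

```csharp
using System;
using System.Collections.Generic;

static class LostCharacters
{
    // Returns the original characters that became '?' during conversion.
    // Assumption: both strings have the same length, so positions line up.
    public static List<char> Find(string original, string converted)
    {
        if (original.Length != converted.Length)
            throw new ArgumentException("Strings must have the same length.");

        var lost = new List<char>();
        for (int i = 0; i < converted.Length; i++)
        {
            // A '?' that was already in the source is not a conversion failure.
            if (converted[i] == '?' && original[i] != '?')
                lost.Add(original[i]);
        }
        return lost;
    }
}
```

For the example strings above, `LostCharacters.Find("Čeština, øðrum", "Ce?tina, o?rum")` returns 'š' and 'ð', which are the characters that still need a dictionary entry.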
In my project I use a substitution dictionary (updated manually at runtime by the user for unknown words). If all your transliterations map single characters, you do not need anything special. But if there are cases like "ßs" --> "ss" (rather than 'ß' + 's' = "ss" + 's' = "sss"), you need a sorted list of multi-character substitutions that is processed before the single-character substitutions. The list should be sorted by string length (longest first), not alphabetically.
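A minimal sketch of such a longest-first substitution, assuming the whole table fits into one dictionary of string keys. `Transliterator.Apply` and the sample entries are hypothetical, not the code from my project.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

static class Transliterator
{
    // Replaces every occurrence of a table key with its value,
    // always preferring the longest key that matches at the current position.
    public static string Apply(string input, IDictionary<string, string> table)
    {
        // Sort by length (longest first), not alphabetically.
        List<string> keys = table.Keys
            .Where(k => k.Length > 0)
            .OrderByDescending(k => k.Length)
            .ToList();

        var result = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            bool replaced = false;
            foreach (string key in keys)
            {
                if (input.Length - i >= key.Length &&
                    string.CompareOrdinal(input, i, key, 0, key.Length) == 0)
                {
                    result.Append(table[key]);
                    i += key.Length;
                    replaced = true;
                    break;
                }
            }
            if (!replaced)
            {
                // No entry for this character: copy it unchanged.
                result.Append(input[i]);
                i++;
            }
        }
        return result.ToString();
    }
}
```

With the entries "ßs" --> "ss", "ß" --> "ss" and "ø" --> "o", `Apply("Maßstab", table)` gives "Masstab" (the two-character rule wins), while `Apply("Fuß", table)` gives "Fuss".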
Remarks:
In your case there is probably no problem of ambiguous transcription (明日 = "ashita" or "asu", or perhaps a different word depending on the surrounding characters), but you should check whether that really is so.
In my project I found out that some programs store file names with the wrong encoding. The downloader gets the correct file name in UTF-8, but the sequence of bytes is then interpreted as Encoding.Default (or "Encoding.DOS" [symbolic name], or another code page for zipped files). It is therefore a good idea to test the file names for this type of error.
See how to test for invalid file name encoding:
https://stackoverflow.com/a/19068371/2826535
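One possible shape for such a test, as a sketch only (it is not necessarily what the linked answer does, and `MojibakeCheck.TryRepair` is a made-up name): encode the suspicious name back to bytes with the code page you suspect was applied by mistake, then see whether those bytes are valid UTF-8. The demo uses ISO-8859-1 because it is available without registering extra encoding providers; in practice you would pass whatever code page you suspect.

```csharp
using System;
using System.Text;

static class MojibakeCheck
{
    // Tries to undo "UTF-8 bytes decoded with the wrong code page".
    // Returns true only if the recovered bytes are valid, non-ASCII UTF-8.
    public static bool TryRepair(string name, Encoding assumedWrong, out string repaired)
    {
        repaired = name;
        byte[] bytes = assumedWrong.GetBytes(name);

        // The name must survive a round trip, otherwise the bytes are not the original ones.
        if (assumedWrong.GetString(bytes) != name)
            return false;

        // Pure ASCII is identical in both encodings, so there is nothing to repair.
        if (Array.TrueForAll(bytes, b => b < 0x80))
            return false;

        // A strict decoder throws on invalid UTF-8 instead of inserting U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            repaired = strictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            repaired = name;
            return false;
        }
    }
}
```

For example, with `Encoding.GetEncoding("iso-8859-1")` the name "Ã©" is repaired to "é", while a correctly stored "résumé" is rejected because its bytes are not valid UTF-8. A positive result is only a strong hint, since some legitimate names can also happen to form valid UTF-8.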
Just to complete the answer, the Unicode-normalisation-based "remove accents" method:
https://stackoverflow.com/a/3288164/2826535
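A minimal sketch of the commonly used form of this technique: decompose to Normalization Form D, drop the combining marks, and recompose. The name `Accents.Remove` is illustrative, and the code is not necessarily identical to the linked answer.

```csharp
using System;
using System.Globalization;
using System.Text;

static class Accents
{
    // Strips combining diacritical marks; base letters are kept.
    public static string Remove(string text)
    {
        // FormD splits e.g. 'Č' into 'C' followed by a combining caron.
        string decomposed = text.Normalize(NormalizationForm.FormD);

        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }

        // Recompose whatever is left.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

`Accents.Remove("Čeština, øðrum")` gives "Cestina, øðrum": 'ø' and 'ð' have no decomposition into a base letter plus accent, so they still need a substitution dictionary.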