
I am facing a problem when searching on filenames that contain Unicode characters. Those files may have correct or altered names (with equivalent ASCII characters substituted). I would like to write some code that finds files using the same words, altered or not, possibly with an incoherent mix of cultures inside the same string. To keep it simple, I only need to handle strings in European languages.

Equivalence examples:

Ɛpsilon <=> epsilon
København <=> Kobenhavn
Ångström <=> Angstrom
El Niño <=> El Nino
Tiếng Việt <=> Tieng Viet
Čeština <=> Cestina
encyklopædi <=> encyklopaedi
Expediția <=> Expeditia
øðrum <=> odrum
œuf <=> oeuf
μ (\u03bc) <=> µ (\u00b5)
Straße <=> Strasse

I have already found some answers to similar questions, but they are based on simpler strings (where removing accents is enough, using Unicode normalization and dropping diacritics), or are "do it yourself" solutions.

How to compare Unicode characters that "look alike"?

How to convert a Unicode character to its ASCII equivalent

Replacing characters in C# (ascii)

Unfortunately, Unicode normalization (the automatic way) does not work on at least the following characters:

Ɛ ø ð => missing equivalence
æ œ ß => missing expansion
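A minimal sketch of the usual approach (FormD decomposition, then dropping combining marks) shows the gap; the characters above survive the round trip untouched:

```csharp
using System;
using System.Globalization;
using System.Text;

// Standard "remove accents" approach: decompose (NFD), drop combining marks.
static string StripDiacritics(string s)
{
    var sb = new StringBuilder();
    foreach (char c in s.Normalize(NormalizationForm.FormD))
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            sb.Append(c);
    return sb.ToString();
}

Console.WriteLine(StripDiacritics("Ångström")); // "Angstrom" – decomposable, works
Console.WriteLine(StripDiacritics("øðrum"));    // "øðrum" – ø and ð have no decomposition
Console.WriteLine(StripDiacritics("œuf"));      // "œuf" – œ is not expanded to "oe"
```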

Is there a function/library to achieve this in C#, other than manually converting each 'well known' character myself?

  • A small note on why there is no universal transcription function. Let's say that 'Ɛ' → 'E' (similar reading, similar appearance). But what about 'μ'? Its reading is 'm', its appearance 'u'. And your example converts 'œ' to "oe" (double appearance), but Windows converts it to 'o' (the first/basic character?). What about 'ä'? It can be written as 'ae' in German, but not in any other language. And what about Russian and other country-specific characters? Almost every country has different transcription rules for foreign characters. – Julo Jan 08 '16 at 07:53
  • In fact what I expected is some kind of projection function p(x) (as in math) to convert any string to simple alphanumeric characters, so that for two strings which may be altered, p(s1) and p(s2) are similar even if s1 and s2 are not. It looks like there is no global solution; I must use at least two steps: first ask a user for an equivalence when special characters are found, then compare the strings. I can store 'well known' equivalences so the table grows over time. – Torben Hendersen Jan 08 '16 at 13:46
  • I am facing the same problem when extracting Asian zip files on a PC set to English by default. The filenames come out invalid. I'd like to automatically convert them to something as useful as possible. What you have here comes pretty close. Did you find a solution by now whose implementation you can share? A start I found here https://stackoverflow.com/a/3288164/708262, but it does not do anything for Asian characters. – Ingmar Feb 27 '20 at 19:12

1 Answer


I don't think there is a simple way to do this. There is probably no universal normalisation (even when you limit it to the group of European languages).

All solutions involve manual work:

  1. RegEx - It should be possible, but an expression that would do the whole job would be incredibly unwieldy.
  2. There is (or at least was) a transliteration plug-in for Total Commander, but it is/was buggy/unstable, and you need to write the transliteration table manually.
  3. "Manual transliteration".

I have a similar problem with file names, but in my case the file names contain Japanese characters, which makes the translation/transliteration a little harder.

To simplify your solution you can use the Windows code-page conversions. It would be nice if converting to ASCII (7-bit) did the job, but it does not: that produces only '?' characters.

This example handles some of the characters:

  using System;
  using System.Text;

  Encoding encoding;
  string data = "Čeština, øðrum";

  // Note: on .NET Core/.NET 5+, the Windows code pages are not available by default;
  // call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first
  // (System.Text.Encoding.CodePages package).
  encoding = Encoding.GetEncoding(1250);
  data = encoding.GetString(encoding.GetBytes(data)); // "Čeština, o?rum"
  encoding = Encoding.GetEncoding(1252);
  data = encoding.GetString(encoding.GetBytes(data)); // "Ceština, o?rum"
  encoding = Encoding.ASCII;
  data = encoding.GetString(encoding.GetBytes(data));
  Console.WriteLine(data); // "Ce?tina, o?rum"

It is not perfect, but at least it clears some of the unwanted characters without needing a substitution dictionary. You can try adding other code pages (perhaps a Greek code page would fix the "μ" problem, but it would probably remove most of the other characters).

After these initial conversions, you can search the transformed text for '?' characters and check whether the original/source has a '?' at the same position. When it does not, you can fall back to a substitution dictionary for the given character.
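That scan can be sketched like this (`FindUnmappedChars` is a hypothetical helper name; the single-byte code-page round trips preserve string length, each unmapped character becoming a single '?'):

```csharp
using System;
using System.Collections.Generic;

// Positions where the transformed text has '?' but the source does not
// are characters the code-page round trip could not map.
static List<char> FindUnmappedChars(string source, string transformed)
{
    var unmapped = new List<char>();
    for (int i = 0; i < source.Length && i < transformed.Length; i++)
        if (transformed[i] == '?' && source[i] != '?')
            unmapped.Add(source[i]);
    return unmapped;
}

// Using the strings from the example above:
Console.WriteLine(string.Join(" ", FindUnmappedChars("Čeština, øðrum", "Ce?tina, o?rum"))); // š ð
```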

In my project I use a substitution dictionary (updated manually at runtime by the user for unknown words). When all your transliterations are single characters, you do not need any special handling, but when there are cases like "ßs" → "ss" (and not 'ß' + 's' = "ss" + 's' = "sss"), you will need a sorted list of string substitutions that is processed before the single-character substitutions. The list should be sorted by string length (longest first), not alphabetically.
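A sketch of that idea, with a hypothetical table (the entries are illustrative, not a complete set); multi-character keys win because the list is ordered by key length, descending:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Apply longer substitutions first, so "ßs" -> "ss" wins over
// 'ß' -> "ss" (which would otherwise produce "sss").
static string Transliterate(string input, Dictionary<string, string> table)
{
    foreach (var pair in table.OrderByDescending(p => p.Key.Length))
        input = input.Replace(pair.Key, pair.Value);
    return input;
}

var table = new Dictionary<string, string>
{
    ["ßs"] = "ss",
    ["ß"] = "ss",
    ["æ"] = "ae",
    ["œ"] = "oe",
    ["ø"] = "o",
    ["ð"] = "d",
};

Console.WriteLine(Transliterate("Straße", table)); // "Strasse"
Console.WriteLine(Transliterate("øðrum", table));  // "odrum"
```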

Remarks:

  1. In your case there is probably no problem of ambiguous transcription (明日 = "ashita" or "asu", or perhaps a different word according to the surrounding characters), but you should check whether that really is so.

  2. In my project I found out that there are programs that store files with the wrong encoding: the downloader gets the correct file name in UTF-8, but the byte sequence is then interpreted as Encoding.Default (or "Encoding.DOS" [symbolic name], or another code page for zipped files). Therefore it would be good to test the file names for this type of error.

See how to test for invalid file name encoding: https://stackoverflow.com/a/19068371/2826535

  3. Only to complete the answer:

Unicode normalisation based "remove accents" method: https://stackoverflow.com/a/3288164/2826535

Julo