I have read the Spolsky article twice, read this question too, and tried a lot. Now I'm here.

I created a tarball of a directory structure on a Linux machine with an ISO-8859-1 locale and untarred it on Windows with 7-Zip. As a result, the filenames are garbled when I view them in Windows Explorer (and in my C# program, too): where I expect to see a German umlaut ü, there is a ³. No wonder, because the filenames were written to the tar file using the ISO-8859-1 code page and Windows obviously does not know about this.

I want to fix this by renaming the files to their correct names. So I think I have to tell the program "read the filename, think of it as ISO-8859-1 and return every character as UTF-16 character."

My code to find the correct filename:

void Main()
{
    string[] files = Directory.GetFiles(@"C:\test", @"*", SearchOption.AllDirectories);
    var e1 = Encoding.GetEncoding("ISO-8859-1");
    var e2 = Encoding.GetEncoding("UTF-16");
    foreach (var f in files)
    {
        Console.WriteLine($"Source: {f}");
        var source = e1.GetBytes(f);
        var dest = Encoding.Convert(e1, e2, source);
        Console.WriteLine($"Result: {e2.GetString(dest)}");
    }
}

Result - nothing happened:

Source: C:\test\Brief-mrl³.odt
Result: C:\test\Brief-mrl³.odt

Expected result:

Source: C:\test\Brief-mrl³.odt
Result: C:\test\Brief-mrlü.odt

When I exchange e1 and e2 I get weird results. My brain hurts. What am I not getting?

Edit: I know that the mistake was made earlier, but now I have wrong filenames on the Windows machine that I need to correct. However, it might not be solvable via the Encoding class. I found this blog post, in which the author states:

It turns out, this isn't a problem with the encoding at all, but the same character address meaning different things to different character sets.

In conclusion, he wrote a method to replace the characters between 130 and 173 with specific, different characters. This does not look straightforward to me, but is it possible that this is the only way? Can anyone comment on this, please?

Wolfgang Jacques
  • Your terminal probably doesn't support `UTF-16` encoding – Tal Mar 01 '20 at 15:06
  • Which terminal? I don't think this is the point here. tar shouldn't care about the terminal I use, should it? – Wolfgang Jacques Mar 01 '20 at 15:18
  • After the conversion, write these strings to a file (it will use UTF-8 by default, which you should have used from the beginning) and see what you get. Also try code page 1252 instead of `ISO-8859-1`. – Jimi Mar 01 '20 at 15:25
  • The terminal obviously won't display `UTF-16` when using `Console.WriteLine`, and the same goes for Windows File Explorer. On the other hand, `7zip`'s explorer will display them properly. – Tal Mar 01 '20 at 15:39
  • @Tal No, `7zip's` explorer in Windows shows the wrong name, too - same as Windows explorer. – Wolfgang Jacques Mar 01 '20 at 20:13

1 Answer

After some more reading I got the solution myself. This excellent article helped. The point is: Once a wrong encoding was used, you can only guess (or have to know) what went wrong exactly. If you know, you can revert the whole thing in code.

void Main()
{
    // We get the source string e.g. reading files from a directory. We see a "³" when 
    // we expect a German umlaut "ü". The reason can be a poorly configured smb share
    // on a Linux server or other problems.
    string source = "M³nch";

    // We are in a .NET program, so the source string (here in the 
    // program) is Unicode in UTF-16 encoding. I.e., the codepoints 
    // M, ³, n, c and h are encoded in UTF-16.

    byte[] bytesFromSource = Encoding.Unicode.GetBytes(source);
    // The source encoding is UTF-16, hence we get two bytes per character.

    // We accidentally worked with the OEM850 code page, so we now have to look up the bytes of
    // the codepoints in the OEM850 code page: we convert bytesFromSource to the wrong code page.
    byte[] bytesInWrongCodepage = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(850), bytesFromSource);

    // Here's the trick: although we converted to OEM850, we now treat the bytes as ISO-8859-1
    // and convert them from ISO-8859-1 to Unicode.
    byte[] bytesFromCorrectCodepage = Encoding.Convert(Encoding.GetEncoding("ISO-8859-1"), Encoding.Unicode, bytesInWrongCodepage);

    // And finally we get the right character.
    string result = Encoding.Unicode.GetString(bytesFromCorrectCodepage);

    Console.WriteLine(result); // Münch
}

CAVEAT: Do not run this method over its results. This is likely to produce non-printable characters or other mayhem.
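
For reference, the same round trip can probably be written more compactly with `GetBytes` and `GetString`. This is only a sketch under the same assumption as above (the names were decoded as code page 850 but were really ISO-8859-1). `MojibakeFixer` and `FixName` are names I made up for illustration. The `RegisterProvider` call is, as far as I know, only needed on .NET Core / .NET 5+, where code page 850 is not available out of the box.

```csharp
using System;
using System.Text;

public static class MojibakeFixer
{
    // Hypothetical helper (not part of .NET): takes a name that was wrongly
    // decoded as code page 850 and decodes the same bytes as ISO-8859-1.
    public static string FixName(string garbled)
    {
        // Makes code page 850 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Exception fallbacks: throw instead of silently writing '?' when a
        // character does not exist in code page 850.
        Encoding wrong = Encoding.GetEncoding(
            850, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        Encoding right = Encoding.GetEncoding("ISO-8859-1");

        // Step 1: recover the bytes as they were stored in the tarball.
        byte[] originalBytes = wrong.GetBytes(garbled);

        // Step 2: decode those bytes with the encoding they were written in.
        return right.GetString(originalBytes);
    }

    public static void Main()
    {
        // "\u00B3" is the superscript three that shows up instead of "ü".
        Console.WriteLine(FixName("M\u00B3nch")); // Münch
    }
}
```

The exception fallback only catches characters that have no byte in code page 850. It does not detect a name that has already been fixed, so the caveat above still applies.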

Wolfgang Jacques
  • @https://stackoverflow.com/users/423780/steve-mcgill You pointed me to the solution. I think it's worthwhile to compare it to the one from your blog post. – Wolfgang Jacques Mar 02 '20 at 11:19