I have read the Spolsky article twice, as well as this question, and tried a lot of things. Now I'm here.
I created a tarball of a directory structure on a Linux machine with locale ISO-8859-1 and untarred it on Windows with 7-Zip. As a result, the filenames are garbled when I view them in Windows Explorer (and in my C# program, too): where I expect to see the German umlaut ü, there is a ³ instead. No wonder, because the filenames were written to the tar file using the ISO-8859-1 code page and Windows obviously does not know about this.
I want to fix this by renaming the files to their correct names. So I think I have to tell the program: "read the filename, treat it as ISO-8859-1, and return every character as a UTF-16 character."
My code to find the correct filename:
void Main()
{
    string[] files = Directory.GetFiles(@"C:\test", @"*", SearchOption.AllDirectories);
    var e1 = Encoding.GetEncoding("ISO-8859-1");
    var e2 = Encoding.GetEncoding("UTF-16");
    foreach (var f in files)
    {
        Console.WriteLine($"Source: {f}");
        var source = e1.GetBytes(f);
        var dest = Encoding.Convert(e1, e2, source);
        Console.WriteLine($"Result: {e2.GetString(dest)}");
    }
}
Result: nothing happened.
Source: C:\test\Brief-mrl³.odt
Result: C:\test\Brief-mrl³.odt
Expected result:
Source: C:\test\Brief-mrl³.odt
Result: C:\test\Brief-mrlü.odt
When I swap e1 and e2, I get weird results. My brain hurts. What am I not getting?
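To narrow down why nothing changes, here is a minimal sketch that isolates the loop body above into a helper (the names `RoundTripDemo` and `RoundTrip` are made up for illustration). It suggests the code only performs a round trip: the string is encoded as ISO-8859-1, those bytes are converted to UTF-16, and the UTF-16 bytes are decoded again, so the ³ that is already in the .NET string simply comes back unchanged.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Hypothetical helper that mirrors the loop body of the original code.
    public static string RoundTrip(string name)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        Encoding utf16 = Encoding.GetEncoding("UTF-16");

        // '³' (U+00B3) becomes the single byte 0xB3 in ISO-8859-1 ...
        byte[] source = latin1.GetBytes(name);
        // ... which Convert turns back into U+00B3 as UTF-16 bytes ...
        byte[] dest = Encoding.Convert(latin1, utf16, source);
        // ... and decoding those bytes yields the same string again.
        return utf16.GetString(dest);
    }

    public static void Main()
    {
        string garbled = "Brief-mrl\u00B3.odt"; // \u00B3 is '³'
        Console.WriteLine(RoundTrip(garbled) == garbled);
    }
}
```

If that reading is right, no pair of encodings passed to `Encoding.Convert` in this way can change the name, because the wrong decoding happened before the string ever reached the program.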
Edit: I know that the mistake was made earlier, but now I have wrong filenames on the Windows machine that I need to correct. However, it might not be solvable via the Encoding class. I found this blog post, in which the author states:
It turns out, this isn't a problem with the encoding at all, but the same character address meaning different things to different character sets.
In conclusion, he wrote a method to replace the characters between 130 and 173 with specific, different characters. This does not look straightforward to me, but is it possible that this is the only way? Can anyone comment on this, please?
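One alternative to a hand-written replacement table might be to undo the wrong decoding step directly. The sketch below rests on an assumption I have not verified: that 7-Zip interpreted the raw filename bytes using the OEM code page 850. That would at least be consistent with the symptom, since byte 0xFC is ü in ISO-8859-1 and ³ in code page 850. The names `FileNameRepair` and `Repair` are hypothetical.

```csharp
using System;
using System.Text;

public static class FileNameRepair
{
    // Hypothetical helper: recover the original bytes by encoding the garbled
    // name with the code page that was (assumed to be) used by mistake, then
    // decode those bytes with the encoding the names were really written in.
    public static string Repair(string garbledName)
    {
        // Needed on .NET Core / .NET 5+ to make code page 850 available;
        // on .NET Framework, code page 850 is available without this call.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding wrongGuess = Encoding.GetEncoding(850);       // assumption: OEM code page used on extraction
        Encoding actual = Encoding.GetEncoding("ISO-8859-1");  // locale of the Linux machine

        byte[] originalBytes = wrongGuess.GetBytes(garbledName); // '³' -> 0xFC
        return actual.GetString(originalBytes);                  // 0xFC -> 'ü'
    }

    public static void Main()
    {
        // \u00B3 is '³'; the repaired name should contain 'ü' (U+00FC).
        Console.WriteLine(Repair("Brief-mrl\u00B3.odt"));
    }
}
```

Under that assumption, `Repair` maps `Brief-mrl³.odt` to `Brief-mrlü.odt`, and the result could then be passed to `File.Move` to rename the file. Any character in a garbled name that does not exist in code page 850 would be encoded as `?`, so it is probably worth checking the output before renaming anything.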