I have a zip archive with folders inside. The folder names contain Unicode characters (Georgian letters). When I extract it, I get the wrong folder names. For example:

Folder Name in Archive: 6001 SAHIN INOX 192MM სახელური

Folder Name Extracted: 6001 SAHIN INOX 192MM fossyhew

The archive was created on a different machine from the one where I'm trying to extract it. Here is my code with all the options I've tried; none of them worked.

using System;
using System.IO.Compression;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        var zipFilePath = "1.zip";
        ZipFile.ExtractToDirectory(zipFilePath, AppDomain.CurrentDomain.BaseDirectory);
        // Other variants tried, all with the same result:
        //ZipFile.ExtractToDirectory(zipFilePath, AppDomain.CurrentDomain.BaseDirectory, Encoding.UTF8);
        //ZipFile.ExtractToDirectory(zipFilePath, AppDomain.CurrentDomain.BaseDirectory, new UTF8Encoding());
        //ZipFile.ExtractToDirectory(zipFilePath, AppDomain.CurrentDomain.BaseDirectory, Encoding.GetEncoding(1252));
        //ZipFile.ExtractToDirectory(zipFilePath, AppDomain.CurrentDomain.BaseDirectory, Encoding.GetEncoding(850));
    }
}

Archive URL: 1.zip

Project URL: Project.zip

Any ideas?

Michael Samteladze
  • That is super-odd. I can't find a plausible way that 'სახელური' can be mojibake'd to 'fossyhew' – canton7 Apr 13 '21 at 15:50
  • Agreed! After a day of racking my brain, I gave up and wrote a question here – Michael Samteladze Apr 13 '21 at 15:55
  • For context, `ს` is `E1 83 A1` in UTF-8 and `10 E1` in UTF-16 -- no matter how you spin it, I can't find a way to make `f` in any encoding from those, and I can't find any code pages which support Georgian (apart from a weird Mac OS one, which doesn't line up anyway). From googling, the relationships between the letters in `სახელური` and the letters in `fossyhew` are different too (based on what google says the order of the letters in the Georgian alphabet is) – canton7 Apr 13 '21 at 15:57
  • Have you tried .NET 5? I know some changes were made to file path handling in ZipArchive and friends as part of the move to cross-platform – canton7 Apr 13 '21 at 15:59
  • My project is .NET Framework. I've just downloaded and tried System.IO.Compression.ZipFile from nuget. Same result – Michael Samteladze Apr 13 '21 at 16:03
  • So do not use Encoding.UTF8. The code should work fine without specifying an encoding – jdweng Apr 13 '21 at 16:13
  • Unfortunately it doesn't; that's how I started – Michael Samteladze Apr 13 '21 at 16:22
  • (specifying UTF8 is redundant: ZipArchive defaults to using it if you don't pass an encoding). Just try running it properly on net5.0 - you can create a simple console app to test. You'll need a [mcve] to open an issue with the runtime anyway – canton7 Apr 13 '21 at 17:01
  • I created a simple console app on .NET 5. Same result – Michael Samteladze Apr 13 '21 at 19:33
  • See duplicate for details on how text encoding works for .zip archive entries. – Peter Duniho Apr 14 '21 at 07:09

2 Answers

I just spent an hour debugging into the .NET source to see what's going on.

The first thing to notice is that the filename in the ZIP directory header contains "fossyhew" -- that's where the value is coming from!

[Image: Debugging fossyhew]

If we take a look at the extra fields for the entry in the directory header, we can see that there's an entry with tag 0x7075, and this contains your UTF-8 filename:

[Image: Debugging Georgian characters]

Digging into the spec, 0x7075 is "Info-ZIP Unicode Path Extra Field" (see section 4.6.9), which is an extension that contains the UTF-8 version of the filename:

The UnicodeName is the UTF-8 version of the contents of the File Name field in the header. As UnicodeName is defined to be UTF-8, no UTF-8 byte order mark (BOM) is used. The length of this field is determined by subtracting the size of the previous fields from TSize. If both the File Name and Comment fields are UTF-8, the new General Purpose Bit Flag, bit 11 (Language encoding flag (EFS)), can be used to indicate that both the header File Name and Comment fields are UTF-8 and, in this case, the Unicode Path and Unicode Comment extra fields are not needed and SHOULD NOT be created. Note that, for backward compatibility, bit 11 SHOULD only be used if the native character set of the paths and comments being zipped up are already in UTF-8. It is expected that the same file name storage method, either general purpose bit 11 or extra fields, be used in both the Local and Central Directory Header for a file.

Thing is, ZipArchive has no support for reading the filename from the extra fields -- the only reason they're read is if you're updating an archive, so they can be written back. ZipArchive does however support that general purpose bit 11 (which it just uses to force UTF-8 decoding).

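So, in summary, there are two ways to set a UTF-8 filename in ZIP: set general purpose bit 11 and store the header filename itself as UTF-8, or keep a legacy-encoded header filename and add the UTF-8 name in the Info-ZIP Unicode Path extra field. Your archive uses the second approach: it specifies the filename both as Code Page 437 (which is where "fossyhew" comes from) and as UTF-8 in an extra field. ZipArchive has no support for reading from this extra field.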
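
To make the layout of that extra field concrete, here is a minimal C# sketch that walks a raw extra-field blob and pulls the UnicodeName out of the 0x7075 record. This is not something ZipArchive exposes: the names `InfoZipUnicodePath` and `ReadOrNull` are made up for illustration, and getting hold of the raw extra-field bytes (for example by parsing the central directory yourself) is left to the caller. The NameCRC32 check is deliberately skipped to keep the sketch short.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of System.IO.Compression.
static class InfoZipUnicodePath
{
    const int UnicodePathTag = 0x7075; // "up" in the ZIP spec, section 4.6.9

    // Returns the UTF-8 name stored in the 0x7075 record, or null if none is found.
    public static string ReadOrNull(byte[] extra)
    {
        int pos = 0;
        // Each extra-field record is: tag (2 bytes LE), size (2 bytes LE), data (size bytes).
        while (pos + 4 <= extra.Length)
        {
            int tag = extra[pos] | (extra[pos + 1] << 8);
            int size = extra[pos + 2] | (extra[pos + 3] << 8);
            int data = pos + 4;
            if (data + size > extra.Length)
                break; // truncated record, stop rather than read past the buffer

            // Record data: Version (1 byte, currently 1), NameCRC32 (4 bytes),
            // UnicodeName (the remaining size - 5 bytes, UTF-8, no BOM).
            if (tag == UnicodePathTag && size >= 5 && extra[data] == 1)
                return Encoding.UTF8.GetString(extra, data + 5, size - 5);

            pos = data + size;
        }
        return null;
    }
}
```

If the record is present, the returned string is the name to prefer over the header filename. A fuller implementation would first compare NameCRC32 with the CRC-32 of the header filename and ignore the record on a mismatch, which is what the spec asks for.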

canton7
  • First of all, thank you for your research! I could see the same results in the Visual Studio debugger too, but when I just open this archive with WinRar, there are no issues at all. Everything looks like it is supposed to. ( https://ibb.co/3f6XbQv ) This archive was not created programmatically; it's an archive from a customer's machine. The customer is supposed to just create a zip with images and upload it to the website, where I match image owners by folder name – Michael Samteladze Apr 14 '21 at 08:21
  • WinRar is looking at the Info-ZIP Unicode Path Extra Field, and ZipArchive isn't. That's how it is, unfortunately – canton7 Apr 14 '21 at 08:56

As canton7 has already pointed out, this zip file is using the standard filename field to store a 7-bit version of the filename (fossyhew) and using the `up` extra field (Info-ZIP Unicode Path) to store the filename in Unicode (სახელური).

Here is a partial dump of the metadata in 1.zip, created using zipdetails:

A83CF8 000004 50 4B 01 02 CENTRAL HEADER #E     02014B50
A83CFC 000001 1F          Created Zip Spec      1F '3.1'
A83CFD 000001 00          Created OS            00 'MS-DOS'
A83CFE 000001 0A          Extract Zip Spec      0A '1.0'
A83CFF 000001 00          Extract OS            00 'MS-DOS'
A83D00 000002 00 00       General Purpose Flag  0000
A83D02 000002 00 00       Compression Method    0000 'Stored'
A83D04 000004 88 7B 8D 52 Last Mod Time         528D7B88 'Tue Apr 13 14:28:16 2021'
A83D08 000004 00 00 00 00 CRC                   00000000
A83D0C 000004 00 00 00 00 Compressed Length     00000000
A83D10 000004 00 00 00 00 Uncompressed Length   00000000
A83D14 000002 1F 00       Filename Length       001F
A83D16 000002 5C 00       Extra Length          005C
A83D18 000002 00 00       Comment Length        0000
A83D1A 000002 00 00       Disk Start            0000
A83D1C 000002 00 00       Int File Attributes   0000
                          [Bit 0]               0 'Binary Data'
A83D1E 000004 10 00 00 00 Ext File Attributes   00000010
                          [Bit 4]               Directory
A83D22 000004 60 3A 7B 00 Local Header Offset   007B3A60
A83D26 00001F 36 30 30 31 Filename              '6001 SAHIN INOX 192MM fossyhew/'
              20 53 41 48
              49 4E 20 49
              4E 4F 58 20
              31 39 32 4D
              4D 20 66 6F
              73 73 79 68
              65 77 2F
A83D45 000002 0A 00       Extra ID #0001        000A 'NTFS FileTimes'
A83D47 000002 20 00         Length              0020
A83D49 000004 00 00 00 00   Reserved            00000000
A83D4D 000002 01 00         Tag1                0001
A83D4F 000002 18 00         Size1               0018
A83D51 000008 D8 08 3D 19   Mtime               01D73058193D08D8 'Tue Apr 13 11:28:16
              58 30 D7 01                       2021 940463200ns'
A83D59 000008 D8 08 3D 19   Ctime               01D73058193D08D8 'Tue Apr 13 11:28:16
              58 30 D7 01                       2021 940463200ns'
A83D61 000008 D8 08 3D 19   Atime               01D73058193475A0 'Tue Apr 13 11:28:16
              58 30 D7 01                       2021 884265600ns'
A83D69 000002 75 70       Extra ID #0002        7075 'up: Info-ZIP Unicode Path'
A83D6B 000002 34 00         Length              0034
A83D6D 000001 01            Version             01
A83D6E 000004 0A 3F BF 78   NameCRC32           78BF3F0A
A83D72 00002F 36 30 30 31   UnicodeName         6001 SAHIN INOX 192MM
              20 53 41 48                       სახელური/
              49 4E 20 49
              4E 4F 58 20
              31 39 32 4D
              4D 20 E1 83
              A1 E1 83 90
              E1 83 AE E1
              83 94 E1 83
              9A E1 83 A3
              E1 83 A0 E1
              83 98 2F
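
The part of this dump that matters here can be approximated in C#. The sketch below locates the end-of-central-directory record, walks each central header, and returns the stored filename next to the `up` name when one exists. It rests on several assumptions: `CentralDirectoryNames` and `Read` are hypothetical names, the whole archive fits in memory, there is no ZIP64 or multi-disk layout, and the stored name is decoded as ASCII purely for display.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of System.IO.Compression.
static class CentralDirectoryNames
{
    static int U16(byte[] b, int o) => b[o] | (b[o + 1] << 8);
    static uint U32(byte[] b, int o) =>
        (uint)b[o] | ((uint)b[o + 1] << 8) | ((uint)b[o + 2] << 16) | ((uint)b[o + 3] << 24);

    // Returns one pair per central header, in order:
    // Key = stored filename, Value = Info-ZIP Unicode Path name (null if absent).
    public static List<KeyValuePair<string, string>> Read(byte[] zip)
    {
        // Scan backwards for the end-of-central-directory signature (PK 05 06).
        int eocd = zip.Length - 22;
        while (eocd >= 0 && U32(zip, eocd) != 0x06054B50) eocd--;
        if (eocd < 0) throw new InvalidDataException("End of central directory not found.");

        int count = U16(zip, eocd + 10);     // total number of entries
        int pos = (int)U32(zip, eocd + 16);  // offset of the first central header
        var result = new List<KeyValuePair<string, string>>();

        for (int i = 0; i < count; i++)
        {
            if (U32(zip, pos) != 0x02014B50)
                throw new InvalidDataException("Central header signature expected.");

            int nameLen = U16(zip, pos + 28);
            int extraLen = U16(zip, pos + 30);
            int commentLen = U16(zip, pos + 32);
            int extraStart = pos + 46 + nameLen;
            int extraEnd = extraStart + extraLen;

            string stored = Encoding.ASCII.GetString(zip, pos + 46, nameLen);
            string unicode = null;

            // Walk the extra-field records: tag (2), size (2), data (size).
            for (int p = extraStart; p + 4 <= extraEnd; )
            {
                int tag = U16(zip, p), size = U16(zip, p + 2);
                if (p + 4 + size > extraEnd) break; // truncated record
                // 0x7075 data: Version (1), NameCRC32 (4), UnicodeName (size - 5).
                if (tag == 0x7075 && size >= 5 && zip[p + 4] == 1)
                    unicode = Encoding.UTF8.GetString(zip, p + 9, size - 5);
                p += 4 + size;
            }

            result.Add(new KeyValuePair<string, string>(stored, unicode));
            pos = extraEnd + commentLen;
        }
        return result;
    }
}
```

Called as `CentralDirectoryNames.Read(File.ReadAllBytes("1.zip"))`, this should pair each stored name with its Unicode counterpart. One possible workaround would then be to extract entries manually and use the Unicode name for the destination path whenever it is not null. That relies on matching these pairs to `ZipArchive.Entries` by position, which is an assumption worth verifying before depending on it.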
pmqs
  • The archive was created on the customer's machine with WinRar. Can an old or damaged version of WinRar cause this kind of complication? – Michael Samteladze Apr 14 '21 at 08:24
  • I wouldn't describe this as being created with damaged software. The zip file is fine as long as you have the correct toolset to work with it. The issue is that this zip file uses a feature that doesn't guarantee maximum interoperability, especially with zip libraries used in various programming languages. – pmqs Apr 14 '21 at 21:16
  • I asked the customer to download the latest version of WinRar, and everything was fixed after that – Michael Samteladze Apr 15 '21 at 10:07