My C# application, running on Windows Server 2008 R2, needs to extract any ZIP file created on Windows or Mac OS X. I am currently using the DotNetZip library.

But this library has trouble extracting Mac OS X ZIP archives with special Nordic characters in the filenames. I have tried specifying different encodings, including `macintosh`, in the `Encoding.GetEncoding(string)` method.

The Windows built-in zip tool also messes up the special characters, as does WinRAR 3.x. WinRAR 4.x seems to be the only tool that gets it right.

Is it at all possible to extract such a ZIP archive correctly with any available C# ZIP library?

UPDATE: Here is an example of a zip archive created with the default Zip function in Mac OS X. The first screenshot shows how the Windows Zip function cannot decode the filenames. The second screenshot shows the archive opened with WinRAR 4.11:

[Screenshot: archive opened with Windows 7 Zip]
[Screenshot: archive opened with WinRAR 4.11]

Download sample ZIP archive from Mac OSX

Lars Fastrup
  • Can the DotNetZip demo UI tool manage the extraction? If WinRAR 4 can do it, why not use that, via the Process classes? – Mesh May 25 '12 at 12:11
  • Have you tried the natural "Nordic" encoding (I don't know what it might be)? It sounds like you are running into one of the known pitfalls of zip files. If the filenames are encoded as neither IBM437 nor UTF-8, there is no way to automatically determine which encoding they use. It is possible to decode such zip files, but when reading, you need to specify the encoding used during creation. That it was created on a Mac is not important. The relevant piece is the text encoding used during creation. In DotNetZip, there is an overload of `ZipFile.Read()` that lets you specify this. – Cheeso May 25 '12 at 20:13
  • Can you give a specific example of a character that DotNetZip is getting wrong—what the Mac thinks it should be vs. what DotNetZip says it is? Also, do you know what encoding the zipfile was created in? (And if not, can you post it somewhere so someone else can figure it out?) Is it possible that this is just an issue of NFD vs. NFC UTF-8? – abarnert May 25 '12 at 22:14
  • @Adrian I would rather not depend on WinRAR being installed as the application is deployed to many customer installations. – Lars Fastrup May 29 '12 at 06:42
  • @Adrian Just tried the DotNetZip demo UI tool and it works with UTF-8 encoding. I have tried that before in my code without luck - but I will now inspect the source of this tool to get it right. Thanks for the suggestion. – Lars Fastrup May 29 '12 at 06:57
  • Thanks for posting the additional info; that makes it clear that this is an NFD vs. NFC issue. Most Mac code stores Unicode in NFD, where Å (\u212B) is stored as Latin A (\u0041) followed by nonspacing ̊ (\u030A); a lot of Windows code doesn't understand NFD (and, for that matter, won't even treat Å (\u212B) as equivalent to Å (\u00C5)). You either need to find a zip library that does Unicode properly, or post-process the output of the library to fix all the filenames. – abarnert May 29 '12 at 17:14
  • Info-Zip (http://www.info-zip.org/)'s Unzip 6.0/Zip 3.0 and later should handle Mac-generated Unicode filenames, and I believe the package comes with DLLs for Windows. Also, lower-level zip libraries like http://www.nih.at/libzip/ will give you plain char* paths for you to interpret however you want. – abarnert May 29 '12 at 17:23
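The post-processing step suggested in the comments above (recomposing NFD filenames to NFC after extraction) could be sketched roughly as follows. This is a minimal sketch, not something from the thread: the class name `FileNameNormalizer` and its methods are hypothetical, it only renames files (not directories), and it assumes no file with the recomposed name already exists in the same folder.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of DotNetZip.
static class FileNameNormalizer
{
    // Recomposes a name to NFC, e.g. "A" + U+030A becomes U+00C5.
    public static string ToNfc(string name)
    {
        return name.Normalize(NormalizationForm.FormC);
    }

    // Renames every file under root whose name is not already in NFC.
    // Assumption: directory names are left untouched, and File.Move
    // will throw if the target name already exists.
    public static void NormalizeFiles(string root)
    {
        foreach (string path in Directory.GetFiles(root, "*", SearchOption.AllDirectories))
        {
            string name = Path.GetFileName(path);
            string recomposed = ToNfc(name);
            if (!string.Equals(name, recomposed, StringComparison.Ordinal))
            {
                File.Move(path, Path.Combine(Path.GetDirectoryName(path), recomposed));
            }
        }
    }
}
```

One possible use would be to call `FileNameNormalizer.NormalizeFiles(targetFolder)` right after the archive has been extracted into `targetFolder`.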

1 Answer

Did you check out SevenZipSharp? It uses the 7-Zip DLL to extract archives, and IMO 7-Zip is the best archive handler.

Update:

I was digging into the Example zip and DotNetZip.

With DotNetZip-WinFormsTool.exe, provided in the DotNetZip binaries, you can see every available encoding in the drop-down box.

I tried some of them, including UTF-8, Zip Default (IBM437), UTF-32, Unicode, etc.

I got the best result with the UTF-8 encoding: the same reading as WinRAR.

Moreover, IMO only WinRAR uses UTF-8 for all archives, whereas other Zip tools such as 7-Zip and the default Explorer Zip viewer use the Zip default encoding, which causes them to read the filenames incorrectly!

So your best option is to stick with DotNetZip and use code like this:

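// Requires: using System.Text; using Ionic.Zip;
// Read the entry names as UTF-8 instead of the Zip default (IBM437).
using (ZipFile zf = new ZipFile(Application.StartupPath + "\\Arkiv.zip", new UTF8Encoding()))
{
    zf.ExtractAll(Application.StartupPath + "\\Arkiv\\");
}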

I have tested this code and it works. Note that after extraction the filenames are shown decoded as UTF-8 in Explorer, but if you open the zip file directly, Explorer uses the Zip default encoding.

[Image: the DotNetZip tool using UTF-8 encoding]

Update 2:

For auto-detection of a text's encoding, you can refer to this SO question, this CodeProject article, and UDE, a C# port of the Mozilla Universal Charset Detector.
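As a simpler alternative to a full charset detector, one common heuristic is to try a strict UTF-8 decode of the raw filename bytes and fall back to a legacy code page only if that fails. The sketch below illustrates the idea under stated assumptions: `ZipNameDecoder` is a hypothetical helper, and obtaining the raw name bytes of an entry is left to the caller (it is not shown here how, or whether, a given zip library exposes them).

```csharp
using System;
using System.Text;

// Hypothetical helper illustrating a "strict UTF-8 first" heuristic.
static class ZipNameDecoder
{
    // Second argument 'true' makes the decoder throw on invalid byte
    // sequences instead of silently substituting U+FFFD.
    private static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Decodes rawName as UTF-8 when the bytes form valid UTF-8;
    // otherwise decodes them with the caller-supplied fallback encoding.
    public static string Decode(byte[] rawName, Encoding fallback)
    {
        try
        {
            return StrictUtf8.GetString(rawName);
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(rawName);
        }
    }
}
```

On the .NET Framework the fallback would typically be `Encoding.GetEncoding(437)`, the Zip default. The heuristic is not foolproof: a name written in a legacy code page can occasionally happen to be valid UTF-8, although that is rare for names containing accented characters.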

Writwick
  • Yes, I tried 7-Zip on the command line! But it did not seem to do the trick either. – Lars Fastrup May 29 '12 at 06:53
  • Can you provide a sample file so that I can find out what is wrong? – Writwick May 29 '12 at 07:47
  • Sure - I have updated the question with a download link to a sample Zip archive. – Lars Fastrup May 29 '12 at 13:19
  • Great - thank you for all the effort so far. Now I just need to find a way to autodetect the encoding as it can differ depending on the tool and platform used to create the zip. How does WinRAR do this? – Lars Fastrup May 30 '12 at 07:43
  • For now I don't know where in the code you would detect the `Encoding` of the filenames, but if you can wait for some time, I can research it. [I have to reinstall VS2010] – Writwick May 30 '12 at 08:12
  • Thanks - appreciate it! I wonder why a ZIP archive does not contain information about the encoding needed to properly decode it again. – Lars Fastrup May 30 '12 at 14:09