3

I am creating a *.zip using Ionic.Zip. However, my *.zip contains the same files multiple times, sometimes even 20 copies, and the ZIP format does not take advantage of the redundancy at all.

What's worse, Ionic.Zip sometimes crashes with an OutOfMemoryException, since I am compressing the files into a MemoryStream.

Is there a .NET library for compressing that takes advantage of redundancy between files?

Users decompress the files on their own, so it cannot be an exotic format.

Tomas Grosup
  • 6,396
  • 3
  • 30
  • 44
  • 1
    Why do you need to store the same file multiple times? – We Are All Monica Aug 28 '13 at 15:01
  • They are in different folders. The user modifies the files he wants and then sends them back (and he may want to modify just some of the versions) – Tomas Grosup Aug 28 '13 at 15:07
  • In general you should try to eliminate duplication of information. If the same file is used for multiple purposes then you could create a mapping file that indicates which files are used for each purpose. The user could then modify the mapping file to indicate that a new file of their choice should be used for a given purpose. – We Are All Monica Aug 28 '13 at 15:09
  • The user wants to see it as many individual files in a standard archive. – Tomas Grosup Aug 28 '13 at 15:41
  • The user is wrong :) Anyway, zip format should take advantage of redundancy between files extremely well. – We Are All Monica Aug 28 '13 at 16:32
  • 2
    You are wrong, zip format compresses each file individually. – Tomas Grosup Aug 29 '13 at 08:33

4 Answers

4

I ended up creating a tar.gz using the SharpZipLib library. With this solution, a single file compresses to a 3 kB archive; 20 identical copies compress to only 6 kB, whereas the .zip was 64 kB.

Nuget:

Install-Package SharpZipLib

Usings:

using ICSharpCode.SharpZipLib.GZip;
using ICSharpCode.SharpZipLib.Tar;

Code:

var output = new MemoryStream();
using (var gzip = new GZipOutputStream(output))
using (var tar = TarArchive.CreateOutputTarArchive(gzip))
{
    foreach (var file in files)
    {
        var tarEntry = TarEntry.CreateEntryFromFile(file);
        tar.WriteEntry(tarEntry, false);
    }

    // Keep the underlying MemoryStream open after the tar and
    // gzip wrappers are disposed.
    tar.IsStreamOwner = false;
    gzip.IsStreamOwner = false;
}
Tomas Grosup
  • 6,396
  • 3
  • 30
  • 44
2

No, there is no such API exposed by the well-known libraries (such as GZip, PPMd, Zip, LZMA). They all operate per file (or, more precisely, per stream of bytes).

You could concatenate all the files, e.g. using a tarball format, and then apply a compression algorithm to the result.

Or, it's trivial to implement your own check: compute a hash for each file and store it in a hash-to-filename dictionary. If the hash matches for a later file, you can decide what to do, such as ignore that file completely, or perhaps note its name and save it in another file to mark the duplicates.
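The hash-based check described above can be sketched in a few lines. This is an illustration, not part of any library; the class and method names are my own, and SHA256 from System.Security.Cryptography is used as the content hash:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

class DuplicateDetector
{
    // Maps content hash -> first file name seen with that content.
    private readonly Dictionary<string, string> seen =
        new Dictionary<string, string>();

    // Returns null if the file's content is new, otherwise the name
    // of the earlier file with identical content.
    public string FindDuplicate(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            string hash = BitConverter.ToString(sha.ComputeHash(stream));
            string original;
            if (seen.TryGetValue(hash, out original))
                return original;

            seen.Add(hash, path);
            return null;
        }
    }
}
```

When FindDuplicate returns a non-null name, you could skip adding the file to the archive and instead record the mapping in a small manifest file, as suggested in the comments above.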

oleksii
  • 35,458
  • 16
  • 93
  • 163
  • 3
    `.tar.gz` will work fine, since it archives all the files and then compresses them. This answer is technically right, since it's a two step process of using tar and then gzip, but most decompression tools handle this seamlessly. – Mike Precup Aug 28 '13 at 15:14
  • Any .NET libraries for creating a .tar.gz? – Tomas Grosup Aug 29 '13 at 08:50
  • @TomasGrosup I never programmatically used one myself, but there is a [question on this one](http://stackoverflow.com/q/3212118/706456). – oleksii Aug 29 '13 at 09:24
2

Yes, 7-zip. There is a SevenZipSharp library you could use, but in my experience, launching the compression process directly from the command line is much faster.

My personal experience: we used SevenZipSharp in a company to decompress archives up to 1 GB, and it was terribly slow until I reworked it to invoke 7-Zip directly through its command-line interface. Then it was as fast as decompressing manually in Windows Explorer.
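Shelling out to 7-Zip from .NET might look like the sketch below. The path to 7z.exe and the argument layout are assumptions to adjust for your setup; `a` adds files to an archive, and the 7z format's solid mode is what shares data between duplicate files:

```csharp
using System.Diagnostics;

class SevenZipRunner
{
    // Compresses the contents of a folder into a .7z archive by
    // launching the 7-Zip command-line executable.
    public static void Compress(string folder, string archive)
    {
        var psi = new ProcessStartInfo
        {
            // Assumed default install location; adjust as needed.
            FileName = @"C:\Program Files\7-Zip\7z.exe",
            Arguments = string.Format("a -t7z \"{0}\" \"{1}\\*\"",
                                      archive, folder),
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (var process = Process.Start(psi))
        {
            process.WaitForExit();
        }
    }
}
```

Note that .7z is not as universally supported on the receiving end as .zip or .tar.gz, which matters if users decompress the files themselves.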

Ondrej Janacek
  • 12,486
  • 14
  • 59
  • 93
  • `launching compressing process directly using command line is much faster` Nothing a good old `System.Diagnostics.Process.Start()` can't solve ;) – Nolonar Aug 28 '13 at 15:15
1

I haven't tested this, but according to one answer to How many times can a file be compressed?

If you have a large number of duplicate files, the zip format will zip each independently, and you can then zip the first zip file to remove duplicate zip information.
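A minimal sketch of this double-zip idea, using the built-in System.IO.Compression (available from .NET 4.5); the file names are placeholders, and I haven't measured how much the second pass actually saves:

```csharp
using System.IO.Compression;

class DoubleZip
{
    static void Main()
    {
        // First pass: each file is deflated independently, so
        // duplicate files are not shared between entries.
        ZipFile.CreateFromDirectory("input", "inner.zip");

        // Second pass: zip the zip, so the repeated compressed
        // blocks of the duplicate entries can themselves compress.
        using (var outer = ZipFile.Open("outer.zip", ZipArchiveMode.Create))
        {
            outer.CreateEntryFromFile("inner.zip", "inner.zip");
        }
    }
}
```

One caveat: deflate's 32 kB window limits how far apart repeated data can be and still compress, so this helps most when the duplicate files are small.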

Lie Ryan
  • 62,238
  • 13
  • 100
  • 144