18

I have a WCF web service that saves files to a folder (about 200,000 small files). After that, I need to move them to another server.

The solution I've found is to zip them and then move them.

When I adopted this solution, I tested it with 20,000 files: zipping 20,000 files took only about 2 minutes, and moving the zip is really fast. But in production, zipping 200,000 files takes more than 2 hours.

Here is my code to zip the folder:

using (ZipFile zipFile = new ZipFile())
{
    zipFile.UseZip64WhenSaving = Zip64Option.Always;
    zipFile.CompressionLevel = CompressionLevel.None;
    zipFile.AddDirectory(this.SourceDirectory.FullName, string.Empty);

    zipFile.Save(DestinationCurrentFileInfo.FullName);
}

I want to modify the WCF web service so that instead of saving to a folder, it saves to the zip.

I use the following code to test:

var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories).Where(s => s.EndsWith(".aes")).Select(f => new FileInfo(f));

foreach (var additionFile in listAes)
{
    using (var zip = ZipFile.Read(nameOfExistingZip))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        zip.AddFile(additionFile.FullName);

        zip.Save();
    }

    file.WriteLine("Delay for adding a file  : " + sw.Elapsed.TotalMilliseconds);
    sw.Restart();
}

The first file takes only 5 ms to add to the zip, but the 10,000th file takes 800 ms.

Is there a way to optimize this? Or do you have other suggestions?

EDIT

The example shown above is only for testing; in the WCF web service, I'll have different requests sending files that I need to add to the zip file. As WCF is stateless, I will have a new instance of my class with each call, so how can I keep the zip file open to add more files?

Anas
  • Have you tried playing with the settings available for creating the zip? If it is taking that much longer it could be using too strong of compression. Alternatively do you need to write out all the small files or can you define a format that allows you to write out one file? You would lose the compression, but it would be easier. – Guvante May 13 '15 at 18:49
  • Why are you opening, adding a file, saving, and closing the zip file for every add? You can call `AddFile` multiple times. – Paul Abbott May 13 '15 at 18:52
  • Is repeatedly opening and saving the file to add robustness in case the process fails partway through, so you don't lose all of the files? It's likely this repeated open/save that's steadily eating up more and more time, as the file gets larger. You can potentially reduce the overhead while still preserving some robustness by saving less frequently (such as once every 100th file). – Dan Bryant May 13 '15 at 18:52
  • I agree with @PaulAbbott, why are you updating an existing file instead of creating a new one? It will also help to use some performance counters to see disk throughput, memory pressure, and so on with the most relevant factors that might affect performance. – Oscar May 13 '15 at 18:57
  • @Anas: You begin your question by saying that you save up 200,000 files and then zip them up and move them, but you end by indicating that you're adding files to a zip file as they get uploaded. Why don't you just add files to a directory on the disk until you reach a specific threshold, and then zip them all up at once and send them over? – StriplingWarrior May 13 '15 at 19:10
  • @StriplingWarrior In the beginning, I made the assumption that if 20,000 files take 2 min to zip, 200,000 would take 20 min, but that's not the case: 200,000 files take more than 2 hours. So I thought that instead of saving to disk, I would save to the zip directly and that might save time. – Anas May 13 '15 at 19:15
  • @Anas: That makes more sense. How did your original code go about zipping up the files? – StriplingWarrior May 13 '15 at 19:19
  • @Anas: Is there any way you can take care of the zipping process asynchronously? Maybe just have the upload API worry about saving the files to disk, but then have a job run every so often in the background to zip up any files that have been added? – StriplingWarrior May 13 '15 at 19:33
  • But why all this mess? Why not program a batch process that zips all the files in the folder overnight, moves the archive to the final destination and cleans up the files? Why are you trying so hard to complicate your life? – Oscar May 13 '15 at 19:37
  • @Oscar Because we receive 200,000 files per hour, I need to clean up in real time. – Anas May 13 '15 at 19:50

5 Answers

11

I've looked at your code and immediately spotted problems. The trouble with a lot of software developers nowadays is that they don't understand how stuff works, which makes it impossible to reason about it. In this particular case you don't seem to know how ZIP files work; therefore I would suggest you first read up on how they work and attempt to break down what happens under the hood.

Reasoning

Now that we're all on the same page on how they work, let's start the reasoning by breaking down how this works using your source code; we'll continue from there on forward:

var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories).Where(s => s.EndsWith(".aes")).Select(f => new FileInfo(f));

foreach (var additionFile in listAes)
{
    // (1)
    using (var zip = ZipFile.Read(nameOfExistingZip))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        // (2)
        zip.AddFile(additionFile.FullName);

        // (3)
        zip.Save();
    }

    file.WriteLine("Delay for adding a file  : " + sw.Elapsed.TotalMilliseconds);
    sw.Restart();
}
  • (1) opens a ZIP file. You're doing this for every file you attempt to add.
  • (2) adds a single file to the ZIP file.
  • (3) saves the complete ZIP file.

On my computer this takes about an hour.

Now, not all of the file format details are relevant. We're looking for stuff that will get increasingly worse in your program.

Skimming over the file format specification, you'll notice that compression is based on Deflate which doesn't require information on the other files that are compressed. Moving on, we'll notice how the 'file table' is stored in the ZIP file:

[Image: ZIP file structure]

You'll notice here that there's a 'central directory' which stores the files in the ZIP file. It's basically stored as a 'list'. So, using this information, we can reason about the trivial way to update it when implementing steps (1)-(3) in this order:

  • Open the zip file, read the central directory
  • Append data for the (new) compressed file, store the pointer along with the filename in the new central directory.
  • Re-write the central directory.

Think about it for a moment: for file #1 you need 1 write operation; for file #2, you need to read (1 item), append (in memory) and write (2 items); for file #3, you need to read (2 items), append (in memory) and write (3 items). And so on. This basically means that your performance will go down the drain as you add more files. You've already observed this, now you know why.
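To put a rough number on it (a back-of-the-envelope estimate I'm adding, not a measurement from the question): the total number of central-directory entries written over n files is 1 + 2 + ... + n = n(n+1)/2, which for n = 200,000 comes to roughly 2 × 10^10 entry rewrites, on top of re-reading the ever-growing directory each time.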

A possible solution

The easiest fix (covered in the last section of this answer) is simply to add all files at once, but that might not work in your use case. Another solution is to implement a merge that basically merges 2 files together every time. This is more convenient if you don't have all files available when you start the compression process.

Basically the algorithm then becomes:

  1. Add a few (say, 16) files. You can toy with this number. Store this in, say, 'file16.zip'.
  2. Add more files. When you hit 16 files, you have to merge the two files of 16 items into a single file of 32 items.
  3. Merge files until you cannot merge anymore. Basically every time you have two files of N items, you create a new file of 2*N items.
  4. Goto (2).

Again, we can reason about it. The first 16 files aren't a problem, we've already established that.

We can also reason about what will happen in our program. Because we're merging 2 files into 1 file, we don't have to do as many reads and writes. In fact, if you reason about it, you'll see that you have a file of 32 entries in 2 merges, 64 in 4 merges, 128 in 8 merges, 256 in 16 merges... hey, wait, we know this sequence, it's 2^N. Again, reasoning about it we'll find that we need approximately 500 merges -- which is much better than the 200,000 operations that we started with.
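Here is a minimal sketch of a single merge step, using DotNetZip to copy every entry from one archive into another. The helper name MergeZips, the entry-copying approach, and deleting the source afterwards are my own illustrative choices (and it assumes no duplicate entry names); they are not prescribed by the library or this answer:

using System.IO;
using Ionic.Zip;

static void MergeZips(string sourceZipPath, string targetZipPath)
{
    using (var source = ZipFile.Read(sourceZipPath))
    using (var target = ZipFile.Read(targetZipPath))
    {
        target.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        foreach (ZipEntry entry in source)
        {
            // Copy the entry's bytes from the source archive into the target.
            using (var buffer = new MemoryStream())
            {
                entry.Extract(buffer);
                target.AddEntry(entry.FileName, buffer.ToArray());
            }
        }
        target.Save(); // the target now holds the entries of both archives
    }
    File.Delete(sourceZipPath); // the source has been folded into the target
}

Each call folds one archive into another; repeating it pairwise gives the 16 → 32 → 64 → ... progression described above.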

Hacking in the ZIP file

Yet another solution that might come to mind is to overallocate the central directory, creating slack space for future entries. However, this probably requires you to hack into the ZIP code and create your own ZIP file writer. The idea is that you basically overallocate the central directory to 200K entries before you get started, so that you can simply append in-place.

Again, we can reason about it: adding a file now means adding the file data and updating some headers. It won't be as fast as the original solution because you'll need random disk IO, but it'll probably work fast enough.

I haven't worked this out, but it doesn't seem overly complicated to me.

The easiest solution is the most practical

What we haven't discussed so far is the easiest possible solution: simply add all the files at once, which we can again reason about.

Implementation is quite easy, because now we don't have to do anything fancy; we can simply use the ZIP handler (I use the Ionic DotNetZip library) as-is:

static void Main()
{
    try { File.Delete(@"c:\tmp\test.zip"); }
    catch { }

    var sw = Stopwatch.StartNew();

    using (var zip = new ZipFile(@"c:\tmp\test.zip"))
    {
        zip.UseZip64WhenSaving = Zip64Option.Always;
        for (int i = 0; i < 200000; ++i)
        {
            string filename = "foo" + i.ToString() + ".txt";
            byte[] contents = Encoding.UTF8.GetBytes("Hello world!");
            zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
            zip.AddEntry(filename, contents);
        }

        zip.Save();
    }

    Console.WriteLine("Elapsed: {0:0.00}s", sw.Elapsed.TotalSeconds);
    Console.ReadLine();
}

Whoop; that finishes in 4.5 seconds. Much better.

atlaste
  • Hi @atlaste, thank you for your detailed answer. I don't understand the last solution, as I don't have all the files initially!! – Anas Aug 26 '15 at 19:33
  • @Anas You didn't say that :-) In that case I'd either store them in a temporary location first or merge ZIP files in pairs. If that doesn't work either, you can try to overallocate the ZIP directory table, which should give you a solution as well; that requires hacking in the ionic libs though. – atlaste Aug 27 '15 at 06:41
  • @Anas Still, I'd attempt to add them all at once. What you can do is combine a singleton pattern with a timer. Lock when you add files to the ZIP file to avoid concurrency issues. If the timer hits zero (f.ex. after 10 seconds) or if you hit file 200K, flush it to disk. I'd probably also implement IDisposable and uncaught exception handlers to make sure that in practically all cases the data gets flushed. Either way, a temporary location is 'safer' in the case of a power failure and stuff like that. – atlaste Aug 27 '15 at 06:45
  • Probably much easier to go with the tar route or your own simple appending file format than sticking to zip, especially as meddling with the format will cause troubles if there is any kind of CRC to ensure the archive is not corrupted. Not sure if merging files would yield substantial CPU benefit (no I/O benefit for sure). – Erwin Mayer Aug 27 '15 at 16:58
  • @ErwinMayer Sorry, but I disagree. If you read the OP's question, he states that files are moved in batch. He also explicitly asks about ZIP files, which will work just fine. As for merging, I'm not sure what you're aiming at, but the IO benefit is definitely there: it's O(n log n) instead of O(n^2) and only sequential access. I do agree that TAR files can work as well - but that means you shouldn't compress it (otherwise you lose the necessary info). – atlaste Aug 28 '15 at 06:42
  • @atlaste Yes he asked about Zip files, but since they are not compressed in his code sample, I thought it wise to take this into account as actually part of the (non)-requirements. Of course if the remote server absolutely needed ZIP files there would be no way around. You could TAR compressed files to still have some compression benefits. For the merging solution you suggest, how is it I/O O(n log n)? You have to reread the whole -uncompressed so same size- archive each time to create a new one with additional files. – Erwin Mayer Aug 28 '15 at 16:20
  • @ErwinMayer Sorry for the late reaction; I didn't notice another comment until today. Basically each step in the merging solution merges two equally sized files to produce a new file with twice the input size (e.g. 32+32 -> 64). Note that you end up with multiple files during the process (that is: iff you don't have a power of 2). If you do a merge, you therefore don't have to go through all the data in most cases. In other words: the procedure and therefore the complexity is comparable to a merge-sort (which is O(n log n)). – atlaste Mar 21 '16 at 11:21
3

I can see that you just want to group the 200,000 files into one big single file, without compression (like a tar archive). A few ideas to explore:

  1. Experiment with other file formats than Zip, as it may not be the fastest. Tar (tape archive) comes to mind (with natural speed advantages due to its simplicity); it even has an append mode, which is exactly what you are after to ensure O(1) operations. SharpCompress is a library that will allow you to work with this format (and others).

  2. If you have control over your remote server, you could implement your own file format; the simplest I can think of would be to zip each new file separately (to store the file metadata such as name, date, etc. in the file content itself), and then to append each such zipped file to a single raw bytes file. You would just need to store the byte offsets (in columns in another txt file) to allow the remote server to split the huge file into the 200,000 zipped files, and then unzip each of them to get the metadata. I guess this is also roughly what tar does behind the scenes :). A minimal sketch of this approach is shown after this list.

  3. Have you tried zipping to a MemoryStream rather than to a file, only flushing to a file when you are done for the day? Of course for back-up purposes your WCF service would have to keep a copy of the received individual files until you are sure they have been "committed" to the remote server.

  4. If you do need compression, 7-Zip (and fiddling with the options) is well worth a try.
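Here is a minimal sketch of idea (2): zip each incoming file in memory, append the zipped bytes to one big blob, and record "name;offset;length" in a side index. The paths, the semicolon-separated index format and the helper name AppendToBlob are illustrative assumptions on my part, not part of any library:

using System.IO;
using Ionic.Zip;

static void AppendToBlob(string blobPath, string indexPath, string incomingFile)
{
    // Zip the single file in memory so its name/date metadata travels with it.
    byte[] zippedBytes;
    using (var ms = new MemoryStream())
    using (var zip = new ZipFile())
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        zip.AddFile(incomingFile, string.Empty);
        zip.Save(ms);
        zippedBytes = ms.ToArray();
    }

    // Append the zipped bytes to the blob and record where they start.
    using (var blob = new FileStream(blobPath, FileMode.Append, FileAccess.Write))
    using (var index = File.AppendText(indexPath))
    {
        long offset = blob.Position; // current end of the blob
        blob.Write(zippedBytes, 0, zippedBytes.Length);
        index.WriteLine(Path.GetFileName(incomingFile) + ";" + offset + ";" + zippedBytes.Length);
    }
}

The remote server can then seek to each recorded offset, read the stated number of bytes, and unzip that slice to recover the original file and its metadata.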

Erwin Mayer
0

You are opening the file repeatedly. Why not loop through and add them all to one zip, then save it?

var listAes = Directory.EnumerateFiles(myFolder, "*.*", SearchOption.AllDirectories)
    .Where(s => s.EndsWith(".aes"))
    .Select(f => new FileInfo(f));

using (var zip = ZipFile.Read(nameOfExistingZip))
{
    foreach (var additionFile in listAes)
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        zip.AddFile(additionFile.FullName);
    }
    zip.Save();
}

If the files aren't all available right away, you could at least batch them together. So if you're expecting 200k files but have only received 10 so far, don't open the zip, add one, and then close it. Wait for a few more to come in and add them in batches.
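Here is a minimal sketch of that batching idea, assuming DotNetZip and an already-existing zip at the given path; the class name ZipBatcher, the batch size of 100 and the lack of any concurrency handling are simplifications of my own:

using System.Collections.Generic;
using Ionic.Zip;

class ZipBatcher
{
    private readonly List<string> _pending = new List<string>();
    private readonly string _zipPath;
    private const int BatchSize = 100;

    public ZipBatcher(string zipPath) { _zipPath = zipPath; }

    public void Add(string filePath)
    {
        _pending.Add(filePath);
        if (_pending.Count >= BatchSize)
            FlushBatch();
    }

    public void FlushBatch()
    {
        if (_pending.Count == 0) return;
        // Open/save the zip once per batch instead of once per file.
        using (var zip = ZipFile.Read(_zipPath))
        {
            zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
            foreach (var path in _pending)
                zip.AddFile(path);
            zip.Save();
        }
        _pending.Clear();
    }
}

FlushBatch would also need to be called once at the end (or on a timer) so the last partial batch isn't lost.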

DLeh
  • He answered this question in his comments, and his edit. He's getting the files uploaded asynchronously, over time, and he's hoping that keeping a running archive will help avoid the big 2-hour hit after all the files have been uploaded. – StriplingWarrior May 13 '15 at 19:26
0

If you are OK with the performance of 100 * 20,000 files, can't you simply partition your large ZIP into 100 "small" ZIP files? For simplicity, create a new ZIP file every minute and put a timestamp in its name.
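A minimal sketch of that partitioning, where the staging folder and the file-name pattern are placeholders of my own choosing; each incoming file is added to whichever archive corresponds to the current minute:

using System;
using System.IO;
using Ionic.Zip;

static void AddToCurrentPartition(string stagingFolder, string incomingFile)
{
    // One archive per minute, e.g. batch_20150513_1842.zip
    string zipPath = Path.Combine(stagingFolder,
        "batch_" + DateTime.UtcNow.ToString("yyyyMMdd_HHmm") + ".zip");

    using (var zip = File.Exists(zipPath) ? ZipFile.Read(zipPath) : new ZipFile(zipPath))
    {
        zip.CompressionLevel = Ionic.Zlib.CompressionLevel.None;
        zip.AddFile(incomingFile);
        zip.Save(zipPath);
    }
}

Because each archive only ever holds about a minute's worth of files, the cost of re-reading and re-saving the central directory stays small.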

Vlad Feinstein
-1

You can zip all the files in parallel using the .NET TPL (Task Parallel Library) like this (CompressStreamP and the surrounding variables come from the library linked below):

while (0 != (read = sourceStream.Read(bufferRead, 0, sliceBytes)))
{
    tasks[taskCounter] = Task.Factory.StartNew(() =>
        CompressStreamP(bufferRead, read, taskCounter, ref listOfMemStream, eventSignal)); // Line 1
    eventSignal.WaitOne(-1);           // Line 2
    taskCounter++;                     // Line 3
    bufferRead = new byte[sliceBytes]; // Line 4
}

Task.WaitAll(tasks);                  // Line 6

There is a compiled library and source code here:

http://www.codeproject.com/Articles/49264/Parallel-fast-compression-unleashing-the-power-of

Oscar
  • Thanks for your answer, but this library doesn't seem to zip folders, only files? – Anas May 13 '15 at 20:56
  • -1 TPL is almost never the solution for speeding things up. In this case it is not, since the issue is poorly optimised I/O, similar to string concat thrashing. – Aron Aug 25 '15 at 07:43