
We have a requirement to extract large .zip files (around 3-4 GB each) from one Blob Container into another Blob Container; the extracted files are JSON files (around 35-50 GB each).

For the implementation we referred to the code from this link: https://msdevzone.wordpress.com/2017/07/07/extract-a-zip-file-stored-in-azure-blob/ and were able to extract smaller files (a 40 MB zip unzipping to 400 MB) in a few minutes, but with a 2 GB zip extracting to 30 GB of JSON files the process gets stuck for more than an hour.

Could anyone suggest a better solution they have come across for this scenario that does not use file operations?

Below is the code we have been working from:

CloudBlockBlob blockBlob = container.GetBlockBlobReference(filename);
BlobRequestOptions options = new BlobRequestOptions();
options.ServerTimeout = new TimeSpan(0, 20, 0);

// Save blob(zip file) contents to a Memory Stream.
using (MemoryStream zipBlobFileStream = new MemoryStream())
{
    blockBlob.DownloadToStream(zipBlobFileStream, null, options);
    zipBlobFileStream.Flush();
    zipBlobFileStream.Position = 0;
    //use ZipArchive from System.IO.Compression to extract all the files from zip file
    using (ZipArchive zip = new ZipArchive(zipBlobFileStream, ZipArchiveMode.Read, true))
    {
        //Each entry here represents an individual file or a folder
        foreach (var entry in zip.Entries)
        {
            //create a block blob for this entry, with the same name as the file inside the zip
            var blob = extractcontainer.GetBlockBlobReference(entry.FullName);
            using (var stream = entry.Open())
            {
                //check for file or folder and update the above blob reference with actual content from stream
                if (entry.Length > 0)
                    blob.UploadFromStream(stream);
            }
        }
    }
}
sai kumar

3 Answers


Using an Azure Storage file share, this is the only way it worked for me without loading the entire ZIP into memory. I tested with a 3 GB ZIP file (containing thousands of files, or one big file) and memory/CPU usage stayed low and stable. Maybe you can adapt it to block blobs. I hope it helps!

// list the .zip files in the Azure Files directory
var zipFiles = _directory.ListFilesAndDirectories()
    .OfType<CloudFile>()
    .Where(x => x.Name.ToLower().Contains(".zip"))
    .ToList();

foreach (var zipFile in zipFiles)
{
    using (var zipArchive = new ZipArchive(zipFile.OpenRead()))
    {
        foreach (var entry in zipArchive.Entries)
        {
            if (entry.Length > 0)
            {
                CloudFile extractedFile = _directory.GetFileReference(entry.Name);

                using (var entryStream = entry.Open())
                {
                    // copy the entry in 16 KB chunks so the whole file is never buffered in memory
                    byte[] buffer = new byte[16 * 1024];
                    using (var ms = extractedFile.OpenWrite(entry.Length))
                    {
                        int read;
                        while ((read = entryStream.Read(buffer, 0, buffer.Length)) > 0)
                        {
                            ms.Write(buffer, 0, read);
                        }
                    }
                }
            }
        }
    }               
}
rGiosa

The approach you referenced won't work because it uses a memory stream; the following line will cause an out-of-memory error, as it loads all the data into memory.

blob.DownloadToStream(memoryStream);

To resolve this, I followed the instructions in this blog post. The only change I made to the code was adding await to this line:

await blockBlob.UploadFromStreamAsync(fileStream);
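
In case the link becomes unavailable, here is a rough sketch of that streaming approach. It is an approximation rather than the blog post's exact code, and it reuses the container, extractcontainer and filename references from the question's code; the key point is opening the zip blob as a read stream (OpenReadAsync) instead of downloading it into a MemoryStream, so neither the zip nor the extracted JSON is ever held fully in memory:

CloudBlockBlob zipBlob = container.GetBlockBlobReference(filename);

// OpenReadAsync returns a seekable stream over the blob, so ZipArchive can
// read the central directory and entries via range reads instead of a full download.
using (Stream zipStream = await zipBlob.OpenReadAsync())
using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Read))
{
    foreach (ZipArchiveEntry entry in archive.Entries)
    {
        if (entry.Length == 0)
            continue; // folder entries have no content

        CloudBlockBlob targetBlob = extractcontainer.GetBlockBlobReference(entry.FullName);
        using (Stream entryStream = entry.Open())
        {
            // reads the decompressed entry in chunks and uploads it as blocks
            await targetBlob.UploadFromStreamAsync(entryStream);
        }
    }
}

For 30 GB outputs you may also want to pass BlobRequestOptions with a longer ServerTimeout, as in the question's code.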

Hope this helps.

leftL
  • The link was very useful, but won't myBlob.DownloadToStreamAsync(blobMemStream) download the entire file into memory? – Tarostar Nov 29 '21 at 09:58

If you need to unzip a large number of files that are sitting in Azure Storage then one option is to use Azure Batch.

Azure Batch enables you to run large-scale parallel and high performance computing (HPC) applications efficiently in the cloud.

It will manage the compute cluster for you; all you have to worry about is writing your logic and submitting it to the Batch service for execution across the nodes.

You could download the blob as a stream, use the ZipArchive class to extract it, and then upload each entry to the output container.

using (Stream memoryStream = new MemoryStream())
{
    blob.DownloadToStream(memoryStream);
    memoryStream.Position = 0; //Reset the stream

    ZipArchive archive = new ZipArchive(memoryStream);
    Console.WriteLine("Extracting {0} which contains {1} files", blobName, archive.Entries.Count);
    foreach (ZipArchiveEntry entry in archive.Entries)
    {
        CloudBlockBlob blockBlob = outputContainer.GetBlockBlobReference(entry.Name);

        blockBlob.UploadFromStream(entry.Open());
        Console.WriteLine("Uploaded {0}", entry.Name);
    }
}

For more detailed code, you could refer to this sample.

Joey Cai
  • I think you got it wrong; the question is about how to process a large file rather than a large number of files. – Thomas Mar 09 '18 at 07:53