
I have one big .bak file, roughly 12 GB. I need to split it into multiple 2 GB .gz archives in code.

And the big problem is that I need to validate these archives later.

You know, like when you split one file with WinRAR into 3 or 4 archives: you just press "unpack" and it unpacks them all back into one file, or fails if one of the archives is missing (e.g. you deleted it).

I need something like this.

public void Compress(DirectoryInfo directorySelected)
{
    // Requires: using System.IO; using System.IO.Compression;
    const int chunkSize = 2_000_000_000; // ~2 GB of raw data per piece

    foreach (FileInfo fileToCompress in directorySelected.GetFiles())
    {
        // Skip hidden files and files that are already .gz pieces.
        if ((File.GetAttributes(fileToCompress.FullName) & FileAttributes.Hidden) == FileAttributes.Hidden
            || fileToCompress.Extension == ".gz")
        {
            continue;
        }

        using (FileStream originalFileStream = fileToCompress.OpenRead())
        {
            byte[] buffer = new byte[chunkSize];
            int counter = 0;
            int bytesRead;

            // Read the source in chunks and compress each chunk into its own .gz file.
            while ((bytesRead = originalFileStream.Read(buffer, 0, buffer.Length)) > 0)
            {
                using (FileStream compressedFileStream = File.Create(fileToCompress.FullName + counter + ".gz"))
                using (GZipStream compressionStream = new GZipStream(compressedFileStream,
                    CompressionMode.Compress))
                {
                    compressionStream.Write(buffer, 0, bytesRead);
                }
                counter++;
            }
        }
    }
}

It works well, but I don't know how to validate the number of pieces.

On my test data I get 7 archives. But how do I read them back into one file and validate that this file is complete?

Rumata
  • _it's crash when i come to the end of the archive_ - What crashed? Did you get an Exception error of any kind? What were the details of the error? Please click [edit] on your question and add in those details for us to help. – gravity Jul 18 '19 at 19:23
  • Fixed the code: deleted `homMuchRead += 10000;` - it crashed with "Offset plus count is larger than the length of target array". – Rumata Jul 18 '19 at 19:32
  • I just can't understand how to validate these archives after compressing. I have 7 test archives, but how do I validate their count and read them back? – Rumata Jul 18 '19 at 19:35
  • For how to read the chunks back, please [check this link](https://stackoverflow.com/questions/14524909/combine-multiple-files-into-single-file/14530122#14530122) – Power Mouse Jul 18 '19 at 19:40
  • Why not use 7-zip? It has a command line interface that you can invoke from C#. – RobV8R Jul 18 '19 at 23:18
  • Because it's awful to invoke the command line from .NET Core in 2019, I suppose ) – Rumata Jul 20 '19 at 09:28

1 Answer


The GZip format doesn’t natively support what you want.

Zip does (the feature is called “spanned archives”), but the ZipArchive class in .NET doesn’t support it. You’d need a third-party library for that, such as DotNetZip.
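
If DotNetZip is an option, a rough sketch of that route could look like the following. This assumes the Ionic.Zip package and its MaxOutputSegmentSize property, and the paths are invented for illustration; verify against the library’s documentation before relying on it:

using Ionic.Zip;

using (var zip = new ZipFile())
{
    zip.AddFile(@"C:\backup\db.bak", "");
    zip.MaxOutputSegmentSize = 2_000_000_000; // ~2 GB per segment: db.z01, db.z02, ..., db.zip
    zip.Save(@"C:\backup\db.zip");
}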

But there’s a workaround.

Create a class that inherits from the abstract Stream class. To the outside it pretends to be a single stream that can write but not read or seek; internally it writes to multiple pieces of 2 GB each, using the FileStream provided by .NET. Keep track of the total length written in a long field of your class. As soon as the next Write() call would exceed 2 GB, write just enough bytes to reach 2 GB, close and dispose the underlying FileStream, open another file with the next file name, reset the length counter to 0, and write the remaining bytes from the buffer passed to Write(). Repeat until the stream is closed.

Create an instance of your custom stream, pass it to the constructor of GZipStream, and copy the complete 12 GB of source data into the GZipStream.
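
For what it’s worth, a minimal sketch of that write-side stream and how it plugs into GZipStream could look like this. The class name SplitWriteStream, the piece-naming scheme and the paths are invented for illustration, and error handling is omitted:

using System;
using System.IO;
using System.IO.Compression;

public sealed class SplitWriteStream : Stream
{
    private readonly string _basePath;   // pieces become basePath.000, basePath.001, ...
    private readonly long _pieceSize;    // 2 GB in your scenario
    private FileStream _current;
    private int _pieceIndex;
    private long _writtenInPiece;

    public SplitWriteStream(string basePath, long pieceSize)
    {
        _basePath = basePath;
        _pieceSize = pieceSize;
        OpenNextPiece();
    }

    private void OpenNextPiece()
    {
        _current?.Dispose();
        _current = File.Create($"{_basePath}.{_pieceIndex++:D3}");
        _writtenInPiece = 0;
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        while (count > 0)
        {
            // Write only what still fits into the current piece,
            // then roll over to the next file and continue with the remainder.
            long room = _pieceSize - _writtenInPiece;
            if (room == 0)
            {
                OpenNextPiece();
                room = _pieceSize;
            }
            int toWrite = (int)Math.Min(room, count);
            _current.Write(buffer, offset, toWrite);
            _writtenInPiece += toWrite;
            offset += toWrite;
            count -= toWrite;
        }
    }

    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() => _current.Flush();
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) _current?.Dispose();
        base.Dispose(disposing);
    }
}

// Usage: compress the whole .bak through GZipStream into 2 GB pieces.
using (var source = File.OpenRead(@"C:\backup\db.bak"))
using (var split = new SplitWriteStream(@"C:\backup\db.bak.gz", 2L * 1024 * 1024 * 1024))
using (var gzip = new GZipStream(split, CompressionMode.Compress))
{
    source.CopyTo(gzip);
}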

If you do it right, the output files will be exactly 2 GB in size (except the last one).

To read and decompress them, implement a similar trick with another custom stream. Write a stream class that concatenates multiple files on the fly, pretending to be a single stream, but this time you only need to implement the Read() method. Give that concatenating stream to the framework’s GZipStream. If you reorder or destroy some of the pieces, there’s a very high (but not 100%) probability that GZipStream will fail to decompress, complaining about CRC checksums.
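
Again only as a sketch, with an invented class name (ConcatReadStream) and invented paths, the read side could look roughly like this; only Read() does real work:

using System;
using System.IO;
using System.IO.Compression;

public sealed class ConcatReadStream : Stream
{
    private readonly string[] _paths;  // piece files, in the correct order
    private FileStream _current;
    private int _index;

    public ConcatReadStream(params string[] paths)
    {
        _paths = paths;
        _current = File.OpenRead(_paths[0]);
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        while (true)
        {
            int read = _current.Read(buffer, offset, count);
            if (read > 0) return read;

            // Current piece is exhausted: move on to the next one, or signal end of stream.
            _index++;
            if (_index >= _paths.Length) return 0;
            _current.Dispose();
            _current = File.OpenRead(_paths[_index]);
        }
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) _current?.Dispose();
        base.Dispose(disposing);
    }
}

// Usage: decompress the pieces back into one file. A missing or reordered piece
// will normally surface as an InvalidDataException from GZipStream (CRC/format error).
string[] pieces = Directory.GetFiles(@"C:\backup", "db.bak.gz.*");
Array.Sort(pieces, StringComparer.Ordinal); // zero-padded suffixes sort correctly
using (var concat = new ConcatReadStream(pieces))
using (var gzip = new GZipStream(concat, CompressionMode.Decompress))
using (var restored = File.Create(@"C:\backup\db.restored.bak"))
{
    gzip.CopyTo(restored);
}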

P.S. To implement and debug the above two streams, I recommend using a much smaller dataset, e.g. 12 MB of data split into 1 MB compressed pieces. Once it works, increase the constant and test with the complete 12 GB of data.

Soonts
  • Thanks a lot for your answer, it’s really helpful. I handled this task with reading and writing in cycles and streams. DotNetZip - probably not, because the program must be cross-platform, but I’ll think about it. About decompression - no, GZipStream reads them as usual. CRC - I’ll think about that later; I think it would help if we wrote the CRC at the beginning of the file. – Rumata Jul 20 '19 at 09:22
  • @Rumata GZip format already includes CRC checksum, it will be written in the last piece: https://en.wikipedia.org/wiki/Gzip#File_format If it won’t match when you decompress, I think GZipStream should throw an exception telling that. – Soonts Jul 20 '19 at 17:01
  • @Rumata Also, for the reading stream, you can try this implementation: https://www.c-sharpcorner.com/article/combine-multiple-streams-in-a-single-net-framework-stream-o/ It's quite slow, however; it uses a linear search for every read. – Soonts Jul 20 '19 at 17:18