
I need help converting a VERY LARGE binary file (a ZIP file) to a Base64String and back again. The files are too large to be loaded into memory all at once (they throw OutOfMemoryExceptions); otherwise this would be a simple task. I do not want to process the contents of the ZIP file individually; I want to process the entire ZIP file.

The problem:

I can convert the entire ZIP file (test sizes vary from 1 MB to 800 MB at present) to a Base64String, but when I convert it back, it is corrupted. The new ZIP file is the correct size, it is recognized as a ZIP file by Windows and WinRAR/7-Zip, etc., and I can even look inside the ZIP file and see the contents with the correct sizes/properties, but when I attempt to extract from it, I get: "Error: 0x80004005", which is a general error code.

I am not sure where or why the corruption is happening. I have done some investigating, and I have noticed the following:

If you have a large text file, you can convert it to Base64String incrementally without issue. If calling Convert.ToBase64String on the entire file yielded: "abcdefghijklmnopqrstuvwx", then calling it on the file in two pieces would yield: "abcdefghijkl" and "mnopqrstuvwx".

Unfortunately, if the file is binary, the result is different. While the entire file might yield: "abcdefghijklmnopqrstuvwx", trying to process it in two pieces would yield something like: "oiweh87yakgb" and "kyckshfguywp".

Is there a way to incrementally Base64-encode a binary file while avoiding this corruption?

My code:

        private void ConvertLargeFile()
        {
            FileStream inputStream = new FileStream("C:\\Users\\test\\Desktop\\my.zip", FileMode.Open, FileAccess.Read);
            byte[] buffer = new byte[MultipleOfThree];
            int bytesRead = inputStream.Read(buffer, 0, buffer.Length);
            while(bytesRead > 0)
            {
               // Hold on to the chunk just read, then read ahead so the
               // final chunk can be detected (the next read returns 0).
               byte[] secondaryBuffer = new byte[buffer.Length];
               int secondaryBufferBytesRead = bytesRead;
               Array.Copy(buffer, secondaryBuffer, buffer.Length);
               bool isFinalChunk = false;
               Array.Clear(buffer, 0, buffer.Length);
               bytesRead = inputStream.Read(buffer, 0, buffer.Length);
               if(bytesRead == 0)
               {
                  // Resize the final chunk down to the bytes actually read.
                  isFinalChunk = true;
                  buffer = new byte[secondaryBufferBytesRead];
                  Array.Copy(secondaryBuffer, buffer, buffer.Length);
               }

               String base64String = Convert.ToBase64String(isFinalChunk ? buffer : secondaryBuffer);
               File.AppendAllText("C:\\Users\\test\\Desktop\\Base64Zip", base64String);
            }
            inputStream.Dispose();
        }

The decoding is more of the same. I use the size of the base64String variable above (which varies depending on the original buffer size that I test with) as the buffer size for decoding. Then, instead of Convert.ToBase64String(), I call Convert.FromBase64String() and write to a different file name/path.
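
For reference, a minimal sketch of that decode loop (the restored.zip name and the chunk size are placeholders; the character count per read must match the Base64 chunk length produced during encoding, which is always a multiple of 4):

    private void ConvertLargeFileBack()
    {
        // Placeholder: must equal the length of each Base64 chunk written
        // during encoding (e.g. a 300,000-byte chunk encodes to 400,000 chars).
        const int chunkChars = 400000;

        using (var reader = new StreamReader("C:\\Users\\test\\Desktop\\Base64Zip"))
        using (var output = new FileStream("C:\\Users\\test\\Desktop\\restored.zip", FileMode.Create))
        {
            char[] chars = new char[chunkChars];
            int charsRead;
            // ReadBlock keeps reading until the buffer is full or the stream
            // ends, so chunk boundaries stay aligned with the encoded chunks.
            while ((charsRead = reader.ReadBlock(chars, 0, chars.Length)) > 0)
            {
                byte[] bytes = Convert.FromBase64String(new string(chars, 0, charsRead));
                output.Write(bytes, 0, bytes.Length);
            }
        }
    }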

EDIT:

In my haste to reduce the code (I refactored it into a new project, separate from other processing, to eliminate code that isn't central to the issue), I introduced a bug. The Base64 conversion should be performed on secondaryBuffer for all iterations save the last (identified by isFinalChunk), when buffer should be used. I have corrected the code above.

EDIT #2:

Thank you all for your comments/feedback. After correcting the bug (see the above edit), I re-tested my code, and it is actually working now. I intend to test and implement @rene's solution as it appears to be the best, but I thought that I should let everyone know of my discovery as well.

CaptainCobol
  • What are you doing with the secondary buffer and `isFinalChunk`? It looks like you're calling `ToBase64String` on a cleared buffer unless it's the final chunk. – Blorgbeard Sep 21 '15 at 19:42
  • The problem may be in the code that converts the file back from Base64 to binary. Do you read characters in chunks of four, i.e., a multiple of four? – Vova Sep 21 '15 at 19:44
  • @Blorgbeard - I am using the secondaryBuffer to hold the contents of the first/current read from the file. Then I read again, looking for a return of 0 to indicate that I am processing the final chunk. The final chunk is resized so that it is only large enough to hold the data that is being encoded. E.g., if the buffer was set at 600,000 but the last read is 1000 bytes long, there is no need to pass a byte[] containing 600,000 elements. If I am not on the final chunk, then I process `secondaryBuffer` instead, which contains the required data. – CaptainCobol Sep 21 '15 at 20:34
  • @Vova - I use the size of the chunks that were created during encoding. If the block size was 262,144, it would yield Base64Strings that were 349,528 characters long, so I would use 349,528 as the buffer size when decoding. – CaptainCobol Sep 21 '15 at 20:36
  • @CaptainCobol there's no overhead to passing a large array; it's still just a reference you're passing. You can pass index and offset as per my answer to avoid reprocessing old data, and then you can eliminate the secondary buffer. – Blorgbeard Sep 21 '15 at 20:50
  • Light bulb! Converting to Base64 in blocks is sensitive to the size of the block. Each Base64 character encodes 6 bits, so a block only ends on a clean character boundary when its size is a multiple of 3 bytes (24 bits). Otherwise the last character of the block doesn't encode all the data; it won't include the final few bits that spill into the next block, and the padding adds extra characters per block. TL;DR: a block size that is not a multiple of 3 bytes _will_ result in corrupted data; the data must be encoded as one continuous stream unless the block size is a multiple of 3 bytes (see the sketch below). – Suncat2000 Aug 02 '19 at 12:56
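
To illustrate the block-size point from the last comment, a quick self-contained check (the byte values are arbitrary): splitting at a multiple of 3 bytes yields pieces that concatenate to exactly the whole-buffer encoding, while any other split introduces padding mid-stream:

    byte[] data = new byte[12];
    new Random(42).NextBytes(data); // arbitrary sample bytes

    string whole = Convert.ToBase64String(data);

    // Split at a multiple of 3 bytes: the two pieces concatenate to the whole.
    string a = Convert.ToBase64String(data, 0, 6);
    string b = Convert.ToBase64String(data, 6, 6);
    Console.WriteLine(a + b == whole); // True

    // Split at a non-multiple of 3: the first piece gets '=' padding,
    // so the concatenation is no longer the same (or even valid) Base64.
    string c = Convert.ToBase64String(data, 0, 5);
    string d = Convert.ToBase64String(data, 5, 7);
    Console.WriteLine(c + d == whole); // False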

3 Answers

15

Based on the code shown in the blog by Wiktor Zychla, the following code works. The same solution is indicated in the Remarks section of Convert.ToBase64String, as pointed out by Ivan Stoev.

// using System.Security.Cryptography

private void ConvertLargeFile()
{
    // encode
    var filein = @"C:\Users\test\Desktop\my.zip";
    var fileout = @"C:\Users\test\Desktop\Base64Zip";
    using (FileStream fs = File.Open(fileout, FileMode.Create))
    using (var cs = new CryptoStream(fs, new ToBase64Transform(),
                                     CryptoStreamMode.Write))
    using (var fi = File.Open(filein, FileMode.Open))
    {
        fi.CopyTo(cs);
    }
    // the zip file is now stored in base64zip

    // and decode
    using (FileStream f64 = File.Open(fileout, FileMode.Open))
    using (var cs = new CryptoStream(f64, new FromBase64Transform(),
                                     CryptoStreamMode.Read))
    using (var fo = File.Open(filein + ".orig", FileMode.Create))
    {
        cs.CopyTo(fo);
    }
    // the original file is in my.zip.orig
    // use the command-line tool
    //   fc my.zip my.zip.orig
    // to verify that the start file and the encoded-and-decoded file
    // are the same
}

The code uses standard classes from the System.Security.Cryptography namespace: a CryptoStream combined with ToBase64Transform and its counterpart FromBase64Transform.
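
If you would rather verify the round trip in code than with fc, here is a minimal sketch reusing the filein path from the answer (SequenceEqual requires System.Linq):

    // using System.Linq
    using (var sha = SHA256.Create())
    using (var original = File.OpenRead(filein))
    using (var roundTripped = File.OpenRead(filein + ".orig"))
    {
        // ComputeHash resets the algorithm between calls, so one instance suffices.
        byte[] h1 = sha.ComputeHash(original);
        byte[] h2 = sha.ComputeHash(roundTripped);
        Console.WriteLine(h1.SequenceEqual(h2) ? "files match" : "files differ");
    }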

rene
  • This is indeed the right answer! The MSDN documentation for the `Convert.ToBase64String` method (https://msdn.microsoft.com/en-us/library/s70ad5f6(v=vs.100).aspx) contains an **Important** notice in the **Remarks** section recommending just that. – Ivan Stoev Sep 21 '15 at 20:39
10

You can avoid using a secondary buffer by passing offset and length to Convert.ToBase64String, like this:

private void ConvertLargeFile()
{
    using (var inputStream  = new FileStream("C:\\Users\\test\\Desktop\\my.zip", FileMode.Open, FileAccess.Read)) 
    {
        byte[] buffer = new byte[MultipleOfThree];
        int bytesRead = inputStream.Read(buffer, 0, buffer.Length);
        while(bytesRead > 0)
        {
            String base64String = Convert.ToBase64String(buffer, 0, bytesRead);
            File.AppendAllText("C:\\Users\\test\\Desktop\\Base64Zip", base64String); 
            bytesRead = inputStream.Read(buffer, 0, buffer.Length);           
        }
    }
}

The above should work, but I think Rene's answer is actually the better solution.
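
One minor note, not part of the original answer: File.AppendAllText opens and closes the output file on every iteration. A sketch of the same loop keeping a single writer open instead (same placeholder paths and MultipleOfThree constant as the question):

    using (var inputStream = new FileStream("C:\\Users\\test\\Desktop\\my.zip", FileMode.Open, FileAccess.Read))
    using (var writer = new StreamWriter("C:\\Users\\test\\Desktop\\Base64Zip", append: false))
    {
        byte[] buffer = new byte[MultipleOfThree];
        int bytesRead;
        while ((bytesRead = inputStream.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Encode only the bytes actually read and write them
            // without reopening the output file each time.
            writer.Write(Convert.ToBase64String(buffer, 0, bytesRead));
        }
    }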

Blorgbeard
  • Does `stream.Read` clear the input buffer prior to reading? If you're requesting 3 and it reads 2, does the last byte hold an old value? – Dave Zych Sep 21 '15 at 19:55
  • @DaveZych: It doesn't clear it, but that doesn't matter, since you pass `offset` and `length` to the `Convert.ToBase64String` method. – Ivan Stoev Sep 21 '15 at 20:08
  • @DaveZych, you are correct, and that is why I clear the buffer before every new read. – CaptainCobol Sep 21 '15 at 20:37
  • @CaptainCobol Ivan is also correct; this code avoids passing old data by telling `Convert.ToBase64String` to only process the bytes that were read this iteration. – Blorgbeard Sep 21 '15 at 20:52
  • @Blorgbeard - Yes, I agree. This is a much cleaner version. I originally overlooked the change in the Convert parameters. – CaptainCobol Sep 21 '15 at 20:58
1

Use this code:

public void ConvertLargeFile(string source, string destination)
{
    using (FileStream inputStream = new FileStream(source, FileMode.Open, FileAccess.Read))
    {
        int buffer_size = 30000; // or any multiple of 3

        byte[] buffer = new byte[buffer_size];
        int bytesRead = inputStream.Read(buffer, 0, buffer.Length);
        while (bytesRead > 0)
        {
            byte[] buffer2 = buffer;

            // The final chunk may be shorter than the buffer, so copy
            // only the bytes that were actually read before encoding.
            if (bytesRead < buffer_size)
            {
                buffer2 = new byte[bytesRead];
                Buffer.BlockCopy(buffer, 0, buffer2, 0, bytesRead);
            }

            string base64String = System.Convert.ToBase64String(buffer2);
            File.AppendAllText(destination, base64String);

            bytesRead = inputStream.Read(buffer, 0, buffer.Length);
        }
    }
}
Yacoub Massad
  • Buffer.BlockCopy is not safe in this scenario. I was using it originally, but I found that my copy arrays were half filled with nulls. See: http://stackoverflow.com/a/1390023/4659717 – CaptainCobol Sep 21 '15 at 20:44
  • Why is it not safe? I am sure it is. Anyway, I noticed the answer by Blorgbeard is actually better; it does the same as mine, except that instead of Buffer.BlockCopy it uses another overload of the ToBase64String method. – Yacoub Massad Sep 21 '15 at 20:47
  • If you follow the link that I provided, MusiGenesis explains it well. My arrays were half-filled with content and half-filled with nulls. Buffer.BlockCopy parameters are byte-based, rather than index-based. – CaptainCobol Sep 21 '15 at 20:51
  • After viewing the link, I am guessing it is talking about general cases where the programmer tries to copy complex element types (structs, for example). If your source and destination are byte[], then it is perfectly safe to use Buffer.BlockCopy. In a byte[], 1 index = 1 byte. In structs, 1 index might be > 1 byte. – Yacoub Massad Sep 21 '15 at 20:57
  • If you take my code and replace calls to `Array.Copy()` with `Buffer.BlockCopy()`, you should see what I mean. – CaptainCobol Sep 21 '15 at 21:07
  • But your code (in the question) already has an issue; if I use Buffer.BlockCopy in it, I will not see any difference. – Yacoub Massad Sep 21 '15 at 21:12
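
For what it's worth, a small sketch of the point being debated in these comments: for byte arrays, Buffer.BlockCopy's byte-based offsets coincide with element indices, so it behaves exactly like Array.Copy:

    byte[] src = { 1, 2, 3, 4, 5, 6 };
    byte[] viaArrayCopy = new byte[3];
    byte[] viaBlockCopy = new byte[3];

    Array.Copy(src, 0, viaArrayCopy, 0, 3);       // counts elements
    Buffer.BlockCopy(src, 0, viaBlockCopy, 0, 3); // counts bytes; the same thing for byte[]

    // Both targets now contain { 1, 2, 3 }. The byte-vs-element distinction
    // only matters for wider element types, e.g. int[], where a BlockCopy
    // count of 3 would copy 3 bytes rather than 3 ints.
    Console.WriteLine(viaArrayCopy.SequenceEqual(viaBlockCopy)); // True (using System.Linq)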