I have existing code that has been used for years to upload an XML and TIF file pair via an HttpWebRequest POST request. Problem is, on large TIF files it chews through memory like a flock of beavers attacking a forest. I started digging into the code today in an attempt to make it more memory-efficient.
The existing code loads the XML and TIF content into a string, which is then converted into a byte array and fed into the HTTP request, with many string concatenations along the way. The TIF file is loaded and converted to a string like this, where br2 is a BinaryReader:
System.Text.Encoding.Default.GetString(br2.ReadBytes(tifByteCount))
I now know that using Encoding.Default is not wise, but changing that will require working with the client to change their decoding of the file submissions, so that is for another time. I will likely change to base64 encoding when I make that change. Anyway...
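When I do make that switch, something like the sketch below is what I have in mind. It's only a sketch: requestStream stands in for the HttpWebRequest's request stream, and the buffer size is arbitrary. ToBase64Transform encodes incrementally, so the whole file never has to sit in memory at once.

using System.IO;
using System.Security.Cryptography;

// Sketch only: stream a file into an already-open request stream as Base64.
// Reading in small buffers keeps memory use flat regardless of file size.
static void WriteFileAsBase64(string path, Stream requestStream)
{
    using (FileStream fs = File.OpenRead(path))
    using (CryptoStream b64 = new CryptoStream(
        requestStream, new ToBase64Transform(), CryptoStreamMode.Write))
    {
        byte[] buffer = new byte[8192];
        int bytesRead;
        while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            b64.Write(buffer, 0, bytesRead);
        }
        b64.FlushFinalBlock(); // writes the Base64 padding for the final block
    }
}

(Note that disposing the CryptoStream also closes the underlying stream, which is fine when the file is the last thing written to the request.)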
The first thing I changed was all of my string concatenations, because I figured those were bogging things down, especially when working with the TIF string. I'm now using a StringBuilder and appending everything to it.
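A related tweak I'm considering, shown as a sketch below (xmlByteCount is a hypothetical counterpart to tifByteCount): presizing the StringBuilder to roughly the final request length, so it doesn't go through repeated grow-and-copy cycles as its internal buffer doubles.

using System.Text;

// Sketch: presize the builder to roughly the final request length so that
// appending ~185 MB of decoded characters doesn't force repeated
// reallocate-and-copy cycles as the internal buffer doubles.
int estimatedLength = xmlByteCount + tifByteCount + 1024; // rough body size
StringBuilder sbRequest = new StringBuilder(estimatedLength);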
I then searched for "byte array to string conversion" and tried several different results that I found, including this one and this one, but both used a different encoding than my existing code.
I then used a Decoder obtained from System.Text.Encoding.Default.GetDecoder() to decode the entire TIF file into a char[] array in one go. That didn't improve memory usage at all, but it did at least keep the same encoding.
The file I've been testing with today is a 185 MB TIF file. While testing on my dev machine, Windows physical memory usage would start at about 2 GB, quickly climb past 5 GB, max out at 5.99 GB, and promptly lock up until the debugger killed itself. As far as I could tell I was only loading a single instance of the TIF file into memory, so I couldn't understand why a 185 MB file was using up 4 GB of memory.
Next I tried loading the TIF file in much smaller chunks, 1000 bytes at a time. This looked promising initially: it used only 2 GB of memory while loading all but the last <1000 bytes of the file. On the final chunk (in this case 928 bytes), though, this line

charCount = dc.GetCharCount(ba2, x, (int)fileStream2.Length - x);

caused memory to momentarily spike by 1 GB, the following line

chars2 = new Char[(int)fileStream2.Length - x];

increased it by another 700 MB, and the next line

charsDecodedCount = dc.GetChars(ba2, x, (int)fileStream2.Length - x, chars2, 0);

pushed memory to the max and locked up the system.
The code below shows the last approach tried - the one described in the previous paragraph.
// dc is the Decoder for the same Encoding.Default used elsewhere;
// sbRequest is the StringBuilder that accumulates the request body.
Decoder dc = System.Text.Encoding.Default.GetDecoder();
int charCount;
int charsDecodedCount;

BinaryReader br2 = new BinaryReader(fileStream2);
// The entire file is still read into one byte array here;
// only the decoding below is done in chunks.
byte[] ba2 = br2.ReadBytes((int)fileStream2.Length);
Char[] chars2 = null;

if ((int)fileStream2.Length > 1000)
{
    // Decode the byte array 1000 bytes at a time.
    for (int x = 0; x < (int)fileStream2.Length; x += 1000)
    {
        if (x + 1000 > (int)fileStream2.Length)
        {
            // Final partial chunk: fewer than 1000 bytes remain.
            charCount = dc.GetCharCount(ba2, x, (int)fileStream2.Length - x);
            chars2 = new Char[(int)fileStream2.Length - x];
            charsDecodedCount = dc.GetChars(ba2, x, (int)fileStream2.Length - x, chars2, 0);
        }
        else
        {
            // Full 1000-byte chunk.
            charCount = dc.GetCharCount(ba2, x, 1000);
            chars2 = new Char[charCount];
            charsDecodedCount = dc.GetChars(ba2, x, 1000, chars2, 0);
        }
        sbRequest.Append(chars2);
        chars2 = null;
    }
}
else
{
    // Small file: decode everything in one call.
    charCount = dc.GetCharCount(ba2, 0, ba2.Length);
    chars2 = new Char[charCount];
    charsDecodedCount = dc.GetChars(ba2, 0, ba2.Length, chars2, 0);
    sbRequest.Append(chars2);
}
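For comparison, here is a sketch of what I think truly chunked reading would look like, since the code above still pulls the entire file into ba2 before decoding. The variable names and buffer size here are my own, not from the production code:

// Sketch: read AND decode 1000 bytes at a time, so only one small buffer
// and its decoded chars are alive at once. The StringBuilder still ends up
// holding the full decoded text, though.
Decoder dec = System.Text.Encoding.Default.GetDecoder();
byte[] buffer = new byte[1000];
int bytesRead;
while ((bytesRead = fileStream2.Read(buffer, 0, buffer.Length)) > 0)
{
    int count = dec.GetCharCount(buffer, 0, bytesRead);
    char[] decoded = new char[count];
    dec.GetChars(buffer, 0, bytesRead, decoded, 0);
    sbRequest.Append(decoded);
}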
I have a feeling I'm missing something fairly obvious, and I'd appreciate any advice on resolving this. I'd like to be able to load a 185 MB TIF file without using 4 GB of memory!