
I have a large list of objects that I need to store and retrieve later. The list will always be used as a unit and list items are not retrieved individually. The list contains about 7000 items totaling about 1GB, but could easily escalate to ten times that or more.

We have been using BinaryFormatter.Serialize() to do the serialization (System.Runtime.Serialization.Formatters.Binary.BinaryFormatter). The serialized data was then uploaded as a blob to Azure blob storage. We found it to be generally fast and efficient, but it became inadequate as we tested with larger data sizes, throwing an OutOfMemoryException. From what I understand, although I'm using a stream, the problem is that the BinaryFormatter.Serialize() method must first serialize everything to memory before the blob can be uploaded, causing my exception.

The binary serializer looks as follows:

public void Upload(object value, string blobName, bool replaceExisting)
{
    CloudBlockBlob blockBlob = BlobContainer.GetBlockBlobReference(blobName);
    var formatter = new BinaryFormatter()
    {
        AssemblyFormat = FormatterAssemblyStyle.Simple,
        FilterLevel = TypeFilterLevel.Low,
        TypeFormat = FormatterTypeStyle.TypesAlways
    };

    using (var stream = blockBlob.OpenWrite())
    {
        formatter.Serialize(stream, value);
    }
}

The OutOfMemoryException occurs on the formatter.Serialize(stream, value) line.

I therefore tried using a different protocol, Protocol Buffers. I tried both the implementations in the NuGet packages protobuf-net and Google.Protobuf, but serialization was horribly slow (roughly 30 minutes) and, from what I have read, Protobuf is not optimized for serializing data larger than 1MB. So I went back to the drawing board and came across Cap'n Proto, which promises to solve my speed issues by using memory mapping. I am trying to use @marc-gravell's C# bindings, but I am having some difficulty implementing a serializer, as the project does not have thorough documentation yet. Moreover, I'm not 100% sure that Cap'n Proto is the correct choice of protocol, but I am struggling to find any alternative suggestions online.

How can I serialize a very large collection of items to blob storage, without hitting memory issues, and in a reasonably fast way?

08Dc91wk
  • Upload batches to more than one blob rather than serializing everything at once? – spender Apr 07 '16 at 13:21
  • Thanks, that is an option I'm considering. Each list is already a chunk in our domain though so the blobs would lose context somewhat and it would complicate matters a bit. Good suggestion though, I will give it a shot if there aren't any other protocol or method suggestions. – 08Dc91wk Apr 07 '16 at 13:31

2 Answers


Perhaps you should switch to JSON?

Using a JSON serializer (the examples here use Json.NET), you can stream to and from files and serialize/deserialize piecemeal as the data is read.

Would your objects map to JSON well?

This is what I use to take a NetworkStream and put it into a JSON object.

    private static async Task<JObject> ProcessJsonResponse(HttpResponseMessage response)
    {
        // Open the stream from the network
        using (var s = await ProcessResponseStream(response).ConfigureAwait(false))
        {
            using (var sr = new StreamReader(s))
            {
                using (var reader = new JsonTextReader(sr))
                {
                    var serializer = new JsonSerializer {DateParseHandling = DateParseHandling.None};

                    return serializer.Deserialize<JObject>(reader);
                }
            }
        }
    }

Additionally, you could GZip the stream to reduce the file transfer times. We stream directly to GZipped JSON and back again.

Edit: although this example shows deserialization, the same approach should work for serialization.
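
For completeness, here is a rough sketch of the serialization direction, combined with the GZip idea above. It assumes Json.NET and the same CloudBlockBlob/BlobContainer setup as in the question; the UploadAsJson name is just a placeholder. Because each item is written to the blob stream as it is enumerated, the complete JSON string never has to be held in memory:

    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;
    using Microsoft.WindowsAzure.Storage.Blob; // Azure storage client, as in the question
    using Newtonsoft.Json;

    // Streams the list to the blob one item at a time, GZip-compressing on the fly.
    public void UploadAsJson<T>(IEnumerable<T> items, string blobName)
    {
        CloudBlockBlob blockBlob = BlobContainer.GetBlockBlobReference(blobName);
        var serializer = new JsonSerializer { DateParseHandling = DateParseHandling.None };

        using (var blobStream = blockBlob.OpenWrite())
        using (var gzip = new GZipStream(blobStream, CompressionLevel.Optimal)) // optional compression
        using (var sw = new StreamWriter(gzip))
        using (var writer = new JsonTextWriter(sw))
        {
            writer.WriteStartArray();
            foreach (var item in items)
            {
                serializer.Serialize(writer, item); // each item is written as it is enumerated
            }
            writer.WriteEndArray();
        }
    }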

James Woodall

JSON serialization can work, as the previous poster mentioned, although on a large enough list this was also causing OutOfMemoryException to be thrown, because the string was simply too big to fit in memory. You might be able to get around this by serializing in pieces if your object is a list, but if you're okay with binary serialization, a much faster and lower-memory option is Protobuf serialization.

Protobuf serializes faster than JSON and has a smaller memory footprint, at the cost of not being human readable. Protobuf-net is a great C# implementation of it. Here is a way to set it up with annotations and here is a way to set it up at runtime. In some instances, you can even GZip the Protobuf-serialized bytes and save even more space.
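
As an illustration of the annotation approach, here is a minimal sketch that writes the protobuf payload straight to the blob stream, so it is never buffered as one large byte array. It assumes protobuf-net and the CloudBlockBlob/BlobContainer setup from the question; the Item type and UploadAsProtobuf name are made-up placeholders:

    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;
    using Microsoft.WindowsAzure.Storage.Blob; // Azure storage client, as in the question
    using ProtoBuf;

    // Example contract type; replace with your own list item type.
    [ProtoContract]
    public class Item
    {
        [ProtoMember(1)] public int Id { get; set; }
        [ProtoMember(2)] public string Name { get; set; }
    }

    // Writes the protobuf payload directly to the blob stream, optionally GZipped.
    public void UploadAsProtobuf(List<Item> items, string blobName)
    {
        CloudBlockBlob blockBlob = BlobContainer.GetBlockBlobReference(blobName);

        using (var blobStream = blockBlob.OpenWrite())
        using (var gzip = new GZipStream(blobStream, CompressionLevel.Optimal)) // optional compression
        {
            Serializer.Serialize(gzip, items); // protobuf-net serializes List<T> as a repeated field
        }
    }

If you can't (or don't want to) annotate your types, the runtime configuration mentioned above achieves the same field mapping without attributes.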

Norrec