I have a large list of objects that I need to store and retrieve later. The list will always be used as a unit and list items are not retrieved individually. The list contains about 7000 items totaling about 1GB, but could easily escalate to ten times that or more.
We have been using BinaryFormatter.Serialize() (System.Runtime.Serialization.Formatters.Binary.BinaryFormatter) to do the serialization, and the serialized output was then uploaded as a blob to Azure Blob Storage. We found it to be generally fast and efficient, but it became inadequate once we tested it with larger payloads, throwing an OutOfMemoryException. From what I understand, although I'm using a stream, the problem is that the BinaryFormatter.Serialize() method must first serialize everything to memory before I can upload the blob, which causes the exception.
The binary serializer looks as follows:
    public void Upload(object value, string blobName, bool replaceExisting)
    {
        CloudBlockBlob blockBlob = BlobContainer.GetBlockBlobReference(blobName);
        var formatter = new BinaryFormatter()
        {
            AssemblyFormat = FormatterAssemblyStyle.Simple,
            FilterLevel = TypeFilterLevel.Low,
            TypeFormat = FormatterTypeStyle.TypesAlways
        };
        using (var stream = blockBlob.OpenWrite())
        {
            formatter.Serialize(stream, value);
        }
    }
The OutOfMemoryException occurs on the formatter.Serialize(stream, value) line.
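A minimal sketch of how the assumption above (that the formatter, rather than the blob stream, is what holds the data in memory) could be checked: serialize the same object to Stream.Null, which discards every byte written to it. If the OutOfMemoryException still occurs, the memory is being consumed inside the formatter; if it does not, the buffering is more likely happening in the blob write stream. The class and method names here are hypothetical and only for illustration; the formatter settings are copied from the Upload method.

```csharp
using System.IO;
using System.Runtime.Serialization.Formatters;
using System.Runtime.Serialization.Formatters.Binary;

// Hypothetical diagnostic helper, not part of the existing code base.
public static class SerializationProbe
{
    // Serializes 'value' with the same formatter settings as Upload(),
    // but writes to Stream.Null so no output is retained anywhere.
    public static void SerializeToNull(object value)
    {
        var formatter = new BinaryFormatter
        {
            AssemblyFormat = FormatterAssemblyStyle.Simple,
            FilterLevel = TypeFilterLevel.Low,
            TypeFormat = FormatterTypeStyle.TypesAlways
        };

        // Any memory growth observed during this call comes from the
        // formatter's own bookkeeping, since the sink stores nothing.
        formatter.Serialize(Stream.Null, value);
    }
}
```

Calling SerializationProbe.SerializeToNull(value) with the same list that fails in Upload would separate the two possible causes before choosing a different serialization format.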
I therefore tried a different protocol, Protocol Buffers. I tried both of the implementations available as NuGet packages, protobuf-net and Google.Protobuf, but the serialization was horribly slow (roughly 30 minutes) and, from what I have read, Protobuf is not optimized for serializing data larger than 1 MB. So I went back to the drawing board and came across Cap'n Proto, which promised to solve my speed issues by using memory mapping. I am trying to use @marc-gravell's C# bindings, but I am having some difficulty implementing a serializer, as the project does not have thorough documentation yet. Moreover, I'm not 100% sure that Cap'n Proto is the correct choice of protocol, but I am struggling to find any alternative suggestions online.
How can I serialize a very large collection of items to blob storage, without hitting memory issues, and in a reasonably fast way?