
I'm using protobuf to serialize large objects to binary files to be deserialized and used again at a later date. However, I'm having issues when deserializing some of the larger files. The files are roughly 2.3 GB in size, and when I try to deserialize them I get several exceptions thrown (in the following order):

I've looked at the question referenced in the second exception, but that doesn't seem to cover the problem I'm having.

I'm using Microsoft's HPC pack to generate these files (they take a while) so the serialization looks like this:

    using (var consoleStream = Console.OpenStandardOutput())
    {
        Serializer.Serialize(consoleStream, dto);
    }

And I'm reading the files in as follows:

    private static T Deserialize<T>(string file)
    {
        using (var fs = File.OpenRead(file))
        {
            return Serializer.Deserialize<T>(fs);
        }
    }

The files are of two different types. One is about 1 GB in size, the other about 2.3 GB. The smaller files all work, the larger files do not. Any ideas what could be going wrong here? I realise I've not given a lot of detail; I can give more as requested.

geekchic
  • *Deserialization* and *2.3 GB* already sounds wrong. Disregarding the errors, the idea of using any kind of serialization for such a *huge* amount of data is bad. Could you elaborate on exactly what problem you are trying to solve by using serialization? – Sinatr May 09 '14 at 11:15
  • @Sinatr Yeah, I've kind of realised that perhaps this wasn't the best route, but I have the files now so I'm trying to salvage them. I need to be able to generate these files and save them to disk for use later. – geekchic May 09 '14 at 11:20
  • What use? Could you tell us exactly what these files are? Maybe you decided to transfer (export/import?) data by using serialization or something else, where serialization (for such an amount of data) is a bad idea. Consider using a custom file format, where the huge data (HPC pack? what is that?) is just copied 1 to 1, while the small part (containing configuration, paths, parameters, etc.) is serialized in the classic way and then combined with the huge data. – Sinatr May 09 '14 at 11:27
  • @geekchic I have to confess, my unit test suite doesn't extend to multi-GB files. It is possible that this is simply a reader issue relating to an `int` that perhaps should be a `long`; I will have to find a moment to investigate. – Marc Gravell May 09 '14 at 11:29
  • @MarcGravell but you have to admit: It's a cool bug! And a case for the checked-arithmetic compiler option, maybe. – usr May 09 '14 at 11:54
  • @MarcGravell I have stack traces for all three bugs if you want them - didn't want to make the question an unreadable wall of text. – geekchic May 09 '14 at 12:42
  • @Sinatr The files are results of fairly intensive mathematical models. Each takes about 8 hours to generate. I suppose I could write a library to write these objects into a custom file format, but I just thought that since there are libraries out there that can turn my object into binary I might as well try those. – geekchic May 09 '14 at 12:44
  • Still a bad idea. What are your mathematical models? Arrays of points? Store them in a database, for example, or again define your own custom format (it's a pretty easy task to save/read a binary file; a sketch follows after these comments). I have a feeling you do not really need serialization here. Serialization is a process of transforming data (for whatever reason); perhaps the fastest approach in your case would be to organize the data so that you don't need an additional transformation, or so that it becomes trivial. For example, game saves: the process can be long (with conversion) or flash-quick (dumping and zipping). – Sinatr May 09 '14 at 13:03
  • @geekchic any chance you could email the stack traces to me? see my profile page (click my name ==>) – Marc Gravell May 09 '14 at 13:07
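A minimal sketch of the custom-format idea from the comments above, assuming the bulk data is essentially numeric arrays; the `Save`/`Load` names and the `double[]` payload are made-up placeholders, not anything from the question (requires `System.IO`):

    // Sketch: dump the bulk numeric data 1-to-1 with BinaryWriter and keep only a
    // tiny header that is cheap to read back. Purely illustrative.
    static void Save(string path, double[] results)
    {
        using (var fs = File.Create(path))
        using (var bw = new BinaryWriter(fs))
        {
            bw.Write(results.Length);           // small header: element count
            foreach (var value in results)
                bw.Write(value);                // raw 8-byte doubles, no transformation
        }
    }

    static double[] Load(string path)
    {
        using (var fs = File.OpenRead(path))
        using (var br = new BinaryReader(fs))
        {
            var count = br.ReadInt32();
            var results = new double[count];
            for (var i = 0; i < count; i++)
                results[i] = br.ReadDouble();
            return results;
        }
    }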

1 Answer


Here I need to refer to a recent discussion on the protobuf list:

Protobuf uses int to represent sizes so the largest size it can possibly support is <2G. We don't have any plan to change int to size_t in the code. Users should avoid using overly large messages.

I'm guessing that the cause of the failure inside protobuf-net is basically the same. I can probably change protobuf-net to support larger files, but I have to advise that this is not recommended, because it looks like no other implementation is going to work well with such huge data.
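To make the arithmetic concrete, here is a purely illustrative snippet (not the actual protobuf-net reader code) showing why a 32-bit position breaks somewhere past 2 GB while a ~1 GB file is fine:

    // Illustration only: a 32-bit signed counter cannot represent a ~2.3 GB length.
    long fileLength = 2300000000;             // ~2.3 GB fits comfortably in a long
    Console.WriteLine(int.MaxValue);          // 2147483647 bytes ≈ 2.0 GB: the ceiling for an int
    int wrapped = unchecked((int)fileLength); // wraps to -1994967296: a negative "position"
    Console.WriteLine(wrapped);               // negative => length/position checks misbehave
    int overflow = checked((int)fileLength);  // with checked arithmetic this throws instead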

The fix is probably just a case of changing a lot of `int` to `long` in the reader/writer layer. But: what is the layout of your data? If there is an outer object that is basically a list of the actual objects, there is probably a sneaky way of doing this using an incremental reader (basically, spoofing the repeated support directly).
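As an untested sketch of that incremental trick: if the root message is essentially `[ProtoMember(1)] List<Item>`, its wire format is just a sequence of length-prefixed field-1 sub-messages, so something like the following could stream the items one at a time instead of materializing the whole root (`Item` and `ReadItems` are placeholders, since I don't know your actual model):

    // Untested sketch: stream the field-1 sub-messages one at a time, so no single
    // read ever has to deal with the full 2.3 GB root object.
    static IEnumerable<Item> ReadItems(string file)
    {
        using (var fs = File.OpenRead(file))
        {
            foreach (var item in Serializer.DeserializeItems<Item>(fs, PrefixStyle.Base128, 1))
            {
                yield return item;   // each item is read and materialized individually
            }
        }
    }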

Marc Gravell
  • There is an outer object that has a couple of (rather large) dictionaries on it. How exactly would a reader like that work? – geekchic May 14 '14 at 10:45
  • Basically, using an outer reader to read the lengths of items, and an inner reader to read them; however, looking at the code, it *might* be a case of updating the `ProtoReader.position` and `ProtoReader.blockEnd` to be `long`, making sure to also change any related `int.MaxValue` to `long.MaxValue`. Do you perhaps have an example generator or example somewhere? I can of course try to write a multi-GB generator, but no harm in asking... – Marc Gravell May 14 '14 at 11:08
  • Afraid I don't, the code (and data) belongs to my employer. I guess in the meantime I'll regenerate and break the files up into smaller chunks (one way of doing that is sketched below). – geekchic May 14 '14 at 11:11
  • @geekchic it really depends how urgently you need it; I can try to take a look, but balancing time between lots of projects is always fun... – Marc Gravell May 14 '14 at 11:13
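A sketch of the chunked-write side mentioned in the comments, purely illustrative and not from the thread: each item is written as its own length-prefixed field-1 message, so no single message approaches the 2 GB limit and the stream can be read back incrementally with `Serializer.DeserializeItems` as sketched above (again, `Item` is a placeholder for the real model):

    // Illustrative only: write each item as an individually length-prefixed
    // field-1 message, pairing with DeserializeItems<Item>(stream, PrefixStyle.Base128, 1).
    static void WriteItems(Stream destination, IEnumerable<Item> items)
    {
        foreach (var item in items)
        {
            // one small, self-contained message per item
            Serializer.SerializeWithLengthPrefix(destination, item, PrefixStyle.Base128, 1);
        }
    }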