
I have been keeping a large set of data as TEXT records in a TEXT file:

yyyyMMddTHHmmssfff double1 double2

However, when I read it back I need to parse each DateTime, which is quite slow for millions of records.

So now I am trying a binary file, which I created by serializing my class.

That way I do not need to parse the DateTime.

    [Serializable] // required by BinaryFormatter
    class MyRecord
    {
        public DateTime DT;
        public double Price1;
        public double Price2;
    }

    public byte[] SerializeToByteArray()
    {
        var bf = new BinaryFormatter();
        using (var ms = new MemoryStream())
        {
            bf.Serialize(ms, this);
            return ms.ToArray();
        }
    }

    MyRecord mr;

    var outBin = new BinaryWriter(File.Create(binFileName, 2048, FileOptions.None));

    foreach (var record in AllRecords) // pseudo
    {
        mr = new MyRecord(); // pseudo
        outBin.Write(mr.SerializeToByteArray());
    }

The resulting binary is on average 3 times the size of the TEXT file.

Is that to be expected?

EDIT 1

I am exploring using protobuf-net to help me:

I want to do this with a `using` block to fit my existing structure.

    private void DisplayBtn_Click(object sender, EventArgs e)
    {
        string fileName = dbDirectory + @"\nAD20120101.dat";

        using (FileStream fs = File.OpenRead(fileName))
        {
            MyRecord tr;
            while (fs.CanRead)
            {
                tr = Serializer.Deserialize<MyRecord>(fs);

                Console.WriteLine("> " + tr.ToString());
            }
        }
    }

BUT after the first record, tr comes back full of zeroes.

ManInMoon
  • The default serializers output a lot of overhead. If you wrote your stuff to the binary writer manually, you'd have a lot less. – harold Apr 02 '14 at 12:39
  • This isn't just "a binary file" it's "a binary file using BinaryFormatter" - and a new BinaryFormatter for each record too, which will add extra overhead. That's the reason. – Jon Skeet Apr 02 '14 at 12:39
  • This has nothing to do with binary files per se. This is specific to `BinaryFormatter`. Other binary formats will be efficient. – CodesInChaos Apr 02 '14 at 12:40
  • if you need a smart serialization solution you can take a look at ProtoBuf-net https://code.google.com/p/protobuf-net/ – BRAHIM Kamel Apr 02 '14 at 12:41
  • @K.B The Serialize appears to go well. But then `private void DisplayBtn_Click(object sender, EventArgs e) { string fileName = dbDirectory + @"\nAUDUSD20120101.dat"; FileStream fs = File.OpenRead(fileName); while (fs.CanRead) { MyRecord tr; tr = Serializer.Deserialize(fs); } }` I get error: No parameterless constructor found for MyRecord – ManInMoon Apr 02 '14 at 13:14
  • that's pretty simple deserialization will need default constructor so in your class add public MyRecord(){} please tell me if I should put this as an answer – BRAHIM Kamel Apr 02 '14 at 13:16
  • Yes please, it looks right for me. Can you explain the last bit again. Where do I put MyRecord(){} - I already have a class called that – ManInMoon Apr 02 '14 at 13:20
  • @K.B In the loop above I only get the first record. Calling Deserialize several more times just returns records full of zeroes. – ManInMoon Apr 02 '14 at 13:25
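harold's suggestion above (writing the fields manually with `BinaryWriter` instead of going through `BinaryFormatter`) could look like the sketch below, assuming the `MyRecord` fields from the question. `ManualIo` is a hypothetical helper name; each record then occupies a fixed 24 bytes (8 for the `DateTime` via `ToBinary()`, 8 per double) with no per-record type metadata:

```csharp
using System;
using System.IO;

static class ManualIo
{
    // Write one record as 24 raw bytes: long + double + double.
    public static void Write(BinaryWriter w, DateTime dt, double p1, double p2)
    {
        w.Write(dt.ToBinary()); // long; round-trips ticks and Kind exactly
        w.Write(p1);
        w.Write(p2);
    }

    // Read one record back in the same field order.
    public static (DateTime DT, double Price1, double Price2) Read(BinaryReader r)
    {
        return (DateTime.FromBinary(r.ReadInt64()), r.ReadDouble(), r.ReadDouble());
    }
}
```

Reading back requires no DateTime parsing, and the file is smaller than the text version (24 bytes versus roughly 40 characters per line).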

3 Answers


Your archive likely has considerable overhead serializing type information with each record.

Instead, make the whole collection serializable (if it isn't already) and serialize that in one go.
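A sketch of that idea, using the question's `MyRecord` and hypothetical helper names (`BulkIo`, `SaveAll`, `LoadAll`). Serializing the `List<MyRecord>` in a single `Serialize` call writes the type metadata once for the whole file instead of once per record. (Note that `BinaryFormatter` is disabled by default on recent .NET versions; this mirrors the question's approach.)

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

[Serializable]
class MyRecord
{
    public DateTime DT;
    public double Price1;
    public double Price2;
}

static class BulkIo
{
    // One Serialize call for the whole list: type info is written once.
    public static void SaveAll(string path, List<MyRecord> records)
    {
        using (var fs = File.Create(path))
            new BinaryFormatter().Serialize(fs, records);
    }

    public static List<MyRecord> LoadAll(string path)
    {
        using (var fs = File.OpenRead(path))
            return (List<MyRecord>)new BinaryFormatter().Deserialize(fs);
    }
}
```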

sehe

You are not storing a simple binary version of your DateTime, but a serialized object representing it. That is much larger than simply storing your date as text.

If you create a class

class MyRecords
{
    DateTime[] DT;
    double[] Price1;
    double[] Price2;
}

And serialize that, it should be much smaller.

Also, I suspect DateTime still needs a lot of space, so you could convert your DateTime to an integer Unix timestamp and store that instead.
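A minimal sketch of the timestamp conversion, with a hypothetical `UnixTime` helper: store seconds since the Unix epoch as a `long` (or an `int`, if the date range fits) instead of a full `DateTime`.

```csharp
using System;

static class UnixTime
{
    static readonly DateTime Epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // DateTime -> whole seconds since 1970-01-01 UTC.
    public static long ToUnixSeconds(DateTime dt) =>
        (long)(dt.ToUniversalTime() - Epoch).TotalSeconds;

    // Seconds since the epoch -> UTC DateTime.
    public static DateTime FromUnixSeconds(long s) =>
        Epoch.AddSeconds(s);
}
```

Note that whole seconds discard the millisecond part of the question's `yyyyMMddTHHmmssfff` format; storing milliseconds in the `long` instead keeps full precision.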

Mathias

As requested by the OP:

The output is not a plain binary file; it is a binary serialization of your instances, plus the overhead BinaryFormatter adds to allow deserialization later. That is why the file is about 3 times larger than you expected. If you need a compact serialization solution, you can take a look at protobuf-net: https://code.google.com/p/protobuf-net/

Here you can find a link explaining how you can achieve this:

    [ProtoContract]
    public class MyRecord
    {
        [ProtoMember(1)]
        public DateTime DT;

        [ProtoMember(2)]
        public double Price1;

        [ProtoMember(3)]
        public double Price2;
    }
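On the "zeroes after the first record" problem from the comments: plain `Serializer.Deserialize<T>` reads to the end of the stream, so it cannot separate multiple records in one file. protobuf-net's length-prefixed variants handle that; a sketch under that assumption (`RecordFile` is a hypothetical helper name):

```csharp
using System;
using System.IO;
using ProtoBuf;

[ProtoContract]
public class MyRecord
{
    [ProtoMember(1)] public DateTime DT;
    [ProtoMember(2)] public double Price1;
    [ProtoMember(3)] public double Price2;
}

public static class RecordFile
{
    // Each record is written with a length prefix, so records stay separable.
    public static void WriteAll(Stream s, MyRecord[] records)
    {
        foreach (var mr in records)
            Serializer.SerializeWithLengthPrefix(s, mr, PrefixStyle.Base128, 1);
    }

    // DeserializeWithLengthPrefix returns null once the stream is exhausted.
    public static int ReadAll(Stream s)
    {
        int count = 0;
        MyRecord tr;
        while ((tr = Serializer.DeserializeWithLengthPrefix<MyRecord>(s, PrefixStyle.Base128, 1)) != null)
        {
            Console.WriteLine("> " + tr.DT + " " + tr.Price1 + " " + tr.Price2);
            count++;
        }
        return count;
    }
}
```

The `while ((tr = ...) != null)` loop replaces the `while (fs.CanRead)` test from the question, which never becomes false on a readable stream.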
BRAHIM Kamel