
I need to store some data in a file. For the moment, each record (data set) consists of the following (see the sketch after the list):

  • a string (variable length),
  • an array of integers (variable length),
  • an array of bytes (variable length),
  • some integer values.
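
For illustration, a minimal sketch of such a record as a Delphi type (the field names here are placeholders, not part of any actual format):

type
  TDataRecord = record
    Text:   string;            // a string (variable length)
    Ints:   array of Integer;  // an array of integers (variable length)
    Bytes:  array of Byte;     // an array of bytes (variable length)
    ValueA: Integer;           // some integer values
    ValueB: Integer;
  end;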

It wouldn't be difficult at all to save all this stuff in a binary file. However, I know for sure that (unfortunately) my data format will change over time, and I want to be able to add more fields to each "record". So, obviously, my file format cannot be fixed. I suppose the best solution would be to save my data in a (DB) table, but I don't want to mess with the big guns (SQL, ADO, BDE, Nexus...). I need a rudimentary library (if possible a single PAS file) that can do that. Since the purpose here is storing data rather than working with the data, can it be done without a DB table?

Requirements for this library:

  • it needs to easily support more than 1 million rows
  • really lightweight
  • single PAS file if possible
  • MANDATORY: easy to install on a new machine (together with the project in which it compiles)
  • MANDATORY: in order to use it I don't need to redistribute anything
  • MANDATORY: in order to use it the user doesn't have to install/setup stuff
  • can be freeware/shareware
  • it doesn't have to support SQL queries or similar advanced stuff

I use Delphi 7.

Gabriel
  • I would definitely design my own binary file format for this, but I am not sure I dare to make an answer of that... – Andreas Rejbrand Feb 11 '11 at 18:10
  • What is the lifetime of this data? Are its consumers going to want to use it after you have ceased developing your tools? If so, then a standards-based format will make your offering more attractive. Perhaps not relevant to you in this instance, but you never know. – David Heffernan Feb 11 '11 at 18:38
  • @david - The user does not have to know about this file. I use it internally. The user will use only my program to process the data. – Gabriel Feb 11 '11 at 19:10
  • @Andreas R - Yes, I was thinking about the same. I just supposed that using an existing library would make things easier (it seems this isn't necessarily true). – Gabriel Feb 11 '11 at 19:11
  • @Altar If it's the users data then they may be reluctant for it to be swallowed into a proprietary format. If this file is some intermediate file produced from the users' data then of course it's a non-issue. If the data is fleeting then it also doesn't matter. If the user wants to recover it in 20 years time then they ought to care! I expect you know all this though!!! – David Heffernan Feb 11 '11 at 19:16
  • 1
    I hate custom binary file formats. :-) – Warren P Feb 11 '11 at 19:19
  • @Altar Many companies that have huge amounts of data locked in proprietary Office document formats are beginning to regret that now. Many don't care. All I'm saying is that for some types of data, and yours may not fall into that category, some users will prefer text data formats that can be parsed using standard parsers should they ever need to. But I know nothing about your users. I'm just flagging up something that is an issue for me and no doubt others. – David Heffernan Feb 11 '11 at 20:22
  • @Altar I don't keep DCU files. I keep source files. I can't run the software that I had 20 years ago. But I would like to be able to read the data. 20 years ago I was running DOS. I still have Lotus Ami Pro files that I can't read. – David Heffernan Feb 11 '11 at 20:23
  • @Altar Well, I don't know your app at all, so I didn't know whether this huge table stores data that the user wants to keep for a long time, or whether it is transient and can be thrown away. And you confirm it is the latter. Fine. – David Heffernan Feb 11 '11 at 22:32
  • The table is used internally. – Gabriel Feb 12 '11 at 00:44

6 Answers


Take a look at our Synopse Big Table unit.

With its recent upgrade, it could fit your needs perfectly.

Here is how you create your field layout:

var Table: TSynBigTableRecord;
    FieldText, FieldInt: TSynTableFieldProperties;
begin
  Table := TSynBigTableRecord.Create('FileName.ext','TableName');
  FieldText := Table.AddField('text',tftWinAnsi,[tfoIndex]);
  FieldInt := Table.AddField('Int',tftInt32,[tfoIndex,tfoUnique]);
  Table.AddFieldUpdate;
end;

For storing an array of bytes or integers, just use the tftWinAnsi or, even better, the tftBlobInternal kind of field (a true variable-length field), then map it to or from a dynamic array, just like a RawByteString.
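
For instance, here is a rough sketch of that mapping. It is not SynBigTable-specific: TIntegerDynArray comes from the standard Types unit, and RawByteString is assumed to be the alias the Synopse units declare for pre-2009 Delphi versions.

uses
  Types; // TIntegerDynArray = array of Integer

function IntArrayToRaw(const A: TIntegerDynArray): RawByteString;
begin
  // copy the raw bytes of the dynamic array into a variable-length string
  SetLength(Result, Length(A) * SizeOf(Integer));
  if Length(A) > 0 then
    Move(A[0], Result[1], Length(Result));
end;

function RawToIntArray(const S: RawByteString): TIntegerDynArray;
begin
  SetLength(Result, Length(S) div SizeOf(Integer));
  if Length(Result) > 0 then
    Move(S[1], Result[0], Length(Result) * SizeOf(Integer));
end;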

You can safely add fields later; the data file will be processed for you.

There are several ways of handling the data, but I've implemented a variant-based way of using a record, with true late-binding:

var vari: Variant;
    aID: Integer; // the record ID returned by VariantAdd (assumed Integer here)
begin
  // initialize the variant
  vari := Table.VariantVoid;
  // create record content, and add it to the database
  vari.text := 'Some text';
  vari.int := 12345;
  aID := Table.VariantAdd(vari);
  if aID=0 then
    ShowMessage('Error adding record');
  // how to retrieve it
  vari := Table.VariantGet(aID);
  assert(vari.ID=aID);
  assert(vari.INT=12345);
  assert(vari.Text='Some text');
end;

About speed, you can't find anything faster IMHO. Creating 1,000,000 records with some text and an integer value, with both fields indexed and the integer field set as unique, takes less than 880 ms on my laptop. It uses very little disk space, because all storage is variable-length encoded (similar to Google's Protocol Buffers).

It needs only two units and works with Delphi 6 up to XE (it is already Unicode ready, so you could safely upgrade to a newer Delphi version whenever you want). There is nothing to install, and it adds only a few KB to your executable. It's a small but powerful NoSQL engine written in pure Delphi, but with the convenience of a database (i.e. a defined field layout) and the speed of an in-memory engine, with no limit on size.

And it's fully open source, with a permissive licence.

Note that we also provide a SQLite3 wrapper, but that's another project: slower but more powerful, with SQL support and an integrated client/server ORM.

Arnaud Bouchez
  • It took me quite a while to figure out where to download it from. – Gabriel Feb 11 '11 at 19:22
  • @Altar The download links are on the main blog page. – Arnaud Bouchez Feb 11 '11 at 21:26
  • Another idea: our SQLite3 framework can be used without the SQLite3 engine. You can have your data loaded in memory, as a TList, then serialized as a JSON content on disk. And you'll have all ORM features of our framework. Define a TSQLRecord class type with published properties, then use a TSQLRestServerStaticInMemory instance to handle the data, in a local or client/server way. I think the JSON parser/producer implemented is the fastest available in pure Delphi. It will be in all cases faster than Delphi serialization (dfm like) or XML. – Arnaud Bouchez Feb 12 '11 at 17:30

Use Synopse BigTable, http://synopse.info/ , a key=>value database, where the value in this case is the serialization of your data (JSON, binary, XML, ...).

This is CRAZY fast, lightweight and free.
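
To illustrate the key=>value idea, here is a generic sketch of building such a "value" by hand (this is not the library's own API, just plain TMemoryStream code; the helper name is made up):

uses
  Classes;

// Serialize one record (a string plus an array of integers) into a single
// binary string that a key=>value store can keep as an opaque blob.
function SerializeRecord(const Text: string; const Ints: array of Integer): string;
var
  MS: TMemoryStream;
  Len: Integer;
begin
  MS := TMemoryStream.Create;
  try
    Len := Length(Text);
    MS.WriteBuffer(Len, SizeOf(Len));
    if Len > 0 then
      MS.WriteBuffer(Text[1], Len);
    Len := Length(Ints);
    MS.WriteBuffer(Len, SizeOf(Len));
    if Len > 0 then
      MS.WriteBuffer(Ints[0], Len * SizeOf(Integer));
    // hand the buffer back as a plain (Ansi) string
    Len := MS.Size;
    SetLength(Result, Len);
    if Len > 0 then
      Move(MS.Memory^, Result[1], Len);
  finally
    MS.Free;
  end;
end;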

arthurprs

I don't think you need a database for this. If you use a database I don't see how it solves the problem of your data structure changing.

I personally would store to the YAML format, which is very easily extensible. That requires quite a bit of work linking to some LIBYAML, so a very lightweight alternative would be to store to INI files. These are easily extensible whilst maintaining compatibility with old files (see the sketch below).
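
As an illustration of the INI approach, here is a minimal sketch using the standard IniFiles unit (the section and key names are just placeholders). Old files stay compatible because a key that is missing simply falls back to its default value:

uses
  IniFiles, SysUtils;

procedure SaveRecord(const FileName: string; Index: Integer;
  const Text: string; ValueA: Integer);
var
  Ini: TMemIniFile;
  Section: string;
begin
  Ini := TMemIniFile.Create(FileName);
  try
    Section := 'Record' + IntToStr(Index);
    Ini.WriteString(Section, 'Text', Text);
    Ini.WriteInteger(Section, 'ValueA', ValueA);
    // a field added in a later program version is simply a new key;
    // files written by older versions just won't contain it
    Ini.UpdateFile; // TMemIniFile only writes to disk when asked
  finally
    Ini.Free;
  end;
end;

function LoadValueB(const FileName: string; Index: Integer): Integer;
var
  Ini: TMemIniFile;
begin
  Ini := TMemIniFile.Create(FileName);
  try
    // a key that old files don't have falls back to the default (0 here)
    Result := Ini.ReadInteger('Record' + IntToStr(Index), 'ValueB', 0);
  finally
    Ini.Free;
  end;
end;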

You can quite easily roll your own extensible binary format. You write each record as a block, and each block starts with a short header that includes its length.

When you read the data, you read up to the end of the block. If you are expecting more data than the block contains, you simply stop reading and use default values for the missing fields. If you have read all the data you know about but are not yet at the end of the block, the file must have come from a later version of your program, and you just skip to the end of the block, perhaps warning that the file contained data you didn't recognize.

Extensibility is achieved by always writing data out in the same order as previous versions. Any new data goes at the end of each block.
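
Here is a minimal sketch of that block layout (the record fields are made up for illustration). Each block starts with its length, so an older reader can skip data it does not understand, and a newer reader can fall back to defaults when a block ends early:

uses
  Classes, SysUtils;

procedure WriteBlock(Stream: TStream; const Text: string; ValueA: Integer);
var
  Block: TMemoryStream;
  Len, TextLen: Integer;
begin
  Block := TMemoryStream.Create;
  try
    // write the fields in a fixed order; new fields are appended at the end
    TextLen := Length(Text);
    Block.WriteBuffer(TextLen, SizeOf(TextLen));
    if TextLen > 0 then
      Block.WriteBuffer(Text[1], TextLen);
    Block.WriteBuffer(ValueA, SizeOf(ValueA));
    // prefix the block with its length so any reader can skip to its end
    Len := Block.Size;
    Stream.WriteBuffer(Len, SizeOf(Len));
    Stream.CopyFrom(Block, 0); // Count = 0 copies the whole block
  finally
    Block.Free;
  end;
end;

procedure ReadBlock(Stream: TStream; out Text: string; out ValueA: Integer);
var
  Len, TextLen: Integer;
  BlockEnd: Int64;
begin
  Stream.ReadBuffer(Len, SizeOf(Len));
  BlockEnd := Stream.Position + Len;
  Stream.ReadBuffer(TextLen, SizeOf(TextLen));
  SetLength(Text, TextLen);
  if TextLen > 0 then
    Stream.ReadBuffer(Text[1], TextLen);
  if Stream.Position < BlockEnd then
    Stream.ReadBuffer(ValueA, SizeOf(ValueA))
  else
    ValueA := 0; // field missing: the file came from an older program version
  // skip anything written by a newer version of the program
  Stream.Position := BlockEnd;
end;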

David Heffernan
  • YAML, INI, XML or JSON need to use a parser, which could be slow for more than 1,000,000 rows of data. JSON parsers can be fast, but creating all objects in memory during loading is not immediate, and will create some memory fragmentation. – Arnaud Bouchez Feb 11 '11 at 18:22
  • @A.Bouchez I would have thought the I/O would be the bottleneck, even for YAML/JSON/XML, but you clearly know a lot more than I do on this one. I don't see why using YAML/JSON/XML implies fragmentation. It sounds like you have a very good tool though! – David Heffernan Feb 11 '11 at 18:35
  • 1
    There are 2 ways of using YAML/INI/XML/JSON: 1. Parsing and reading records one per one. 2. Loading all content in memory, creating one object for each parsed item. 1. Can be good for a logging mechanism, but it's not a DB-like approach. 2. Every information node will be created as a TObject in memory with strings within, therefore it will create a lot of small memory buffers... here comes the memory fragmentation. With SynBigTable you just create pointers to the binary memory mapped data, and never copy it. Of course, the *ML approach could work with modern HW. But it won't scale well. – Arnaud Bouchez Feb 11 '11 at 21:32
  • @A.Bouchez I think you can make a fragmentation-free parser of highly regular, table-like *ML data, but you need to work hard to do so. I can see that transferring compact binary and unpacking it on demand will be more efficient if you don't read all 1M records. It's certainly a very interesting topic! I do like the sound of your BigTable though, even if I'm not a huge fan of your SynGdiPlus!! ;-) – David Heffernan Feb 11 '11 at 21:44
  • @David SynGdiPlus is not at all a definitive GdiPlus library. It was designed for anti-aliased rendering of any TMetaFile. Error handling for this feature is enough IMHO. At least there is no memory or resource leak, and the processing is very fast. It's used in production without any problem (it's more complete than GDI+ 1.1). GdiPlus itself is a bit deprecated (it lacks some basic features of Gdi: no direct text rotation nor underline/strikeout, no RoundRect...). And Direct2D still suffers from performance problems (HW graphic drivers are not efficient yet). AggPas or Cairo could be used as a renderer. – Arnaud Bouchez Feb 12 '11 at 17:25
  • @A.Bouchez No worries! I solved my problem in the end with graphics32 which seems an excellent piece of work. – David Heffernan Feb 12 '11 at 17:29

In increasing order of implementation effort, I suggest:

  1. CSV or INI files (TMemIniFile, or TJvCsvDataSet). This is the least work for you. You can support millions of lines in a single file, but the memory consumed will be enormous. I have envisioned a "data writer" component to replace my TJvCsvDataSet with something that only appends records and does not load them into memory. This would allow you to write out to CSV files, and even read them back, row by row, but not load them all at once. This approach might be ideal for you: a simple CSV reader/writer class that is NOT a dataset object.

  2. one-XML-tag-per-line files. This is more flexible than INI files and can be hierarchical (INI files are not). Neither SAX nor DOM is required if you simply open a file stream and append a line of text in this form, ending in CR+LF (see the sketch after this list):

    <logitem attrib1="value1" attrib2="value2" />

  3. Some kind of binary NoSQL DB like bsddb, couchdb, etc.
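
For item 2, here is a minimal sketch of appending one XML-like line per record without any parser (the file name, tag and attribute names are placeholders, and the values are assumed not to contain XML special characters):

uses
  Classes, SysUtils;

// Append a single self-closing tag per record; no SAX/DOM is needed to write,
// and reading back can be done line by line (e.g. with ReadLn).
procedure AppendLogItem(const FileName, Value1, Value2: string);
var
  FS: TFileStream;
  Line: string;
  Mode: Word;
begin
  if FileExists(FileName) then
    Mode := fmOpenReadWrite or fmShareDenyWrite
  else
    Mode := fmCreate;
  FS := TFileStream.Create(FileName, Mode);
  try
    FS.Seek(0, soFromEnd); // always append at the end
    Line := Format('<logitem attrib1="%s" attrib2="%s" />'#13#10, [Value1, Value2]);
    FS.WriteBuffer(Line[1], Length(Line));
  finally
    FS.Free;
  end;
end;
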
Warren P
  • 65,725
  • 40
  • 181
  • 316
  • isn't XML slow when you are about to decode 10 million data sets? – Gabriel Feb 11 '11 at 18:14
  • > and can be hierarchical - - - I don't know if this is relevant for me. – Gabriel Feb 11 '11 at 18:15
  • +1 for INI files. The original poster should test performance on large data sets, but if, as he states, he is using this for storing rather than working with the data, INI files might be fast enough. IMO, they're the simplest to work with. – RobertFrank Feb 11 '11 at 18:56
  • I have appended data to INI files using TIniFile, which is much faster than using TMemIniFile when the files get large. You can even use the Windows API WritePrivateProfileString, and log millions of lines this way. – Warren P Feb 11 '11 at 19:20
  • altar, what if you write your XML in a one-line-per-record format without any closing tag? Then you never have to parse 1 million records; just do a readline and parse one line. It's sort of a custom XML-like CSV row logging system. – Warren P Feb 11 '11 at 19:21
  • @Altar You can make any row-defined table hierarchical, by using a reference to the ID of another row. – Arnaud Bouchez Feb 11 '11 at 21:35

Any time you need to store variable-length data in a binary format, you should store the data's length in front of the actual data.

Since you also need to add new fields later, you should store the number of fields per record (or at least an end-of-record marker at the end) so you can maintain the correct position while reading and seeking through the file.

As for the actual record data, I would suggest a type-length-data format for each field so you can add new fields without knowing what their data types will be ahead of time, and to allow code to recognize and read/skip individual fields as needed regardless of content (ie, if an old app tries to read a file with newer fields, it can skip what it does not recognize).

You will end up with something like this as a starting point, which you can then expand, optimize, etc. as needed:

const
  cTypeUnknown     = $00;
  cTypeString      = $01;
  cTypeInteger     = $02;
  cTypeByte        = $03;

  cTypeArray        = $80;
  cTypeStringArray  = cTypeString or cTypeArray;
  cTypeIntegerArray = cTypeInteger or cTypeArray;
  cTypeByteArray    = cTypeByte or cTypeArray;

type
  Streamable = class
  public
    procedure Read(Stream: TStream); virtual; abstract;
    procedure Write(Stream: TStream); virtual; abstract;
  end;

  Field = class(Streamable)
  public
    function GetType: Byte; virtual; abstract;
  end;

  FieldClass = class of Field;

  StringField = class(Field)
  public
    Data: String;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  StringArrayField = class(Field)
  public
    Data: array of String;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  IntegerField = class(Field)
  public
    Data: Integer;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  IntegerArrayField = class(Field)
  public
    Data: array of Integer;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  ByteField = class(Field)
  public
    Data: Byte;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  ByteArrayField = class(Field)
  public
    Data: array of Byte;
    function GetType: Byte; override;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  AnyField = class(ByteArrayField)
  public
    FieldType: Byte; // 'Type' is a reserved word in Delphi, so use a different identifier
    function GetType: Byte; override;
  end;

  DataRecord = class(Streamable) // 'Record' is a reserved word, so the class is named DataRecord
  public
    Fields: array of Field;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

  RecordArray = class(Streamable)
  public
    Records: array of DataRecord;
    procedure Read(Stream: TStream); override;
    procedure Write(Stream: TStream); override;
  end;

procedure WriteByte(Stream: TStream; Value: Byte);
begin
  // WriteBuffer/ReadBuffer take untyped parameters, so pass the value itself, not a pointer
  Stream.WriteBuffer(Value, SizeOf(Value));
end;

function ReadByte(Stream: TStream): Byte;
begin
  Stream.ReadBuffer(Result, SizeOf(Result));
end;

procedure WriteInteger(Stream: TStream; Value: Integer);
begin
  Stream.WriteBuffer(Value, SizeOf(Value));
end;

function ReadInteger(Stream: TStream): Integer;
begin
  Stream.ReadBuffer(Result, SizeOf(Result));
end;

procedure WriteString(Stream: TStream; Value: String);
var
  S: UTF8String;
begin
  S := UTF8Encode(Value);
  WriteInteger(Stream, Length(S));
  if Length(S) > 0 then
    Stream.WriteBuffer(S[1], Length(S));
end;

function ReadString(Stream: TStream): String;
var
  S: UTF8String;
begin
  SetLength(S, ReadInteger(Stream));
  if Length(S) > 0 then
    Stream.ReadBuffer(S[1], Length(S));
  Result := UTF8Decode(S);
end;

function StringField.GetType: Byte;
begin
  Result := cTypeString;
end;

procedure StringField.Read(Stream: TStream);
begin
  Data := ReadString(Stream);
end;

procedure StringField.Write(Stream: TStream);
begin
  WriteString(Stream, Data);
end;

function StringArrayField.GetType: Byte;
begin
  Result := cTypeStringArray;
end;

procedure StringArrayField.Read(Stream: TStream);
var
  I: Integer;
begin
  SetLength(Data, ReadInteger(Stream));
  for I := 0 to High(Data) do
    Data[I] := ReadString(Stream);
end;

procedure StringArrayField.Write(Stream: TStream);
var
  I: Integer;
begin
  WriteInteger(Stream, Length(Data));
  for I := 0 to High(Data) do
    WriteString(Stream, Data[I]);
end;

function IntegerField.GetType: Byte;
begin
  Result := cTypeInteger;
end;

procedure IntegerField.Read(Stream: TStream);
begin
  Assert(ReadInteger(Stream) = SizeOf(Integer));
  Data := ReadInteger(Stream);
end;

procedure IntegerField.Write(Stream: TStream);
begin
  WriteInteger(Stream, SizeOf(Integer));
  WriteInteger(Stream, Data);
end;

function IntegerArrayField.GetType: Byte;
begin
  Result := cTypeIntegerArray;
end;

procedure IntegerArrayField.Read(Stream: TStream);
var
  Count: Integer;
begin
  // Write() stores the element count, so read the count and then the raw data
  Count := ReadInteger(Stream);
  SetLength(Data, Count);
  if Count > 0 then
    Stream.ReadBuffer(Data[0], Count * SizeOf(Integer));
end;

procedure IntegerArrayField.Write(Stream: TStream);
begin
  WriteInteger(Stream, Length(Data));
  if Length(Data) > 0 then
    Stream.WriteBuffer(Data[0], Length(Data) * SizeOf(Integer));
end;

function ByteField.GetType: Byte;
begin
  Result := cTypeByte;
end;

procedure ByteField.Read(Stream: TStream);
begin
  Assert(ReadInteger(Stream) = SizeOf(Byte));
  Data := ReadByte(Stream);
end;

procedure ByteField.Write(Stream: TStream);
begin
  WriteInteger(Stream, SizeOf(Byte));
  WriteByte(Stream, Data);
end;

function ByteArrayField.GetType: Byte;
begin
  Result := cTypeByteArray;
end;

procedure ByteArrayField.Read(Stream: TStream);
begin
  SetLength(Data, ReadInteger(Stream));
  if Length(Data) > 0 then
    Stream.ReadBuffer(Data[0], Length(Data));
end;

procedure ByteArrayField.Write(Stream: TStream);
begin
  WriteInteger(Stream, Length(Data));
  if Length(Data) > 0 then
    Stream.WriteBuffer(Data[0], Length(Data));
end;

function AnyField.GetType: Byte;
begin
  Result := FieldType;
end;

procedure DataRecord.Read(Stream: TStream);
const
  PlainTypes: array[1..3] of FieldClass = (StringField, IntegerField, ByteField);
  ArrayTypes: array[1..3] of FieldClass = (StringArrayField, IntegerArrayField, ByteArrayField);
var
  I: Integer;
  RecType, PlainType: Byte;
begin
  SetLength(Fields, ReadInteger(Stream));
  for I := 0 to High(Fields) do
  begin
    RecType := ReadByte(Stream);
    PlainType := RecType and (not cTypeArray);
    if (PlainType >= cTypeString) and (PlainType <= cTypeByte) then
    begin
      if (RecType and cTypeArray) <> cTypeArray then
        Fields[I] := PlainTypes[PlainType].Create
      else
        Fields[I] := ArrayTypes[PlainType].Create;
    end else
    begin
      Fields[I] := AnyField.Create;
      // preserve the unknown type byte so the field can be written back out unchanged
      AnyField(Fields[I]).FieldType := RecType;
    end;
    Fields[I].Read(Stream);
  end;
end;

procedure DataRecord.Write(Stream: TStream);
var
  I: Integer;
begin
  WriteInteger(Stream, Length(Fields));
  for I := 0 to High(Fields) do
  begin
    WriteByte(Stream, Fields[I].GetType);
    Fields[I].Write(Stream);
  end;
end;

procedure RecordArray.Read(Stream: TStream);
var
  I: Integer;
begin
  SetLength(Records, ReadInteger(Stream));
  for I := 0 to High(Records) do
  begin
    Records[I] := DataRecord.Create;
    Records[I].Read(Stream);
  end;
end;

procedure RecordArray.Write(Stream: TStream);
var
  I: Integer;
begin
  WriteInteger(Stream, Length(Records));
  for I := 0 to High(Records) do
    Records[I].Write(Stream);
end;
Remy Lebeau
  • This is the basics of serialization. But your proposal is a bit verbose. Standard TPersistent serialization would do the trick. And an ORM would make it even better. – Arnaud Bouchez Feb 12 '11 at 17:28
  • `TPersistent` by itself does not serialize anything. It merely opens the door for the DFM streaming system to do its work. I prefer writing custom serialization code, as it offers the most flexibility over the data format, and can be optimized as needed for particular data types. – Remy Lebeau Feb 14 '11 at 07:21
  • You're right, TPersistent has RTTI enabled for published properties; it is its direct child TComponent which is serializable. Writing your own serialization code is a matter of taste, of course, but manual serialization tends to be a bit verbose. I haven't done it in years. – Arnaud Bouchez Feb 14 '11 at 13:57

I can think of something like one INI or XML file per "record", with the INI files stored in a virtual file system like our SolFS. I don't know, however, what you mean by "easily support 1 mln rows", i.e. what operations must be supported. If you plan mainly random access to some smaller number of records, then parsing the text files is not a big deal. Otherwise you might want to look at some binary format, such as binary XML. I can say that the TMCDataTree class of our other product, MsgConnect, supports hierarchical INI files saved in a binary format.

Eugene Mayevski 'Callback
  • "easily support 1 mln rows" - - - It means to allow me to store more than 1 million data sets/records. So here comes the question: isn't XML slow when you are about to decode 1 million (or 10) of data sets? – Gabriel Feb 11 '11 at 18:13
  • @Altar *"isn't XML slow when you are about to decode 1 million (or 10) of data sets?"* - that was *the* question. If you need to read them all at once, then it will be slow, and so will be any text format (and probably most binary ones). If you just need to keep 1 mln available, and read them occasionally , then text file will be ok. That is what I mentioned in the answer. – Eugene Mayevski 'Callback Feb 11 '11 at 19:07