
I am reading code for a project written by others. The main task of the project is to read the contents of a large structured text file (.txt) with 8 columns into a KnowledgeBase object, which has a number of methods and member variables. The KnowledgeBase object is then written out to a binary file. For example, the KnowledgeBase class has at least these two variables:

map<string, pair<string, string>> key_info
vector<ObjectInfo> objects
...

These variables are easy to understand when I trace the code with gdb. The code then seems to convert such vectors and maps into binary forms, and the two variables above have corresponding binary counterparts:

BinaryKeyInfo *bkeys
BinaryObjectInfo *bObjects

Later on when outputting to binary file, it has such code:

fwrite((char*)(&wcount), sizeof(int32_t), 1, output);
fwrite((char*)bkeys, sizeof(KeyInfo_t), wcount, output);
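
If I understand it, the loading side would just be the inverse of these two calls. Here is my own sketch of what that might look like (the `KeyInfo_t` layout below is hypothetical; the real one is project-specific):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical layout -- the real KeyInfo_t is project-specific.
struct KeyInfo_t {
    int32_t startIndex;
    int32_t endIndex;
};

// Sketch of the inverse of the two fwrite calls above: one fread for the
// count, then a single block fread for the whole array -- no parsing.
KeyInfo_t* load_keys(FILE* input, int32_t* wcount) {
    fread(wcount, sizeof(int32_t), 1, input);
    KeyInfo_t* bkeys = (KeyInfo_t*)malloc(*wcount * sizeof(KeyInfo_t));
    fread(bkeys, sizeof(KeyInfo_t), *wcount, input);
    return bkeys;
}
```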

The conversion code from the original KnowledgeBase to binary is complicated. My question is: what's the main purpose of this conversion? Is it for faster loading of a binary file into memory than a plain text file? The plain text file is large. I learnt that object serialization is primarily for transmitting objects over the net, but I don't think that's the purpose here. It seems more like it's for speeding up data loading and saving memory. Could that be part of object serialization in C++?

marlon
  • The "main purpose" of this conversion is to accomplish the given task in a manner that actually works and does the job. There is no hidden meaning. C++ is hard. The C++ standard library is bare-bones and provides only basic functionality. When it comes to anything of any complexity, a C++ program must do all the work by itself. – Sam Varshavchik Dec 07 '21 at 00:15
  • For further ideas about serialization beyond transmission, see https://isocpp.org/wiki/faq/serialization – Passerby Dec 07 '21 at 00:20
  • Serialization broadly is about taking some collection of data, translating it through some kind of byte stream, and recovering it at the other end. It might be binary, it might be text. It might be crazy fast, or it might be horrendously slow. The point is that the resulting structure before serialization is logically the same structure after deserialization. – paddy Dec 07 '21 at 00:20
  • Let's take a simple example: text. Text is a variable length record. Two popular schemas are: 1) Write text until a terminator; 2) Write the length first, then the text. Option 2 is faster on the read, because after you read the length, you know how much memory to allocate and you can block read the data. Other structures that involve pointers need a schema because pointers are not portable into files. – Thomas Matthews Dec 07 '21 at 00:28
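
The two schemas in the last comment can be sketched like this (a hedged illustration, not code from the project, showing schema 2: the length prefix lets the reader allocate once and block-read the payload):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Schema 2: write the length first, then the text.
void write_string(FILE* out, const std::string& s) {
    int32_t len = static_cast<int32_t>(s.size());
    fwrite(&len, sizeof len, 1, out);     // length prefix
    fwrite(s.data(), 1, s.size(), out);   // then the raw bytes
}

// Reading is fast because the length tells us exactly how much to
// allocate, and the payload can be block-read in a single call.
std::string read_string(FILE* in) {
    int32_t len = 0;
    fread(&len, sizeof len, 1, in);
    std::string s(len, '\0');
    if (len > 0) fread(&s[0], 1, len, in);
    return s;
}
```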

1 Answer


Is the main purpose of object serialization in C++ for faster object loading?

No. The most important purpose of serialisation is to transform the state of the program into a format that can be stored on the filesystem, or that can be communicated across a network, and that can be de-serialised back. Often, the purpose of either is for another program to do the de-serialisation. Sometimes the de-serialiser is another instance of the same program.

The speed of de-serialisation is one metric that can be used to gauge whether one particular serialisation format is a good one. The ability to quickly undo what you have done is not the reason why you do it in the first place.

what's the benefit of converting them into binary vectors or maps?

As I mention above, the benefit of serialisation is the ability to store the serialised data on the filesystem, or to send it over a network.

what's the benefit of plain text files vs binary files?

Pros of text serialisation format:

  • Humans are able to read and write plain text. Humans generally are not able to read or write binary files.
  • It's generally easier to implement a plain text format de-/serialiser in a way that works across differing computers than it is to implement a binary format de-/serialiser that achieves the same.

Pros of binary serialisation format:

  • Typically faster and uses less storage and bandwidth.
  • Can be easier to implement if there is no need for communication between differing systems. This is typically only the case in very simple cases. (Furthermore, there usually is a need for cross-system compatibility, even if that need hasn't been realised yet.)
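
To make the storage difference concrete, here is a small sketch (illustrative only, not the question's project) counting the bytes the same values would need in each form:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Bytes needed for the binary form: a fixed 4 bytes per value,
// loadable back with a single fread.
size_t binary_size(const std::vector<int32_t>& v) {
    return v.size() * sizeof(int32_t);
}

// Bytes needed for the text form: variable-width digits plus one
// separator per value, which must also be re-parsed on load.
size_t text_size(const std::vector<int32_t>& v) {
    size_t n = 0;
    for (int32_t x : v) n += std::to_string(x).size() + 1;
    return n;
}
```

For small values text can actually be the smaller of the two; the binary form wins as the numbers grow, and it always skips the parsing step on load.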
eerorika
  • What I don't understand is: the data is initially loaded into 'plain' vectors and maps and can already be accessed efficiently in memory, so what's the benefit of converting them into binary vectors or maps? Will they be even more efficient to operate on in memory? 2) "a format that can be stored on the filesystem" — what's the benefit of plain text files vs binary files? If binary files don't add a benefit, the plain text file already "can be stored on the filesystem" (and could be communicated over a network too?). – marlon Dec 07 '21 at 00:37
  • @marlon `map`s and `vector`s are objects that don't directly contain data. They hold pointers that reference the actual data. If you write a `map` to a file with the `write` function, the pointers and a few book-keeping variables, not the pointed-at data, are saved in the file. None of the data you actually care about goes to the file. The pointers are of no use to you and are usually fatal to use. – user4581301 Dec 07 '21 at 00:46
  • I am digesting little by little. – marlon Dec 07 '21 at 01:03
  • 'Typically faster and uses less storage and bandwidth.' Does 'typically faster' mean loading the binary file and populating the data structures faster (typically at the initialization stage), or faster at the post-initialization stage (I'm not sure it would be faster than plain vectors and maps)? If it is the latter, then probably that's the purpose of the binary usage in this project. – marlon Dec 07 '21 at 01:08
  • @marlon I don't know what you mean by "post-initialization stage". There should be no difference between speeds of a vector or a map that have been de-serialised from a binary format versus having been de-serialised from a text format, given that the resulting vector/map is identical in either case. – eerorika Dec 07 '21 at 01:10
  • @user4581301 My data is already in a plain text file; I am trying to understand the benefit of the binary file format and serialization. I don't write the raw map into the file, but write the contents of the map into a plain text file by iterating over the key-value pairs in the usual way. – marlon Dec 07 '21 at 01:10
  • @marlon If the data is serialised in a plain text file, it will likely use much more storage, and be slower to read and write, than if the data is serialised in a binary file instead. – eerorika Dec 07 '21 at 01:12
  • @marlon Got you. A well-written and well-optimized binary file is often smaller and faster than a plaintext file if the data being stored is light on strings and heavy on stuff that benefits from binary form like numbers. Getting to that point usually soaks more programmer and maintainer time, and that's often worth its weight in gold. – user4581301 Dec 07 '21 at 01:14
  • The vectors before and after serialisation in this project are not exactly the same. For example, the binary object usually has variables like `int32_t startIndex; int32_t endIndex`. Are they perhaps byte positions within the object? Binary objects like this are much more opaque than plain vectors. – marlon Dec 07 '21 at 01:15
  • An example, I wrote a hack text comm protocol for use when developing the management database for a supercomputer system. The intent was to replace it with a high-speed binary format, but A) it was fast enough and B) there were always more important (and interesting) problems to solve. – user4581301 Dec 07 '21 at 01:17
  • @eerorika "it will likely use much more storage, and be slower to read and write" — probably this is the reason in this project. But this happens at the initialization stage, so I doubt the benefit of this binary approach, since it makes the code a lot more opaque and hard to follow. – marlon Dec 07 '21 at 01:18
  • @marlon sometimes people spend a lot of time optimizing processes they didn't need to, but maybe the powers above demanded a fast start-up. – user4581301 Dec 07 '21 at 01:31
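
The pointer problem user4581301 describes above can be sketched like this (my own illustration, not the project's code): instead of `fwrite(&m, sizeof m, 1, out)`, which would dump `std::map`'s internal pointers, walk the map and write the pointed-at data with a count and length prefixes.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// Write one string as: int32_t length, then the raw characters.
static void put_str(FILE* out, const std::string& s) {
    int32_t len = static_cast<int32_t>(s.size());
    fwrite(&len, sizeof len, 1, out);
    fwrite(s.data(), 1, s.size(), out);
}

static std::string get_str(FILE* in) {
    int32_t len = 0;
    fread(&len, sizeof len, 1, in);
    std::string s(len, '\0');
    if (len > 0) fread(&s[0], 1, len, in);
    return s;
}

// Serialise the data the map points at: entry count first,
// then each key and value, length-prefixed.
void write_map(FILE* out, const std::map<std::string, std::string>& m) {
    int32_t count = static_cast<int32_t>(m.size());
    fwrite(&count, sizeof count, 1, out);
    for (const auto& kv : m) {
        put_str(out, kv.first);
        put_str(out, kv.second);
    }
}

// Rebuild an identical map from the serialised bytes.
std::map<std::string, std::string> read_map(FILE* in) {
    int32_t count = 0;
    fread(&count, sizeof count, 1, in);
    std::map<std::string, std::string> m;
    for (int32_t i = 0; i < count; ++i) {
        std::string key = get_str(in);
        m[key] = get_str(in);
    }
    return m;
}
```

The map before and the map after are logically identical, which is the property paddy's comment describes: serialisation round-trips the structure, whatever the on-disk format looks like.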