file operation in binary vs text mode -- performance concern

Question

In many projects, I have seen data objects/structures written to a file in binary mode and then read back from the file in binary mode as well.

I wonder why they do it in binary mode. Is there any performance difference between text and binary mode? If not, when should binary mode or text mode be used?

I suspect this is a duplicate of http://stackoverflow.com/questions/229924/difference-between-files-writen-in-binary-and-text-mode, but I am not sure. — jogojapan, Aug 16 '12 at 06:02
@jogojapan, pretty much. But that post doesn't fully answer my question. — Alcott, Aug 16 '12 at 06:06

score 22 · Answer 1 · answered Aug 16 '12 at 06:09

Binary is faster. Consider an integer stored in 32 bits (4 bytes), such as 123456. If you were to write this out as binary (which is how it is represented in the computer) it would take 4 bytes (ignoring padding between items for alignment in structures).

To write the number as text, it has to be converted to a string of characters (some overhead for the conversion and memory to store the result) and then written out. It will take at least 6 bytes, as there are 6 characters to represent the number. This does not include any additional padding, such as spaces for alignment, or delimiters to separate the data when reading.

Now consider that you had several thousand items: the additional time can add up, and the data requires more space, which takes longer to read in. Then there is the additional time to convert back to binary for storage after you have read the value into memory.
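The size and conversion argument above can be checked with a small sketch. The function names here are hypothetical, and the example assumes a 32-bit integer as in the answer; it only counts bytes and round-trips a value in memory, without touching a file.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Bytes needed to store the value in its native binary representation.
std::size_t binary_size(std::int32_t) { return sizeof(std::int32_t); }

// Bytes needed to store the value as decimal text (no delimiter counted).
std::size_t text_size(std::int32_t value) { return std::to_string(value).size(); }

// Round trip through text: one conversion in each direction.
std::int32_t via_text(std::int32_t value) {
    const std::string digits = std::to_string(value);      // binary -> text
    return static_cast<std::int32_t>(std::stol(digits));   // text -> binary
}

// Round trip through raw bytes: a plain copy, no conversion.
std::int32_t via_binary(std::int32_t value) {
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);
    std::int32_t back;
    std::memcpy(&back, bytes, sizeof back);
    return back;
}
```

For 123456, `binary_size` gives 4 and `text_size` gives 6. Note that small values can be shorter as text (7 is a single character), so the space argument depends on the data.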

The advantage of text is that it is much easier for people to read than binary data or hex dumps of the data.

I found your answer more understandable. :-) – Alcott Aug 16 '12 at 06:28

score 7 · Answer 2 · answered Aug 16 '12 at 06:04

If your program is the only program that is going to use the file, you can save internal structures "as is" using binary files.

However, if you want to exchange the files with other programs, or over the Internet, then binary formats are not that good. Think for example about the problem with big-endian vs. little-endian machines. Also, the receiver of the files or data will most likely not have access to your code and your structures, so a text-based format might be easier to parse and map onto their own structures.

About performance, it's true that reading and writing your internal structures directly will be quicker, because you don't have to translate them (also known as marshaling) into another format.
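If a binary format does have to cross machines, the byte-order problem mentioned above is usually handled by marshaling each value into a fixed byte order. A minimal sketch, assuming big-endian ("network") order is the agreed format; the function names are made up for illustration:

```cpp
#include <array>
#include <cstdint>

// Marshal a 32-bit value into big-endian byte order, independent of the
// byte order of the machine running the code.
std::array<unsigned char, 4> to_big_endian(std::uint32_t v) {
    return {static_cast<unsigned char>(v >> 24),
            static_cast<unsigned char>(v >> 16),
            static_cast<unsigned char>(v >> 8),
            static_cast<unsigned char>(v)};
}

// Rebuild the value from four big-endian bytes.
std::uint32_t from_big_endian(const std::array<unsigned char, 4>& b) {
    return (std::uint32_t{b[0]} << 24) | (std::uint32_t{b[1]} << 16) |
           (std::uint32_t{b[2]} << 8)  |  std::uint32_t{b[3]};
}
```

The shifts operate on the value rather than on its memory layout, so the same bytes come out on little-endian and big-endian hosts. This is the translation step that dumping the structure "as is" avoids.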

Some programmer dude

+1. And as you pointed out, I'm the only one using those data object/structures, and I want to save them into and retrieve them back from file. In this case, I don't think text file will help, by **text file**, you mean I should write each data object/structure's value into the file as **plain text**, and then read these text back and using them as value to construct the original data object? – Alcott Aug 16 '12 at 06:12
@Alcott If you're the only one reading and writing these files then you can use binary format, and just read/write the structures directly. However, be careful of pointers! Writing a structure containing a pointer writes the actual pointer value, not what it points to. When reading it later, the pointer will point to some unallocated memory area. Also, when reading and writing strings, think about the terminating `'\0'` character. – Some programmer dude Aug 16 '12 at 06:16
@Alcott If writing as text, you can use simple plain text, one value per line, or several values per line with a separator (e.g. CSV files). Or use more complicated formats like XML or JSON. It's totally up to you. :) – Some programmer dude Aug 16 '12 at 06:17
Thanks for the tips. What if I write those data structures into a file using binary mode, and later I read the file using text mode, will I still get the stuff I put into the file? – Alcott Aug 16 '12 at 06:21
@Alcott Opening a file in text mode might cause reading or writing a file to do some "translations" of certain characters. Most notably, newlines may be converted between `'\n'` and `"\r\n"`. So if some value in your data corresponds to `'\n'` (10 decimal), writing it in text mode might put two bytes (13 and 10) into the file instead of one, and a 13, 10 pair read back in text mode might come back as a single 10. – Some programmer dude Aug 16 '12 at 06:23
He asked about binary _mode_, not binary format (although he may not really understand the difference). If you're writing to the Internet, you want binary mode, not text, because you want to control the representation of things like line endings. – James Kanze Aug 16 '12 at 08:16
@JamesKanze, so what's the difference between binary mode and format? – Alcott Aug 16 '12 at 12:28
As I explained in my answer. The mode is the flag you use to open the file, and determines how the library maps line endings and end of file---and nothing else. The format is how you output (and thus read) the data: XML or JSon are text formats, XDR binary, for example. – James Kanze Aug 16 '12 at 12:41

score 7 · Answer 3 · answered Aug 16 '12 at 08:14

Historically, binary mode is to provide more or less transparent access to the underlying stream; text mode "normalizes" to a standard text representation, where lines are terminated by the single '\n' character. In addition, the system may impose restrictions on the size of a binary file, for example by requiring it to be a multiple of 128 or 512 bytes. (The first was the case of CP/M, the second of many of the DEC OS's.) Text files don't have this restriction, and in cases where the OS imposed it, the library will typically introduce an additional end of file character for text files. (Even today, most Windows libraries recognize the old CP/M end of file, 0x1A, when reading in text mode.) Because of these considerations, text mode is only defined over a limited set of binary values. (But if you write 200 bytes to a binary file, you may get back 256 or 512 when you re-read it. Historically, binary should only be used for text that is otherwise structured, so that you can recognize the logical end, and ignore these additional bytes.)

Also, you can seek pretty much arbitrarily in a file opened in binary mode; you can only seek to the beginning, or to a position you've previously memorized, in text mode. (This is because the line ending mappings mean that there is no simple relationship between the position in the file, and the position in the text stream.)
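The portable pattern for seeking in text mode is to memorize a position with `tellg()` and later pass exactly that value to `seekg()`. A minimal sketch; the file contents and function names are hypothetical:

```cpp
#include <fstream>
#include <string>

// Writes three short lines to `path` in text mode (demo data only).
void write_demo_lines(const std::string& path) {
    std::ofstream out(path);
    out << "alpha\n" << "beta\n" << "gamma\n";
}

// Reads the second line twice: once sequentially, and once after seeking
// back to a position memorized with tellg(). Returns the second read.
std::string reread_second_line(const std::string& path) {
    std::ifstream in(path);                           // text mode is the default
    std::string line;
    std::getline(in, line);                           // skip the first line
    const std::ifstream::pos_type mark = in.tellg();  // memorize the position
    std::getline(in, line);                           // first read of line two
    in.seekg(mark);                                   // fine: mark came from tellg()
    std::getline(in, line);                           // second read of line two
    return line;
}
```

Computing `mark` by hand instead, for example as "length of the first line plus one", is the kind of arithmetic that stops being reliable once line endings are translated.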

Note that this is orthogonal to whether the output is formatted or not: if you output using << (and input using >>), the IO is formatted, regardless of the mode in which the file was opened. And the formatting is always text; the iostreams are designed to manipulate streams of text, and only have limited support for non-text input and output.
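A short sketch of that distinction, with hypothetical helper names: both files below are opened in binary mode, yet `<<` still produces decimal text, while `write()` copies the bytes of the object unchanged.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Size of a file in bytes, measured in binary mode so nothing is translated.
std::streamoff file_size(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    return static_cast<std::streamoff>(in.tellg());
}

// Formatted output: operator<< converts the value to decimal text,
// even though the stream was opened in binary mode.
void write_formatted(const std::string& path, std::int32_t v) {
    std::ofstream out(path, std::ios::binary);
    out << v;
}

// Unformatted output: the object's bytes are written as they are in memory.
void write_raw(const std::string& path, std::int32_t v) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(&v), sizeof v);
}
```

With the value 123456, the formatted file holds the six characters `123456`, and the raw file holds four bytes whose order depends on the machine.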

Today, the situation has changed somewhat: in many cases, we expect what we write to be readable from other machines, which supposes a well defined format, which may not be the format used natively. (Thus, for example, the Internet expects the two byte sequence 0x0D, 0x0A as a line ending, which is different than what is used internally in Unix and many other OS's.) If portability is a concern, you generally define a format, write it explicitly, and use binary mode to ensure that what you write is exactly what is written; similarly on input, you use binary mode, and handle the conventions manually. If you're just writing to a local disk, which isn't shared, however, text mode is fine, and a bit less work.

Again, both of these apply to text. If you want a binary format, you must use binary mode, but that's far from sufficient. You'll have to implement all of the formatted IO yourself. In such cases, I generally don't use std::istream or std::ostream (whose abstraction is text), but rather define my own stream types, deriving from std::ios_base (for the error handling conventions), and using std::streambuf (for the physical IO).

Finally, don't neglect the fact that all IO is formatted in some manner. Just writing a block of memory out to the file means that the format is whatever the current implementation happens to give you (which is generally undocumented, which means that you probably won't be able to read it in the future). If all you're doing is spilling to disk, and the only time you'll read it is with the same program, compiled with the same version of the same compiler, using the same compiler options, then you can just dump memory, provided the memory in question is only PODs, and contains no pointers. Otherwise, you have to define (and document) the format you use, and implement it. In such cases, I'd suggest using an existing format, like XDR, rather than inventing your own: it's a lot easier to write "uses XDR format" as documentation, rather than describing the actual bit and byte layout for all of the different types.
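For the narrow "spill to disk and read back with the same build" case, a memory dump might look like the sketch below. `Sample`, `dump` and `load` are hypothetical names; the `static_assert` is one way to let the compiler reject types that cannot safely be copied byte by byte, though it does not detect raw pointers inside the struct.

```cpp
#include <fstream>
#include <string>
#include <type_traits>

// Hypothetical plain data type: no pointers, no virtual functions.
struct Sample {
    int id;
    double value;
    char tag;
};

static_assert(std::is_trivially_copyable<Sample>::value,
              "only trivially copyable types may be dumped as raw bytes");

// Writes the in-memory bytes of `s`, padding included, in binary mode.
bool dump(const std::string& path, const Sample& s) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(&s), sizeof s);
    out.flush();
    return static_cast<bool>(out);
}

// Reads the bytes back; succeeds only if a complete object was read.
bool load(const std::string& path, Sample& s) {
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(&s), sizeof s);
    return in.gcount() == static_cast<std::streamsize>(sizeof s);
}
```

The file layout here is whatever the compiler chose for `Sample` (sizes, padding, byte order), which is exactly why such a file is not a documented format.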

+1 for the detailed answer, but I can't say that I fully understand, :-). Why can't I seek arbitrarily in text mode? Using `seekg(pos)`, I can almost seek to every position of the file, right? — Alcott, Aug 16 '12 at 13:04
@Alcott Because the standard says it's undefined behavior. If `pos` is a value returned from a call to `tellg()`, or if `pos` is `0`, there is no problem. Otherwise, it's undefined behavior. (In fact, it will work under Unix, and place you slightly ahead of where you want to go under Windows. Under other OS's? Who knows.) — James Kanze, Aug 16 '12 at 13:42

score 3 · Answer 4 · answered Aug 16 '12 at 06:03

If you read/write a file in text mode, you are operating on text. It may be subject to encoding errors and OS-specific format changes, though sometimes it may work just fine. In binary mode, though, you will not run into these restrictions. Also, text mode may do funny things with \n characters, such as replacing them with \r\n.

The fopen reference, for example, says:

In the case of text files, depending on the environment where the application runs, some special character conversion may occur in input/output operations to adapt them to a system-specific text file format. In many environments, such as most UNIX-based systems, it makes no difference to open a file as a text file or a binary file; Both are treated exactly the same way, but differentiation is recommended for a better portability.

this replacement takes away some performance since the code has to inspect every single character. — Tobias Langner, Aug 16 '12 at 06:09
@TobiasLangner, so the `\n`/`\r\n` replacement will be a performance problem? — Alcott, Aug 16 '12 at 06:13

score 2 · Answer 5 · answered Aug 16 '12 at 06:07

Only a few operating systems are affected by the choice between binary and text mode. None of the Unix or Linux systems do anything special for text mode—that is, text is the same as binary.

Windows and VMS in particular transform data in text mode. Windows transforms \n into \r\n when writing to a file and the converse when reading. VMS has a file record structure to observe, so in the default mode, it translates \n into a record delimiter.
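The translation is easy to observe by writing a single `'\n'` and then measuring the file in binary mode. This is a sketch with a hypothetical function name; the result for text mode depends on the platform.

```cpp
#include <fstream>
#include <string>

// Writes one '\n' to `path` in the requested mode, then returns the number
// of bytes that actually ended up in the file.
std::streamoff size_after_writing_newline(const std::string& path, bool binary) {
    std::ios_base::openmode mode = std::ios::out;
    if (binary) mode |= std::ios::binary;
    {
        std::ofstream out(path, mode);
        out << '\n';
    }  // the stream is closed here, so the data is on disk
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    return static_cast<std::streamoff>(in.tellg());
}
```

In binary mode the file holds one byte. In text mode it holds one byte on Unix-like systems and two bytes (`\r\n`) on Windows.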

Where it is different, binary is faster. If it is not different, it makes no difference.

wallyk

If it is different, will the performance difference be significant? – Alcott Aug 16 '12 at 06:16
@Alcott: In ordinary cases, I would not expect a significant difference in performance. However, it would be easy to construct a test where there is a significant difference simply by heavy use of `\n` and light on everything else. At the worst, Windows would double the amount of data being written and VMS would go bonkers creating lots of records. – wallyk Aug 16 '12 at 08:53

score 2 · Answer 6 · answered Aug 16 '12 at 06:11

In binary mode, every byte can take any of 256 values, whereas text uses hardly more than 100 distinct characters. So binary gives you more than twice as many distinct values per byte for storing data.
Furthermore, there are cases where you have to abide by a structure specification, such as a network packet format like IPv4.

Let us take an example

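// Assume no padding
struct abc
{
 unsigned int a:5;  // widened from 4 bits so that the value 24 fits
 char b;
 double c;
} A[]={{.a=4,.b='a',.c=7.45},{.a=24,.b='z',.c=3.2}};  // typedef dropped: a typedef cannot have an initializer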

Isn't it difficult to store bit fields in text mode? Obviously you are going to lose a lot of information.

However, you can save a data object in a text format, as MIME does, but it will require an extra routine to convert it back to binary, and performance suffers.
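To make that "extra routine" concrete, here is a minimal sketch of a text round trip for a hypothetical `Record` with a small bit field. The space-separated layout is an assumption chosen for brevity, not a standard format.

```cpp
#include <sstream>
#include <string>

// Hypothetical record with a 4-bit field, a character and a double.
struct Record {
    unsigned a : 4;
    char b;
    double c;
};

// Text form: the three values separated by single spaces.
std::string to_text(const Record& r) {
    std::ostringstream out;
    out << r.a << ' ' << r.b << ' ' << r.c;
    return out.str();
}

// Parses the text form. A bit field cannot be bound to a reference, so the
// value is read into a temporary and range-checked before assignment.
bool from_text(const std::string& line, Record& r) {
    std::istringstream in(line);
    unsigned a = 0;
    char b = 0;
    double c = 0.0;
    if (!(in >> a >> b >> c) || a > 15) return false;
    r.a = a;
    r.b = b;
    r.c = c;
    return true;
}
```

Two limits of this sketch: `>>` skips whitespace, so a `b` that is itself a space would not survive, and the default stream precision of 6 significant digits can drop digits of `c` unless a higher precision is set.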

perilbrain

+1 for the code. In your code, you mean I'm better off writing `A` into file using text mode? If so, how? Just write each data member's value into file as plain text and then read the values back to create the data object? – Alcott Aug 16 '12 at 06:26
:) It will be very difficult. You can write in text mode using a format such as XML, like `4`, but in the end you will have to convert to binary for normal operation. In plain binary, just keep dumping the values of the struct into the file. During the read operation, if the destination struct matches the specification, you won't have to worry about how to read. As the cursor advances, the arrays will keep filling up. – perilbrain Aug 16 '12 at 06:36

score 0 · Answer 7 · answered Sep 09 '16 at 13:32

Binary format is more accurate for storing numbers, as they are stored in their exact internal representation. There are no conversions while saving the data, and therefore saving is much faster.

astha
