Write data from C++ Vector to text file fast

Question

I would like to know the fastest way to write data from a vector<string> to a text file, as the data will involve a few million lines.

I have tried ofstream (<<) in C++ as well as fprintf in C, but the performance difference between them is small, judging by the time I recorded for generating the required file.

vector<string> OBJdata;

OBJdata = assembleOBJ(pointer, vertexCount, facePointer);

FILE * objOutput;
objOutput = fopen("sample.obj", "wt");
for (int i = 0; i < OBJdata.size(); i++)
{
    fwrite(&OBJdata[i],1, sizeof(OBJdata[i].length()),objOutput );
}
fclose(objOutput);

you'll need to arrange the buffer into huge chunks or one big buffer for the strings if possible, then writing it out can be done in 1 call — paulm, Apr 20 '15 at 07:33
@paulm, can i know what do you meant by arranging the buffer in huge chunks?? Basically, i have all those lines of string appended to a vector before writing it all out. — vincent911001, Apr 20 '15 at 07:35
You may want to look here: http://stackoverflow.com/questions/11563963/writing-a-binary-file-in-c-very-fast — demonplus, Apr 20 '15 at 07:37
@demonplus, fwrite is applicable for writing out binary file, right?? However, i want to output an .obj file(3D) which is in ascii format, so, is that possible?? — vincent911001, Apr 20 '15 at 07:50
@demonplus, i have tried your suggestion just now, but the output is not in text, thus, i think that might have something to do with my code. Can you see the edited post. — vincent911001, Apr 20 '15 at 08:09
I think it is wrong. The idea at the link given is to prepare a large buffer and write it at once, whereas you have a for loop calling fwrite as many times as your vector is long. Also, I am not sure what sizeof(vector) is. — demonplus, Apr 20 '15 at 08:14
@demonplus, thanks, can you give me some hint on how to setup large buffer to write all this data at once.. — vincent911001, Apr 20 '15 at 08:22
The most suitable I think is to allocate buffer of chars and fill it inside assembleOBJ and then write in once. — demonplus, Apr 20 '15 at 08:33
`sizeof(OBJdata[i].length())` It's hard to imagine that's really what you want. Why do you want the size of the length? — David Schwartz, Apr 20 '15 at 08:34
@demonplus, ok sure, i will try to implement it, thanks a lot for your patience and guidance. — vincent911001, Apr 20 '15 at 08:35
@DavidSchwartz, it is probably my misunderstanding in approaching the parameter of fwrite as i have seen in some examples that the third parameter represents the number of element, thus, i am using the size() to return the size. — vincent911001, Apr 20 '15 at 08:39
Right, but why are you computing the *size* of the *length*? You're not writing the length, so why do you care how big the length is? — David Schwartz, Apr 20 '15 at 08:59
@DavidSchwartz, my bad, so, can you clarify what would be the correct representation of the parameter?? — vincent911001, Apr 20 '15 at 09:04
@DavidSchwartz, thanks a lot, i think i understand the concept now.. — vincent911001, Apr 20 '15 at 09:11
@vincent911001 You are writing the binary representation of the string object, which is mostly a set of addresses, not the string you want to have. So normally it would look like this: fwrite(OBJdata[i].c_str(),1,OBJdata[i].length(),objOutput); — Meixner, Apr 20 '15 at 09:41
@vincent911001 seems Peter has written an answer explaining what I mean: all strings written in one file system call via a large buffer or chunked buffers of strings — paulm, Apr 20 '15 at 10:20
After rethinking that - preparing a buffer for a more efficient write: that is exactly what fprintf does - it provides buffered I/O. So without using some far more advanced technique, I do not think there will be a performance increase. One idea: if you can determine some other fixed point in the file, use two file streams, one starting at the beginning, the other in the middle (or at any other fixed point). — Mario The Spoon, Apr 20 '15 at 13:20
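
For reference, the correction described in the comments above (pass the string's character data and its length, rather than the address and sizeof of the string object) could be sketched roughly as follows. The helper name `write_lines_fwrite` is hypothetical, and appending '\n' after each entry is an assumption that the strings do not already end with a newline.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: writes each string followed by '\n' (assumes the
// strings carry no trailing newline). Returns false if opening, writing,
// or closing the file fails.
bool write_lines_fwrite(const std::vector<std::string>& lines, const char* path)
{
    std::FILE* out = std::fopen(path, "w");
    if (!out)
        return false;

    bool ok = true;
    for (std::size_t i = 0; i < lines.size() && ok; ++i)
    {
        const std::string& s = lines[i];
        // Write the characters held by the string, not the string object itself.
        ok = std::fwrite(s.data(), 1, s.size(), out) == s.size()
             && std::fputc('\n', out) != EOF;
    }
    // fclose flushes the stdio buffer, so its result matters as well.
    return std::fclose(out) == 0 && ok;
}
```

This only fixes the output; stdio already buffers these calls, so on its own it is not expected to be dramatically faster than fprintf.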

Peter · Accepted Answer · 2015-04-20T09:27:01.080

5

There is no "best". There are only options with different advantages and disadvantages, both of which vary with your host hardware (e.g. writing to a high performance drive will be faster than writing to a slower one), file system, and device drivers (the implementation of disk drivers can trade off performance to increase the chances of data being correctly written to the drive).

Generally, however, manipulating data in memory is faster than transferring it to or from a device like a hard drive. There are limits to this: with virtual memory, data in physical memory may in some circumstances be swapped out to disk.

So, assuming you have sufficient RAM and a fast CPU, an approach like

// assume your_stream is an object of a type derived from std::ostream
// THRESHOLD is a large-ish positive integer

std::string buffer;
buffer.reserve(THRESHOLD);
for (std::vector<std::string>::const_iterator i = yourvec.begin(), end = yourvec.end(); i != end; ++i)
{
     // flush first if appending this string plus a newline would reach the threshold
     if (buffer.length() + i->length() + 1 >= THRESHOLD)
     {
          your_stream << buffer;
          buffer.resize(0);
     }
     buffer.append(*i);
     buffer.append(1, '\n');
}
your_stream << buffer;   // write whatever remains

The strategy here is reducing the number of distinct operations that write to the stream. As a rule of thumb, a larger value of THRESHOLD will reduce the number of distinct output operations, but will also consume more memory, so there is usually a sweet spot somewhere in terms of performance. The problem is, that sweet spot depends on the factors I mentioned above (hardware, file system, device drivers, etc). So this approach is worth some effort to find the sweet spot only if you KNOW the exact hardware and host system configuration your program will run on (or you KNOW that the program will only be executed in a small range of configurations). It is not worth the effort if you don't know these things, since what works with one configuration will often not work for another.

Under Windows, you might want to use Windows API functions to work with the file (CreateFile(), WriteFile(), etc.) rather than C++ streams. That might give small performance gains, but I wouldn't hold my breath.

edited Apr 20 '15 at 09:27

answered Apr 20 '15 at 08:55

Peter


Thanks a lot Peter, will try it out later to see how it fares at my current system. – vincent911001 Apr 20 '15 at 09:01
`buffer.length() + i->length + 1`, can you clarify it??Thanks – vincent911001 Apr 20 '15 at 09:13
@vincent911001 The code in this answer should not only be reasonably fast, but it actually fixes *bugs* in the code in the question. First, you never want to write `&ObjData[i]`, since it is an `std::string`. Instead, pass `ObjData[i].c_str()`, which will return a pointer to the actual data. Second, the size you want to write out should be the size of the string data, computed as `ObjData[i].size()`. Applying `sizeof` to an `std::string` is misleading and wrong because what you care about is the size of the *data* contained in the string. – user4815162342 Apr 20 '15 at 09:18
@user4815162342, ya, thanks a lot for clarification, really appreciates it. – vincent911001 Apr 20 '15 at 09:23
Note: I've fixed a bug in my original code sample. The check of buffer.length() + i->length() + 1 is checking if appending the string plus newline exceeds threshold before trying to append them. – Peter Apr 20 '15 at 09:29
@Peter, sorry for late reply, how should i define the threshold value properly?? is it based on my memory or the size of data(size of vector) that i wished to output to?? – vincent911001 Apr 21 '15 at 07:28
It's hard to be specific on that. In general, you would need to test. Threshold would be more closely related to available physical memory (not virtual memory) than to size of the vector or strings in it. If you can analyse memory usage of programs that are likely to run with your program (before, during, or after) then 50% of what's left will probably be enough to avoid swapping. If not, no more than 10%. Those percentages are just guesses for a reasonable chance the program can run without swapping though. You'll still need to test if the performance meets your needs. – Peter Apr 21 '15 at 11:39
@Peter, thanks for your clarification, will need some time to digest it and try it out. Thanks a lot. By the way, based on what you have given me, I have restructured my code to use this approach from the beginning, from computing vertices to faces. It really does improve the writing time significantly: 800000 lines are output to a text file in roughly 3 seconds – vincent911001 Apr 22 '15 at 08:11

score 1 · Answer 2 · answered Apr 20 '15 at 08:31


You may want to take a look at writev that allows you to write multiple elements at once - thus taking better advantage of the buffering. See: http://linux.die.net/man/2/writev

answered Apr 20 '15 at 08:31

Mario The Spoon


Sorry, I haven't mentioned that I am working under Windows using Visual C++; nevertheless, thanks a lot for your help – vincent911001 Apr 20 '15 at 08:36
