
Efficient way of writing and reading mixed datatypes (e.g. unsigned integer, double, uint64_t, string) to a file in C++.

I need to write and read data containing mixed datatypes to and from disk. I used the following method to write the data, but it is turning out to be very slow.

#include <fstream>
#include <string>
#include <cstdint>
using namespace std;

fstream myFile;
myFile.open("myFile", ios::binary | ios::out);  // open modes are combined with |
double x;  // with appropriate initialization
myFile << x;
int y;
myFile << y;
uint64_t z;
myFile << z;
string myString;
myFile << myString;

However, this method is turning out to be very inefficient for large data of around 20 GB. Can someone please suggest how I can quickly read and write mixed datatypes in C++?

Jannat Arora
  • If you set out to parse 20 _gigabytes_ of data, you should probably perform some research on how to efficiently accomplish that task, starting from the basics. Do you have a colleague who may be able to help you? – Lightness Races in Orbit Feb 18 '17 at 02:25
  • @LightnessRacesinOrbit Unfortunately no – Jannat Arora Feb 18 '17 at 02:26
  • Then, and pardon me for saying so, it seems you and your team are ill-suited to this task. How did you end up saddled with it? – Lightness Races in Orbit Feb 18 '17 at 02:41
  • I haven't tried something like this before. Is this slower than just writing an array of integers with the same number of bytes? The I/O is buffered, so I guess the type of data doesn't exactly matter; all types of data are just bytes. How do you read/write the file? Do you need insertion/deletion/access by index? Is it like a table where each row contains an int, a uint64_t, and a string? http://stackoverflow.com/questions/10770675/fastest-way-to-write-data-stream-to-disk – hamster on wheels Feb 18 '17 at 06:01
  • C vs C++: http://codereview.stackexchange.com/questions/94352/writing-a-large-file-of-numbers – hamster on wheels Feb 18 '17 at 06:08
  • You're using formatted output for "binary" data; of course it's going to be huge and slow. ;-] – ildjarn Feb 18 '17 at 13:21

1 Answer


I think the first thing you need to determine is whether or not your program actually is slow.

What do I mean by that? Of course you think it is slow, but is it slow because your particular program is inefficient, or is it slow simply because writing 20 gigabytes of data to disk is an inherently time-consuming operation to perform?

So the first thing I would do is run some benchmark tests on your hard drive to determine its raw speed (in megabytes-per-second, or whatever). There are commercial apps that do this, or you could just use a built-in utility (like dd on Unix or Mac) to give you a rough idea of how long it takes your particular hard drive to read or write 20 gigabytes of dummy data:

dd if=/dev/zero of=junk.bin bs=1024 count=20971520

dd if=junk.bin of=/dev/zero bs=1024

If dd (or whatever) is able to transfer the data significantly faster than your program can, then there is room for your program to improve. On the other hand, if dd's speed isn't much faster than your program's speed, then there's nothing you can do other than go out and buy a faster hard drive (or maybe an SSD or a RAM drive or something).

Assuming the above test does indicate that your program is less efficient than it could be, the first thing I would try is replacing your C++ iostream calls with an equivalent implementation that uses the C fopen()/fread()/fwrite()/fclose() API calls instead. Some C++ iostream implementations are known to be somewhat inefficient, but it's unlikely that the (simpler) C I/O APIs are inefficient. If nothing else, comparing the performance of the C++ and C versions would let you either confirm or deny that your C++ library's iostreams implementation is a bottleneck.
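As a rough sketch of what that C-API comparison might look like (field names here are only illustrative and error handling is minimal), writing the same fields as raw bytes with fwrite() instead of formatted text:

#include <cstdio>
#include <cstdint>
#include <string>

// Write one record of mixed fields as raw bytes rather than formatted text.
void writeRecord(FILE * f, double x, int y, uint64_t z, const std::string & s)
{
   fwrite(&x, sizeof(x), 1, f);
   fwrite(&y, sizeof(y), 1, f);
   fwrite(&z, sizeof(z), 1, f);

   // Prefix the string with its length so it can be read back unambiguously
   uint64_t len = s.size();
   fwrite(&len, sizeof(len), 1, f);
   fwrite(s.data(), 1, len, f);
}

int main()
{
   FILE * f = fopen("myFile", "wb");
   if (f == NULL) return 1;
   writeRecord(f, 3.14, 42, 123456789, "hello");
   fclose(f);
   return 0;
}

Reading is the mirror image with fread(); since nothing is converted to or from text, the per-field cost is essentially a memcpy into the stdio buffer.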

If even the C API doesn't get you the speed you need, the next thing I would look at is changing your file format to something that is easier to read or write; for example, assuming you have sufficient memory, you might just use mmap() to associate a large block of virtual address space with the contents of a file, and then just read/write the file contents as if they were RAM. (That might or might not make things faster, depending on how you access the data).
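For illustration only, a minimal read-side sketch of that mmap() idea on a POSIX system, assuming the file already exists, fits in the address space, and starts with a double as in your example:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <cstdio>

int main()
{
   int fd = open("myFile", O_RDONLY);
   if (fd < 0) return 1;

   struct stat st;
   if (fstat(fd, &st) != 0) {close(fd); return 1;}

   // Map the whole file; the kernel pages it in lazily as it is touched
   void * p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
   if (p == MAP_FAILED) {close(fd); return 1;}

   // The mapping can now be read like an in-RAM buffer, e.g. the leading double
   double firstValue;
   memcpy(&firstValue, p, sizeof(firstValue));
   printf("first value: %f\n", firstValue);

   munmap(p, st.st_size);
   close(fd);
   return 0;
}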

If all else fails, the final thing to do is reduce the amount of data you need to read or write. Are there parts of the data that you can store separately so that you don't need to read and write them every time? Is there data there that you can store more compactly (e.g. perhaps there are commonly used strings in your data that you could store as integer codes instead of strings)? What if you use zlib to compress the data before you write it, so that there is less data to write? The data you appear to be writing in your example looks like it might be amenable to compression, perhaps reducing your 20GB file to a 5GB file or so. Etc. A hedged sketch of that zlib idea follows; writeCompressed() is a made-up helper name, and a real implementation would also need chunking and error handling for 20 GB of data:

#include <zlib.h>
#include <cstdio>
#include <cstdint>
#include <vector>

// Compress an in-memory buffer with zlib before writing it out, storing the
// uncompressed size first so the reader knows how large a buffer to uncompress() into.
bool writeCompressed(FILE * f, const std::vector<unsigned char> & raw)
{
   uLongf packedLen = compressBound(raw.size());
   std::vector<unsigned char> packed(packedLen);

   if (compress(packed.data(), &packedLen, raw.data(), raw.size()) != Z_OK) return false;

   uint64_t rawSize = raw.size();
   fwrite(&rawSize, sizeof(rawSize), 1, f);
   fwrite(packed.data(), 1, packedLen, f);
   return true;
}

Jeremy Friesner