4

Possible Duplicate:
Writing a binary file in C++ very fast

I have a large number of unsigned 32 bit integers in memory (1.5 billion entries). I need to write them to a file and read them back into main memory.

Now, I do it using:

ofstream ofs;
ofs.open(filename);
for (uint64_t i = 0; i < 1470000000; i++)
    ofs << integers[i] << " ";

and

ifstream ifs;
ifs.open(filename);
for (uint64_t i = 0; i < 1470000000; i++)
    ifs >> integers[i];

This takes a few minutes to execute. Can anybody help me: is there any library method to do this in a faster way? Or any suggestion so I can run a performance test? Can anybody show me some simple C++ code that uses mmap for doing the above (on Linux)?

EDIT: EXAMPLE CASE

#include <iostream>
#include <stdint.h>
#include <cstdio>
#include <cstdlib>
#include <sstream>

using namespace std;

int main()
{
    uint32_t* ele = new uint32_t[100];
    for (int i = 0; i < 100; i++)
        ele[i] = i;

    for (int i = 0; i < 100; i++) {
        if (ele[i] < 20)
            continue;
        else
            // write ele[i] to file
            ;
    }

    for (int i = 0; i < 100; i++) {
        if (ele[i] < 20)
            continue;
        else
            // read number from file
            // ele[i] = number * 10 ;
            ;
    }

    std::cin.get();
}
alessandro
  • Why not just write this as a binary file? – Paul R Jan 23 '13 at 09:03
  • I want to write each integer one by one, not the whole array. – alessandro Jan 23 '13 at 09:11
  • @PaulR You still have to format the data, or risk not being able to read it later. Of course, binary formatting can require a lot less CPU than text formatting. Not that we know that the time is due to the formatting; it could just as easily be due to the physical IO. – James Kanze Jan 23 '13 at 09:14
  • Sorry, I forgot to mention: I really don't need file portability. And I don't know what the size will be, so I have to write them one by one. – alessandro Jan 23 '13 at 09:19
  • Another thing: I want READ PERFORMANCE to be much better. – alessandro Jan 23 '13 at 09:30

4 Answers

3

The first thing to do is to determine where the time is going. Formatting and parsing text isn't trivial, and can take some time, but so can the actual writing and reading, given the size of the file. The second thing is to determine how "portable" the data have to be: the fastest solution is almost certainly to mmap (or its Windows equivalent) the array to the file directly, and never read or write. This doesn't provide a portable representation, however, and even upgrading the compiler might make the data unreadable. (Unlikely for 32 bit integers today, but it has happened in the past).
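
For illustration only (this is not part of the original answer), a minimal sketch of the mmap approach on Linux might look like the following; the helper names are made up, error handling is skeletal, and it assumes the native byte order and element width in the file are acceptable:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// Dump the whole array into a file through a shared mapping.
bool writeViaMmap( const char* filename, const uint32_t* data, size_t count )
{
    size_t bytes = count * sizeof(uint32_t);
    int fd = open( filename, O_RDWR | O_CREAT | O_TRUNC, 0644 );
    if ( fd == -1 ) return false;
    if ( ftruncate( fd, bytes ) == -1 ) { close( fd ); return false; }
    void* map = mmap( nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
    close( fd );                             // the mapping stays valid after close
    if ( map == MAP_FAILED ) return false;
    std::memcpy( map, data, bytes );         // the kernel writes the pages back lazily
    return munmap( map, bytes ) == 0;
}

// Map the file read-only; the returned pointer can be used like the original array.
const uint32_t* readViaMmap( const char* filename, size_t& count )
{
    int fd = open( filename, O_RDONLY );
    if ( fd == -1 ) return nullptr;
    struct stat st;
    if ( fstat( fd, &st ) == -1 ) { close( fd ); return nullptr; }
    count = st.st_size / sizeof(uint32_t);
    void* map = mmap( nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    close( fd );
    return map == MAP_FAILED ? nullptr : static_cast<const uint32_t*>( map );
}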

In general, if the time is going to reading and writing, you will want to investigate using mmap. If it is going to formatting and parsing, you will want to investigate some sort of binary format; this could also help reading and writing if it makes the resulting files smaller. The simplest binary format, writing the values in the usual network byte order (big-endian), requires no more than:

void
writeInt( std::ostream& dest, int32_t integer )
{
    dest.put( (integer >> 24) & 0xFF );
    dest.put( (integer >> 16) & 0xFF );
    dest.put( (integer >>  8) & 0xFF );
    dest.put( (integer      ) & 0xFF );
}

int32_t
readInt( std::istream& source )
{
    int32_t results = 0;
    results  = source.get() << 24;
    results |= source.get() << 16;
    results |= source.get() <<  8;
    results |= source.get();
    return results;
}

(Some error checking obviously needs to be added.)

If many of the integers are actually small, you could try some variable length encoding, such as that used in Google Protocol Buffers. If most of your integers are in the range -64...63, this could result in a file only a quarter of the size (which again, will improve the time necessary to read and write).
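
As an illustration (again, not from the original answer), a protobuf-style base-128 varint for the unsigned 32-bit values could look like this; each byte carries seven bits of payload, the top bit says whether more bytes follow, and some error checking obviously needs to be added here as well:

#include <cstdint>
#include <istream>
#include <ostream>

void
writeVarint( std::ostream& dest, uint32_t value )
{
    while ( value >= 0x80 ) {
        dest.put( static_cast<char>( (value & 0x7F) | 0x80 ) );   // more bytes follow
        value >>= 7;
    }
    dest.put( static_cast<char>( value ) );                       // final byte, top bit clear
}

uint32_t
readVarint( std::istream& source )
{
    uint32_t results = 0;
    int shift = 0;
    int byte;
    do {
        byte = source.get();
        results |= static_cast<uint32_t>( byte & 0x7F ) << shift;
        shift += 7;
    } while ( (byte & 0x80) != 0 && shift < 35 );
    return results;
}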

James Kanze
  • Sorry, I am very new to C++. The above code is a little difficult for me to understand. I have added an example case in my OP to illustrate what I want. Can you kindly modify the above code to show your suggestion? I really need read performance to be better. Write performance and file portability are not really my concern. – alessandro Jan 23 '13 at 10:20
  • If you're not concerned about being able to read the data in the future, then the fastest solution is `mmap`. For reading, this is a bit tricky, as you don't know the size beforehand, but both Windows and Unix have functions which will allow you to find out. But I think if you'll read and write as above, you'll probably be fast enough. (And one important thing I forgot: you must open the file in binary mode, or the above won't work.) – James Kanze Jan 23 '13 at 18:38
  • I just looked at your example code. You can't use `mmap`, so you might as well use the above. – James Kanze Jan 23 '13 at 18:40
2

If you know the size, just fwrite/fread the whole array.
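
A rough sketch of what that could look like (the helper names and file handling are mine, not the answer's), assuming the element count is known up front and the file is only read back on a machine with the same byte order:

#include <cstdio>
#include <cstdint>

// Write the whole array in one call.
bool writeAll( const char* filename, const uint32_t* data, size_t count )
{
    FILE* f = std::fopen( filename, "wb" );
    if ( !f ) return false;
    size_t written = std::fwrite( data, sizeof(uint32_t), count, f );
    std::fclose( f );
    return written == count;
}

// Read it back into a buffer of the same size.
bool readAll( const char* filename, uint32_t* data, size_t count )
{
    FILE* f = std::fopen( filename, "rb" );
    if ( !f ) return false;
    size_t got = std::fread( data, sizeof(uint32_t), count, f );
    std::fclose( f );
    return got == count;
}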

user877329
  • At the cost of not being able to read the file in the future, or on a different machine. – James Kanze Jan 23 '13 at 09:15
  • Is that *really* an issue, given the dominance of the x86-64 architecture? If it is, write the byte order at the beginning of the file. – user877329 Jan 23 '13 at 09:17
  • Probably. It was on an 8086 where I've seen the problem between compiler versions. And of course, Sparc's, HP PA's, and IBM Power PC are far from dead, not to mention mainframes. I don't see an x86-64 dominance *except* for desktop machines. (In fact, ARM processors clearly outnumber x86-64.) – James Kanze Jan 23 '13 at 09:29
  • Sorry, I don't want to write the whole array. I have added an example case in my OP to illustrate what I want. – alessandro Jan 23 '13 at 10:18
  • @JamesKanze You are right in these stats. But it is still possible to read LE files on non-LE machines. So it is a matter of rewriting the code and you will still be able to read the file. – user877329 Jan 23 '13 at 11:08
  • @JamesKanze x86-64 also rules in HPC: https://en.wikipedia.org/wiki/TOP500#/media/File:Processor_families_in_TOP500_supercomputers.svg, though Power is still common. – user877329 Jul 08 '16 at 18:29
2

You can likely get better performance by using a bigger buffer for both your input and output streams:

ofstream ofs;
char * obuffer = new char[bufferSize];
ofs.rdbuf ()->pubsetbuf (obuffer, bufferSize);
ofs.open (filename);

ifstream ifs;
char * ibuffer = new char[bufferSize];
ifs.rdbuf ()->pubsetbuf (ibuffer, bufferSize);
ifs.open (filename);

Also, ifs >> integers[i]; is a fairly slow way to parse integers. Try reading lines and then using std::strtol() to parse them. In my experience, it is measurably faster.
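
A rough sketch of that idea (not from the original answer), assuming the file holds whitespace-separated decimal values as written above; std::strtoul is used here instead of std::strtol because the values are unsigned:

#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

std::vector<uint32_t> parseFile( const char* filename )
{
    std::vector<uint32_t> values;
    std::ifstream ifs( filename );
    std::string line;
    while ( std::getline( ifs, line ) ) {
        const char* p = line.c_str();
        char* end = nullptr;
        for ( ; ; p = end ) {
            unsigned long v = std::strtoul( p, &end, 10 );
            if ( end == p )
                break;                        // no more numbers on this line
            values.push_back( static_cast<uint32_t>( v ) );
        }
    }
    return values;
}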

wilx
  • I have added an example case in my OP to illustrate what I want. Can you kindly modify the above code to show your suggestion? I really need read performance to be better. Write performance is not really my concern. – alessandro Jan 23 '13 at 11:19
  • The iostreams have generally optimized the buffer size, so it's unlikely that you'll gain much here. – James Kanze Jan 23 '13 at 18:39
0

If you just want to copy, you can use this for better performance:

std::ifstream  input("input");
std::ofstream output("output");
output << input.rdbuf();

Or maybe setting the buffer size will increase the speed:

char cbuf[buf_size];
ifstream fin;
fin.rdbuf()->pubsetbuf(cbuf,buf_size);

I didn't consider the long int issue in my answer because I simply don't know why it should affect stream performance, but I hope this helps anyway.

Kadir Erdem Demir
  • Sorry, I don't want to copy an input stream to an output stream. I have added an example case in my OP to illustrate what I want. Can you kindly modify the above code to show your suggestion? I really need read performance to be better. Write performance is not really my concern. – alessandro Jan 23 '13 at 10:17