10

I need to read a large binary file (~1GB) into a std::vector<double>. I'm currently using infile.read to copy the whole thing into a char * buffer (shown below), and I plan to convert it to doubles with reinterpret_cast. Surely there must be a way to just put the doubles straight into the vector?

I'm also not sure about the format of the binary file; the data was produced in Python, so it's probably all floats.

ifstream infile(filename, std::ifstream::binary);

infile.seekg(0, infile.end);     // N is the file size in bytes
N = infile.tellg();              
infile.seekg(0, infile.beg);

char * buffer = new char[N];

infile.read(buffer, N);
Aaron Sear
  • Is there a reason you want to use doubles? Binary data is normally represented as a char since it, on most platforms, occupies a single byte. – Freddy Feb 24 '15 at 22:52
  • 4
    If you don't know the format of the file how did you plan to convert it? – Retired Ninja Feb 24 '15 at 22:53
  • 1
    Ummmmm.... you want to read a binary file not knowing the format as something else than just a stream of bytes? – luk32 Feb 24 '15 at 22:54
  • 2
    Map the file into memory, construct the vector from that. – Alan Stokes Feb 24 '15 at 22:56
  • Nobody has said anything about endianness yet. Maybe portability doesn't matter in this case. Also, asking the OS for 1GB of contiguous data is not generally a great idea. Consider whether a container like `std::deque` would suit your requirements. – paddy Feb 25 '15 at 00:27
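
For reference, a minimal POSIX sketch of the memory-mapping route suggested in the comments above (mmap/munmap are Linux/macOS-only, and read_via_mmap is just a placeholder name). It maps the file read-only and then copies the mapped bytes into the vector:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <stdexcept>
#include <vector>

// Map the whole file and copy its contents into a vector of doubles.
std::vector<double> read_via_mmap(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        throw std::runtime_error("open failed");

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); throw std::runtime_error("fstat failed"); }

    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       // the mapping stays valid after close
    if (p == MAP_FAILED)
        throw std::runtime_error("mmap failed");

    // Copy the mapped bytes into the vector, then drop the mapping.
    const double* begin = static_cast<const double*>(p);
    std::vector<double> buf(begin, begin + st.st_size / sizeof(double));

    munmap(p, st.st_size);
    return buf;
}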

1 Answer

12

Assuming the entire file consists of doubles; otherwise this won't work properly.

std::vector<double> buf(N / sizeof(double)); // allocate (and zero-initialize) N / sizeof(double) doubles
infile.read(reinterpret_cast<char*>(buf.data()), buf.size() * sizeof(double)); // or &buf[0] for C++98
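
For completeness, a fuller sketch of the same approach with basic error checking. It assumes the file holds raw IEEE-754 64-bit doubles in the machine's native byte order; read_doubles and filename are just placeholder names:

#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Read a whole binary file of native-endian doubles into a vector.
std::vector<double> read_doubles(const std::string& filename)
{
    std::ifstream infile(filename, std::ifstream::binary);
    if (!infile)
        throw std::runtime_error("cannot open " + filename);

    // Determine the file size in bytes.
    infile.seekg(0, std::ifstream::end);
    const std::streamsize bytes = infile.tellg();
    infile.seekg(0, std::ifstream::beg);

    if (bytes % static_cast<std::streamsize>(sizeof(double)) != 0)
        throw std::runtime_error("file size is not a multiple of sizeof(double)");

    // The vector's storage is contiguous, so read() can fill it directly.
    std::vector<double> buf(bytes / sizeof(double));
    infile.read(reinterpret_cast<char*>(buf.data()), bytes);
    if (infile.gcount() != bytes)
        throw std::runtime_error("short read");

    return buf;
}

read() can transfer fewer bytes than requested if the stream hits an error, so checking gcount() guards against silently truncated data.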
Tony J
  • 1
    The declaration zero-initialises, which seems a bit wasteful. – Alan Stokes Feb 24 '15 at 22:59
  • @AlanStokes On the other hand, the IO operation itself probably is the bottleneck. I would say this is fine until measurement proves it to be significant. – Baum mit Augen Feb 24 '15 at 23:11
  • You certainly do not want to (and most certainly cannot) save 1GB of data in an `std::array`. – Baum mit Augen Feb 24 '15 at 23:22
  • Btw, you also need to be sure the binary dump and the `vector` backing storage use the same alignment. It *should* be safe with `double`... but "should" can blow up some sunny day. – luk32 Feb 24 '15 at 23:25
  • 2
    @BaummitAugen Curious, why isn't std::array suitable for large data? Is the storage on the stack? – Tony J Feb 24 '15 at 23:25
  • @luk32, This is safe. You may be thinking of the reverse case, where it's a std::vector v and v.data() is being cast to a double* – Tony J Feb 24 '15 at 23:30
  • @BaummitAugen 1GB isn't that much these days. Both `array` and `vector` store the data contiguously; why do you think there would be a problem? – Alan Stokes Feb 24 '15 at 23:32
  • @Alan Because various sources such as http://stackoverflow.com/questions/4424579/stdvector-versus-stdarray-in-c or [here](http://binglongx.com/2011/05/08/c-and-c-arrays-on-stack-and-heap/) state it might be stored on the stack (you need a workaround for it to be allocated on the heap). I cannot find any sources for this though =/ – luk32 Feb 24 '15 at 23:34
  • @Tony Not necessarily. This code assumes the binary dump held in `infile` is compatible with the run-time environment. "*entire file is double*" is not precise enough. – luk32 Feb 24 '15 at 23:43
  • 4
    @AlanStokes @TonyJiang Because `std::array` is (by design) a zero-overhead wrapper for C-arrays (the `int arr[100];` kind). Those end up on the stack (unless the array has `static` storage duration or is allocated dynamically, but don't do the latter in this case). – Baum mit Augen Feb 24 '15 at 23:57
  • I'm using std::vector to keep the data compatible with a load of pre-existing code I've been given to write around. I might swap the vectors for arrays once I've got everything else working – Aaron Sear Feb 25 '15 at 13:32
  • Isn't it undefined behavior to write to a reinterpreted char* unlike reading it? – asu Nov 03 '16 at 18:37
  • @Asu No, it's not UB, because it's reading raw binary data into a buffer, which just happens to be the double* storage allocated by the std::vector. – Tony J Nov 14 '16 at 23:25
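
Regarding the endianness point raised in the comments on the question: the raw read above assumes the file and the host use the same byte order. If they ever differ, every value has to be byte-swapped after reading. A minimal sketch of one way to do that (byteswap_double and byteswap_all are hypothetical helpers, not standard functions):

#include <cstdint>
#include <cstring>
#include <vector>

// Reverse the byte order of a single double via an integer copy
// (memcpy avoids aliasing problems).
double byteswap_double(double d)
{
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    std::uint64_t swapped = 0;
    for (int i = 0; i < 8; ++i)
        swapped = (swapped << 8) | ((bits >> (8 * i)) & 0xFF);
    std::memcpy(&d, &swapped, sizeof d);
    return d;
}

// Apply to the whole buffer only when the file's byte order
// differs from the host's.
void byteswap_all(std::vector<double>& buf)
{
    for (double& d : buf)
        d = byteswap_double(d);
}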