102

Lately I've been asked to write a function that reads a binary file into a std::vector<BYTE>, where BYTE is an unsigned char. Quite quickly I came up with something like this:

#include <fstream>
#include <vector>
typedef unsigned char BYTE;

std::vector<BYTE> readFile(const char* filename)
{
    // open the file:
    std::streampos fileSize;
    std::ifstream file(filename, std::ios::binary);

    // get its size:
    file.seekg(0, std::ios::end);
    fileSize = file.tellg();
    file.seekg(0, std::ios::beg);

    // read the data:
    std::vector<BYTE> fileData(fileSize);
    file.read((char*) &fileData[0], fileSize);
    return fileData;
}

which seems to be unnecessarily complicated and the explicit cast to char* that I was forced to use while calling file.read doesn't make me feel any better about it.


Another option is to use std::istreambuf_iterator:

std::vector<BYTE> readFile(const char* filename)
{
    // open the file:
    std::ifstream file(filename, std::ios::binary);

    // read the data:
    return std::vector<BYTE>((std::istreambuf_iterator<char>(file)),
                              std::istreambuf_iterator<char>());
}

which is pretty simple and short, but still I have to use the std::istreambuf_iterator<char> even when I'm reading into std::vector<unsigned char>.


The last option that seems to be perfectly straightforward is to use std::basic_ifstream<BYTE>, which kinda expresses it explicitly that "I want an input file stream and I want to use it to read BYTEs":

std::vector<BYTE> readFile(const char* filename)
{
    // open the file:
    std::basic_ifstream<BYTE> file(filename, std::ios::binary);

    // read the data:
    return std::vector<BYTE>((std::istreambuf_iterator<BYTE>(file)),
                              std::istreambuf_iterator<BYTE>());
}

but I'm not sure whether basic_ifstream is an appropriate choice in this case.

What is the best way of reading a binary file into the vector? I'd also like to know what's happening "behind the scenes" and what problems I might encounter (apart from the stream not being opened properly, which can be caught with a simple is_open check).

Is there any good reason why one would prefer to use std::istreambuf_iterator here?
(the only advantage that I can see is simplicity)

jww
LihO
  • 1
    @R.MartinhoFernandes: What I meant with it was that 3rd option doesn't seem to be any better than 2nd option. – LihO Feb 28 '13 at 15:30
  • Someone measured it (in 2011), for loading into a string at least: http://insanecoding.blogspot.hk/2011/11/how-to-read-in-file-in-c.html – jiggunjer Jul 12 '15 at 16:22
  • A safer way to find the size: use the special [`ignore()`](http://en.cppreference.com/w/cpp/io/basic_istream/ignore) count: `file.ignore(std::numeric_limits<std::streamsize>::max());`, and return the `std::streamsize` 'extracted' using `auto size =`[`file.gcount();`](http://en.cppreference.com/w/cpp/io/basic_istream/gcount) – Brett Hale Aug 01 '16 at 01:43

5 Answers

62

When testing for performance, I would include a test case for:

std::vector<BYTE> readFile(const char* filename)
{
    // open the file:
    std::ifstream file(filename, std::ios::binary);

    // Stop eating new lines in binary mode!!!
    file.unsetf(std::ios::skipws);

    // get its size:
    std::streampos fileSize;

    file.seekg(0, std::ios::end);
    fileSize = file.tellg();
    file.seekg(0, std::ios::beg);

    // reserve capacity
    std::vector<BYTE> vec;
    vec.reserve(fileSize);

    // read the data:
    vec.insert(vec.begin(),
               std::istream_iterator<BYTE>(file),
               std::istream_iterator<BYTE>());

    return vec;
}

My thinking is that the constructor of Method 1 touches the elements in the vector, and then the read touches each element again.

Method 2 and Method 3 look most promising, but could suffer one or more reallocations as the vector grows. Hence the reason to reserve before reading or inserting.

I would also test with std::copy:

...
std::vector<BYTE> vec;
vec.reserve(fileSize);

std::copy(std::istream_iterator<BYTE>(file),
          std::istream_iterator<BYTE>(),
          std::back_inserter(vec));

In the end, I think the best solution will avoid operator >> from istream_iterator (and all the overhead and goodness from operator >> trying to interpret binary data). But I don't know what to use that allows you to directly copy the data into the vector.
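
One candidate that sidesteps operator >> is std::istreambuf_iterator, which pulls raw characters straight from the stream buffer and so never skips whitespace. The sketch below combines it with the reserve step from above. The name readFileUnformatted is made up here, and the code assumes the file is seekable so that tellg() can report a size.

```cpp
#include <fstream>
#include <iterator>
#include <vector>

typedef unsigned char BYTE;

// Illustrative name, not from the original answer.
std::vector<BYTE> readFileUnformatted(const char* filename)
{
    std::ifstream file(filename, std::ios::binary);
    std::vector<BYTE> vec;
    if (!file)
        return vec;  // could not open: return an empty vector

    // Reserve up front when the size is known; tellg() returns -1 on failure.
    file.seekg(0, std::ios::end);
    const std::streamoff size = file.tellg();
    file.seekg(0, std::ios::beg);
    if (size > 0)
        vec.reserve(static_cast<std::vector<BYTE>::size_type>(size));

    // istreambuf_iterator bypasses operator >>, so no bytes are skipped
    // and the skipws flag does not matter.
    vec.assign(std::istreambuf_iterator<char>(file),
               std::istreambuf_iterator<char>());
    return vec;
}
```

This still copies one character at a time, so it is worth measuring against a single read() call before settling on it.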

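Finally, my testing with binary data showed whitespace bytes being dropped even with ios::binary. That flag only turns off newline translation; it does not stop operator >> from skipping whitespace. Hence the reason for clearing skipws with file.unsetf(std::ios::skipws) (the std::noskipws manipulator from <ios> does the same).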

jww
  • 1
    Is there a way to read a specific size into the array instead of the whole file as described here? – superhero Dec 07 '14 at 15:22
  • 1
    I thought you only need `file.unsetf(std::ios::skipws);` if using the operator>> – jiggunjer Jul 12 '15 at 16:10
  • I needed `file.unsetf(std::ios::skipws);` even when using `std::copy` to copy to a `vector`, otherwise I would lose data. This was with Boost 1.53.0. – phoenix Nov 27 '17 at 18:06
  • 1
    @jiggunjer `std::istream_iterator` uses `>>` operator internally to extract data from the stream. – tomi.lee.jones Jan 30 '18 at 11:10
  • Tried more than 8 snippets; none of them worked but this one. Thanks a lot! +1 –  Jun 26 '19 at 15:37
  • It seems we don't need to add file.unsetf(std::ios::skipws) under gcc 8.2.0 – leiyc Aug 12 '21 at 09:46
  • 3
    Use of vector::insert() and iterators is awfully slow, probably because of the many virtual function calls needed to read each byte. I could not even wait for it to finish reading a huge file (3 GB in my case), and this was in Release mode. By changing the last part to this I got a massive speedup: `std::vector<BYTE> vec;` `vec.resize(fileSize);` `file.read(reinterpret_cast<char*>(&vec.front()), fileSize);` – Anton Breusov Oct 30 '21 at 19:28
  • @jww Can I use the function `readFile` under an Apache 2.0 License? – Vertexwahn Aug 06 '23 at 12:20
30
#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main()
{
    std::ifstream stream("mona-lisa.raw", std::ios::in | std::ios::binary);
    std::vector<uint8_t> contents((std::istreambuf_iterator<char>(stream)), std::istreambuf_iterator<char>());

    for (auto i : contents) {
        int value = i;
        std::cout << "data: " << value << std::endl;
    }

    std::cout << "file size: " << contents.size() << std::endl;
}
neoneye
  • 1
    Note that this doesn't give any errors (no exception or anything). You just get empty `contents`. You can check for file errors with `if (!stream)` but I don't know if there is any way to check for read errors. – Timmmm Nov 17 '22 at 14:40
8

Since you are loading the entire file into memory, the most efficient approach is to map the file into memory. The kernel loads the file into the kernel page cache anyway, and by mapping the file you just expose those cached pages to your process. This is also known as zero-copy.

When you use std::vector<> it copies the data from the kernel page cache into std::vector<> which is unnecessary when you just want to read the file.

Also, when passing two input iterators to std::vector<> it grows its buffer while reading because it does not know the file size. When resizing std::vector<> to the file size first it needlessly zeroes out its contents because it is going to be overwritten with file data anyway. Both of the methods are sub-optimal in terms of space and time.

Maxim Egorushkin
  • Yes, if the content doesn't need to be in a vector, that is definitely the best method. – Mats Petersson Feb 28 '13 at 15:07
  • 1
    Rather than `resize`, use `reserve`, which doesn't initialize. – jiggunjer Jul 12 '15 at 16:25
  • meaning you can pass the iterators to a reserved vector to avoid redundant resizing. Referring to your last paragraph. – jiggunjer Jul 13 '15 at 10:14
  • 1
    @jiggunjer Well, that would not work because you cannot access the reserved capacity without resizing the vector first. – Maxim Egorushkin Jul 13 '15 at 10:39
  • I don't understand. If you use `reserve(x)` no reallocation will happen if you add less than x elements to the vector. – jiggunjer Jul 13 '15 at 12:02
  • @jiggunjer Ah, I understand now that what you are saying is already in the accepted answer... – Maxim Egorushkin Jul 13 '15 at 15:09
  • 1
    For someone reading without reference to the standard, this is unclear. It doesn't explain _how_ to map to memory - I _assume_ `streambuf` and `basic` do this?. Also, the terminology assumes Linux/UNIX is the OS used, which feels like it might not be applicable for all platforms - do the same concepts and best practices exist in all OSes targetable by C++? – underscore_d Nov 08 '15 at 18:14
  • @underscore_d The C++ standard library does not provide portable facilities for mapping files into memory. Boost does, see the Boost Interprocess library, [Memory Mapped Files](http://www.boost.org/doc/libs/1_59_0/doc/html/interprocess/sharedmemorybetweenprocesses.html#interprocess.sharedmemorybetweenprocesses.mapped_file) for more details. – Maxim Egorushkin Nov 08 '15 at 18:51
  • @underscore_d I assume Linux OS because it is free to use, learn and modify and plenty of free documentation is available. – Maxim Egorushkin Nov 08 '15 at 18:53
3

I would have thought that the first method, using the size and istream::read(), would be the most efficient. The "cost" of casting to char* is most likely zero: a cast of this kind simply tells the compiler "Hey, I know you think this is a different type, but I really want this type here...", and does not add any extra instructions. If you wish to confirm this, try reading the file into a char array and compare the generated assembly. Aside from a little extra work to find the address of the buffer inside the vector, there shouldn't be any difference.
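
For reference, a hedged sketch of that first method with the cast spelled as reinterpret_cast and with the failure cases checked. The name readFileWithRead and the choice to return an empty vector on error are assumptions of this example, not part of the question's code.

```cpp
#include <fstream>
#include <vector>

typedef unsigned char BYTE;

// Illustrative variant of the question's first method.
std::vector<BYTE> readFileWithRead(const char* filename)
{
    std::vector<BYTE> fileData;

    std::ifstream file(filename, std::ios::binary);
    if (!file)
        return fileData;  // open failed

    // tellg() returns -1 on failure, so check before using it as a size.
    file.seekg(0, std::ios::end);
    const std::streamoff fileSize = file.tellg();
    file.seekg(0, std::ios::beg);
    if (fileSize <= 0)
        return fileData;  // error or empty file

    fileData.resize(static_cast<std::vector<BYTE>::size_type>(fileSize));

    // The cast only reinterprets the pointer type; no bytes are converted.
    file.read(reinterpret_cast<char*>(fileData.data()),
              static_cast<std::streamsize>(fileSize));

    // gcount() is the number of bytes actually read; drop any unread tail.
    fileData.resize(static_cast<std::vector<BYTE>::size_type>(file.gcount()));
    return fileData;
}
```

The resize still zero-fills the buffer before read() overwrites it, which is the cost the other answers point out.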

As always, the only way to tell for sure IN YOUR CASE what is the most efficient is to measure it. "Asking on the internet" is not proof.

Mats Petersson
1

The class below extends vector with a binary file load and save. I have returned to this question multiple times already, so this is the code for my next return, and for everyone else who will be looking for a binary file save method next. :)

#include <cinttypes>
#include <fstream>
#include <iterator>
#include <vector>

// The class offers entire file content read/write in single operation
class BinaryFileVector : public std::vector<uint8_t>
{
    public:

        using std::vector<uint8_t>::vector;

        bool loadFromFile(const char *fileName) noexcept
        {
            // Try to open a file specified by its name
            std::ifstream file(fileName, std::ios::in | std::ios::binary);
            if (!file.is_open() || file.bad())
                return false;

            // Clear whitespace removal flag
            file.unsetf(std::ios::skipws);

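            // Determine size of the file (tellg() returns -1 on failure)
            file.seekg(0, std::ios_base::end);
            const std::streamoff endPos = file.tellg();
            if (endPos < 0)
                return false;
            const size_t fileSize = static_cast<size_t>(endPos);
            file.seekg(0, std::ios_base::beg);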

            // Discard previous vector content and release its memory
            clear();
            shrink_to_fit();

            // Preallocate memory to avoid unnecessary reallocations due to vector growth
            reserve(fileSize);

            // Read entire file content into preallocated vector memory
            insert(begin(),
                std::istream_iterator<uint8_t>(file),
                std::istream_iterator<uint8_t>());

            // Make sure entire content is loaded
            return size() == fileSize;
        }

        bool saveToFile(const char *fileName) const noexcept
        {
            // Write entire vector content into a file specified by its name
            std::ofstream file(fileName, std::ios::out | std::ios::binary);
            if (!file.is_open())
                return false;

            // By default streams report errors through state flags, not exceptions
            file.write(reinterpret_cast<const char *>(data()),
                       static_cast<std::streamsize>(size()));
            file.flush();

            // The stream stays good only if every byte was written
            return file.good();
        }
};

Usage example

#include <iostream>

int main()
{
    BinaryFileVector binaryFileVector;

    if (!binaryFileVector.loadFromFile("data.bin")) {
        std::cout << "Failed to read a file." << std::endl;
        return 1; // non-zero exit status signals failure
    }

    if (!binaryFileVector.saveToFile("copy.bin")) {
        std::cout << "Failed to write a file." << std::endl;
        return 1; // non-zero exit status signals failure
    }

    std::cout << "Success." << std::endl;
    return 0;
}
no one special