6

http://insanecoding.blogspot.co.uk/2011/11/how-to-read-in-file-in-c.html reviews a number of ways of reading an entire file into a string in C++. The key code for the fastest option looks like this:

std::string contents;
in.seekg(0, std::ios::end);
contents.resize(in.tellg());
in.seekg(0, std::ios::beg);
in.read(&contents[0], contents.size());

Unfortunately, this is not safe as it relies on the string being implemented in a particular way. If, for example, the implementation was sharing strings then modifying the data at &contents[0] could affect strings other than the one being read. (More generally, there's no guarantee that this won't trash arbitrary memory -- it's unlikely to happen in practice, but it's not good practice to rely on that.)

C++ and the STL are designed to provide features that are as efficient as C, so one would expect there to be a version of the above that is just as fast but guaranteed to be safe.

In the case of vector<T>, there are member functions that give direct access to the raw data:

T* vector::data();
const T* vector::data() const; 

The first of these can be used to read a vector<T> efficiently. Unfortunately, the string equivalent only provides the const variant:

const char* string::data() const noexcept;

So this cannot be used to read a string efficiently. (Presumably the non-const variant is omitted to support the shared string implementation.)
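For comparison, here is a minimal sketch of the vector<char> version that the non-const data() makes possible. The helper name read_file_to_vector and its error handling are my own illustration, not code from the linked article.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reads a whole file into a std::vector<char>.
std::vector<char> read_file_to_vector(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    // Determine the file size by seeking to the end.
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    if (size < 0) throw std::runtime_error("cannot determine size of " + path);
    in.seekg(0, std::ios::beg);

    std::vector<char> contents(static_cast<std::size_t>(size));
    // The non-const data() returns a writable char*, so the stream can
    // read straight into the vector's storage.
    in.read(contents.data(), static_cast<std::streamsize>(contents.size()));
    if (!in) throw std::runtime_error("short read on " + path);
    return contents;
}
```

The vector is value-initialised before the read, so every byte is written twice; that is the same cost the string version pays in resize().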

I have also checked the string constructors, but the ones that accept a char* copy the data -- there's no option to move it.

Is there a safe and fast way of reading the whole contents of a file into a string?

It may be worth noting that I want to read a string rather than a vector<char> so that I can access the resulting data using an istringstream. There's no equivalent of that for vector<char>.

Mohan
  • 8
    *"[If] the implementation was sharing strings"* That's not legal as of C++11. – Baum mit Augen Sep 07 '16 at 22:17
  • 3
    *"Unfortunately, the string equivalent only provides the const variant:"* That will be fixed in C++17, see http://en.cppreference.com/w/cpp/string/basic_string/data – Baum mit Augen Sep 07 '16 at 22:19
  • 2
    C++ 17 will have a non const std::string member `CharT* data();` –  Sep 07 '16 at 22:19
  • @BaummitAugen: that's good to know. Is there actually something in the C++11 standard which guarantees that the `&contents[0]` (or better, `const_cast` on `data()`) is safe? – Mohan Sep 07 '16 at 22:20
  • Actually, it notes: "Modifying the character array accessed through data has undefined behavior." So my question stands for C++11... – Mohan Sep 07 '16 at 22:23
  • 1
    Can't find it right now but for 99.999999999999% of the time `&contents[0]` will work until C++17. Looking for the thread – NathanOliver Sep 07 '16 at 22:24
  • 1
    @Mohan That's somewhat murky water because `std::string` gets changed all the time, but iirc the `const_cast` is technically UB but the `&contents[0]` is fine. I'm sure this is answered somewhere on SO in detail. – Baum mit Augen Sep 07 '16 at 22:24
  • 1
    C++ filebufs are implemented in [terms of the C library](http://en.cppreference.com/w/cpp/io/basic_filebuf/seekoff), where seeking to the end of a binary stream has undefined behaviour. I've never seen it not work correctly, but caveat emptor. – user657267 Sep 07 '16 at 22:25
  • 2
    "If, for example, the implementation was sharing strings then modifying the data at &contents[0] could affect strings other than the one being read." -- this is 100% false. If that happens, it's a bug in the C++ library. – Sam Varshavchik Sep 07 '16 at 22:47
  • 1
    @Mohan: "*Is there actually something in the C++11 standard which guarantees that the &contents[0] (or better, const_cast on data()) is safe?*" Yes. `std::basic_string` is *required* to be a contiguously allocated array of characters. From C++14: [string.require]/3: "The char-like objects in a `basic_string` object shall be stored contiguously." – Nicol Bolas Sep 07 '16 at 23:24
  • @Mohan: "*C++ and the STL are designed to provide features that are efficient as C*" Which is too bad, since `iostream` was *not* designed by the same people who designed the STL. – Nicol Bolas Sep 07 '16 at 23:32
  • If you are looking for high performance reading of text files, you should use `mmap`. Check this question: http://stackoverflow.com/questions/17925051/fast-textfile-reading-in-c/17925143#17925143 – mvp Sep 07 '16 at 23:33
  • mvp: I had considered it, but as it reduces portability I will hold off unless profiling suggests it's really necessary. Also, mmapping is indeed faster than reading the whole file into memory, but there is no guarantee that the subsequent access will be fast, which is actually more important for the application in question. – Mohan Sep 07 '16 at 23:48
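To summarise the comments above in code: since C++11 requires contiguous storage for std::basic_string, writing through &contents[0] for size() characters is well-defined. The following is a sketch under that assumption; the function name read_file_to_string and the error handling are hypothetical additions.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads a whole file into a std::string (C++11 or later).
std::string read_file_to_string(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    if (size < 0) throw std::runtime_error("cannot determine size of " + path);
    in.seekg(0, std::ios::beg);

    std::string contents(static_cast<std::size_t>(size), '\0');
    if (!contents.empty()) {
        // Contiguity is guaranteed from C++11 on, so &contents[0] points
        // at a writable array of contents.size() characters.
        in.read(&contents[0], static_cast<std::streamsize>(contents.size()));
        if (!in) throw std::runtime_error("short read on " + path);
    }
    return contents;
}
```

With C++17 the read call could use contents.data() instead of &contents[0]. The caveat raised above about seeking to the end of a binary stream still applies to this sketch.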

2 Answers

2

If you really want to avoid copies, you can slurp the file into a std::vector<char>, and then roll your own std::basic_stringbuf to pull data from the vector.

You can then declare a std::istringstream and use std::basic_ios::rdbuf to replace its input buffer with your own.

The caveat is that if you choose to call istringstream::str it will invoke std::basic_stringbuf::str and will require a copy. But then, it sounds like you won't be needing that function, and can actually stub it out.
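A rough sketch of such a buffer follows. Note two deviations that are my own choices: it derives from std::streambuf rather than std::basic_stringbuf, and it is meant to be attached to a plain std::istream instead of being swapped into an istringstream. The class name vector_streambuf is hypothetical.

```cpp
#include <ios>
#include <istream>
#include <streambuf>
#include <utility>
#include <vector>

// Hypothetical read-only stream buffer that owns a std::vector<char> and
// exposes it to an std::istream without copying the bytes again.
class vector_streambuf : public std::streambuf {
public:
    explicit vector_streambuf(std::vector<char> data) : data_(std::move(data)) {
        char* begin = data_.data();
        setg(begin, begin, begin + data_.size());  // whole vector is the get area
    }
    // The get-area pointers refer to data_, so copying would be unsafe.
    vector_streambuf(const vector_streambuf&) = delete;
    vector_streambuf& operator=(const vector_streambuf&) = delete;

protected:
    // Supports seekg()/tellg() so the data can be accessed randomly.
    pos_type seekoff(off_type off, std::ios_base::seekdir dir,
                     std::ios_base::openmode which) override {
        if (!(which & std::ios_base::in)) return pos_type(off_type(-1));
        off_type base = 0;
        if (dir == std::ios_base::cur) base = gptr() - eback();
        else if (dir == std::ios_base::end) base = egptr() - eback();
        const off_type target = base + off;
        if (target < 0 || target > egptr() - eback()) return pos_type(off_type(-1));
        setg(eback(), eback() + target, egptr());
        return pos_type(target);
    }
    pos_type seekpos(pos_type pos, std::ios_base::openmode which) override {
        return seekoff(off_type(pos), std::ios_base::beg, which);
    }

private:
    std::vector<char> data_;
};
```

Usage would be along the lines of `vector_streambuf buf(std::move(bytes)); std::istream in(&buf);`, after which `in` can be passed anywhere an `std::istream&` is expected. The buffer must outlive the stream.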

Whether you get better performance this way would require actual measurement. But at least you avoid having to have two large contiguous memory blocks during the copy. Additionally, you could use something like std::deque as your underlying structure if you want to cope with truly huge files that cannot be allocated in contiguous memory.

It's also worth mentioning that if you're really just streaming that data you are essentially double-buffering by reading it into a string first. Unless you also require the contents in memory for some other purpose, the buffering inside std::ifstream is likely to be sufficient. If you do slurp the file, you may get a boost by turning buffering off.
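As an illustration of turning buffering off, the sketch below calls pubsetbuf(nullptr, 0) before the file is opened. Whether this actually disables buffering, and whether it helps, depends on the standard library in use, so treat it as an experiment to measure rather than a guaranteed win. The function name slurp_unbuffered is hypothetical.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: slurps a file after asking the filebuf to be unbuffered.
std::string slurp_unbuffered(const std::string& path) {
    std::ifstream in;
    // Request an unbuffered stream. This must happen before any I/O; on some
    // implementations it only takes effect if done before open().
    in.rdbuf()->pubsetbuf(nullptr, 0);
    in.open(path, std::ios::in | std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    if (size < 0) throw std::runtime_error("cannot determine size of " + path);
    in.seekg(0, std::ios::beg);

    std::string contents(static_cast<std::size_t>(size), '\0');
    if (!contents.empty()) {
        in.read(&contents[0], static_cast<std::streamsize>(contents.size()));
        if (!in) throw std::runtime_error("short read on " + path);
    }
    return contents;
}
```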

paddy
  • It's a few megabytes of memory which will subsequently be accessed randomly, so the buffer inside the `ifstream` is definitely not going to do the trick. Avoiding copying is important because the memory footprint needs to be kept down. I may indeed roll my own `std::basic_stringbuf` or use the one provided by boost iostreams. – Mohan Sep 07 '16 at 23:50
1

I think using &string[0] is just fine, and it should work with the widely used standard library implementations (even if it is technically UB).

But since you mention that you want to put the data into an istringstream, here's an alternative:

  1. Read the data into a char array (new char[in.tellg()])
  2. Construct a stringstream (without the leading 'i')
  3. Insert the data with stringstream::write

The istringstream would have to copy the data anyway, because as far as I'm aware a std::stringstream doesn't store a std::string internally. So you can skip the std::string and put the data into the stream directly.

EDIT: Actually, instead of the manual allocation (or make_unique), this way you could also use the vector<char> you mentioned.
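The steps above might be sketched like this, using a vector<char> as the intermediate buffer. The helper name fill_stream is made up for the example; the bytes are still copied once, into the stream's own buffer.

```cpp
#include <ios>
#include <sstream>
#include <vector>

// Hypothetical helper: copies raw bytes into a std::stringstream so they
// can be read back through the usual stream interface.
void fill_stream(std::stringstream& out, const std::vector<char>& bytes) {
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}
```

Because a stringstream keeps separate read and write positions, the data can be read from the beginning right after the write without calling seekg.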

Nicol Bolas
mooware
  • "The istringstream would have to copy the data anyway". What a pain -- I assumed it was efficient. Apparently "Boost.IOStreams has a stream that works like a stringstream, but wraps a native array, so you avoid having to copy the data." (http://stackoverflow.com/a/1448504/1908650) – Mohan Sep 07 '16 at 22:59
  • I forgot to mention the other alternative I was thinking of: why not just read the file with std::ifstream in the first place? – mooware Sep 07 '16 at 23:00
  • About the copying: apparently a std::stringbuf could store a std::string, but e.g. the MSVC STL doesn't as far as I remember. And it wouldn't help much anyway, since the istringstream ctor and str() method don't take rvalue refs as far as I can see. – mooware Sep 07 '16 at 23:03
  • Are you asking why I am reading the file into a string and then using an `istringstream`? I need a transparent interface that can't tell whether it's accessing data in memory or on disk. The fact that `istringstream` and `ifstream` both derive from `istream` seemed to provide a convenient way of achieving this. But it's no use if the `istringstream` is copying the entire file. – Mohan Sep 07 '16 at 23:03
  • I think I still don't understand: If you're using the `std::istream` interface, then you could use a `std::ifstream` instance. – mooware Sep 07 '16 at 23:06
  • A certain body of code has to perform a lot of random access on some binary data. Users want the option to have that data stored either on disk (to keep memory footprint down) or in memory (for speed). The choice is made at runtime. So the code in question needs to be written in a way that either accesses data on disk or accesses data in memory. Data on disk can be accessed using an `ifstream`; data in memory (specifically a string) can be accessed using an `istringstream`. So if the code accepts an `istream&`, then it can handle either data on disk or data in memory. – Mohan Sep 07 '16 at 23:10