17

I am trying to map a file to memory and then parse it line by line. Is istream what I should be using?

Is istream the same as mapping a file to memory on Windows? I have had difficulties trying to find a complete example of mapping a file into memory.

I have seen people link memory mapping articles from MSDN, but if anybody could recommend a small (~15 line?) example I would be most thankful.

I must be searching for the wrong thing, but when searching "C++ memory mapping example" on Google, I could not find an example that included iterating through.

These were the closest results (just so people realize I have looked):

Scott Smith
user997112
  • 4
    "*I am trying to map a file to memory and **then** parse line by line*". Can you tell us why you want to memory-map the file? Why isn't parsing line-by-line (using, say, `ifstream` or `fopen`) sufficient? – Robᵩ May 31 '12 at 19:21
  • 3
    @Rob, purely for performance reasons. I was under the (false?) impression it's faster mapping the whole file? – user997112 May 31 '12 at 19:43
  • 3
    @user997112 : It depends on what you do with the data. If you're using it for a proper parser that implements backtracking, then memory mapped files are uncontestedly faster; but if you're just iterating forwards through the data (as with multiple simple `std::getline` calls), I doubt there will be any noticeable difference. Certainly there's no _harm_ in using a memory mapped file, though, unless you're low on virtual address space (probably only an issue in 32-bit code with GB+ size files). – ildjarn May 31 '12 at 19:53
  • `mmap`-ing the file could be *slightly* faster. BTW, on Linux, [fopen(3)](http://man7.org/linux/man-pages/man3/fopen.3.html) knows about the `m` mode modifier to map the file. However, are you sure it is worth the trouble? Did you benchmark? I guess you'll win only a few percent of performance. Does that matter to you? – Basile Starynkevitch Oct 21 '14 at 06:51

3 Answers

15

std::istream is only an interface over a stream buffer, so it cannot read mapped memory on its own. Give it a custom array-backed streambuf and derive a stream class that owns that buffer:

#include <cstddef>
#include <string>
#include <streambuf>
#include <istream>

template<typename CharT, typename TraitsT = std::char_traits<CharT>>
struct basic_membuf : std::basic_streambuf<CharT, TraitsT> {
    basic_membuf(CharT const* const buf, std::size_t const size) {
        CharT* const p = const_cast<CharT*>(buf);
        this->setg(p, p, p + size);
    }

    //...
};

template<typename CharT, typename TraitsT = std::char_traits<CharT>>
struct basic_imemstream
: virtual basic_membuf<CharT, TraitsT>, std::basic_istream<CharT, TraitsT> {
    basic_imemstream(CharT const* const buf, std::size_t const size)
    : basic_membuf<CharT, TraitsT>(buf, size),
      std::basic_istream<CharT, TraitsT>(static_cast<std::basic_streambuf<CharT, TraitsT>*>(this))
    { }

    //...
};

using imemstream = basic_imemstream<char>;

char const* const mmaped_data = /*...*/;
std::size_t const mmap_size = /*...*/;
imemstream s(mmaped_data, mmap_size);
// s now uses the memory mapped data as its underlying buffer.
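As a rough illustration of how such a buffer is consumed line by line, here is a minimal, non-template sketch. The names `membuf` and `read_lines` are hypothetical and not part of the answer's code. It assumes the data is plain `char` text and strips a trailing `'\r'`, because mapped bytes do not go through text-mode newline translation.

```cpp
#include <cstddef>
#include <istream>
#include <streambuf>
#include <string>
#include <vector>

// Read-only stream buffer over an existing character array (no copy is made).
struct membuf : std::streambuf {
    membuf(char const* buf, std::size_t size) {
        // setg() takes char*, but the get area is only ever read from.
        char* p = const_cast<char*>(buf);
        setg(p, p, p + size);
    }
};

// Hypothetical helper: split a block of memory into lines with std::getline.
std::vector<std::string> read_lines(char const* data, std::size_t size) {
    membuf buf(data, size);
    std::istream in(&buf);  // a plain istream reading from our buffer
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        // Windows line endings arrive as "\r\n"; drop the '\r' by hand.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        lines.push_back(line);
    }
    return lines;
}
```

With a mapped region you would pass its address and size, for example the values returned by `mapped_rgn.get_address()` and `mapped_rgn.get_size()`.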

As for the memory-mapping itself, I recommend using Boost.Interprocess for this purpose:

#include <cstddef>
#include <string>
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>

namespace bip = boost::interprocess;

//...

std::string filename = /*...*/;
bip::file_mapping mapping(filename.c_str(), bip::read_only);
bip::mapped_region mapped_rgn(mapping, bip::read_only);
char const* const mmaped_data = static_cast<char const*>(mapped_rgn.get_address());
std::size_t const mmap_size = mapped_rgn.get_size();

Code for imemstream taken from this answer by Dietmar Kühl.

ildjarn
  • 2
    where does the input from mmaped_data come from? We need a reference to the file I presume? – user997112 May 31 '12 at 19:15
  • 1
    @user997112 : That depends on what platform you're on -- standard C++ does not provide memory mapped files. On *nix, there's `mmap`; on Windows, there's `CreateFileMapping`. Personally, I use [Boost.Interprocess](http://www.boost.org/libs/interprocess/)'s memory mapped files, as they're cross-platform; I'll edit in an example for that. – ildjarn May 31 '12 at 19:20
  • Thank you, appreciate it greatly – user997112 May 31 '12 at 19:29
  • @user997112 - If you are mapping a text file on Windows, you may need to deal with `'\r'` in the istringstream manually. – Robᵩ May 31 '12 at 19:31
  • Will I be iterating byte by byte in this example? – user997112 May 31 '12 at 19:31
  • @Robᵩ : Changing the underlying buffer does not affect the way the stream treats whitespace; this doesn't require treating the stream any differently than if it were a standard `ifstream`. – ildjarn May 31 '12 at 19:32
  • @user997112 : I'm not sure what you mean; if you want line-by-line parsing, use `std::getline()` as you normally would. – ildjarn May 31 '12 at 19:33
  • So I would call getline(s, my_string) to get the first line of the file? – user997112 May 31 '12 at 19:38
  • @user997112 : Yes, and call it again to get the second line, etc. Generally the pattern is `std::string line; while (std::getline(s, line)) { /*parse line*/ }`. – ildjarn May 31 '12 at 19:39
  • getline(s,my_string) is just producing all zeros when my file contains text :s I have checked the file name and everything is correct. – user997112 May 31 '12 at 19:44
  • @user997112 : Sorry, I can't help further without a full self-contained repro. See [SSCCE](http://sscce.org/). – ildjarn May 31 '12 at 19:45
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/12002/discussion-between-user997112-and-ildjarn) – user997112 May 31 '12 at 19:47
  • 1
    @user997112 `cout << getline(s,line) << endl; ` (from chat) is incorrect. `getline` does *not* return the string that was read in. Try `while(std::getline(s, line)) { std::cout << line << "\n"; }` – Robᵩ May 31 '12 at 20:02
  • that's silly having to depend on boost, there are single header libraries if you just need memory mapped files – Konrad Mar 02 '19 at 12:00
  • 1
    @Konrad : You're certainly entitled to your opinion, but I don't see how that contributes to this answer. I kept the mmapping code entirely separate from the meat of the answer for a reason: use whatever mmap implementation you want. As for "having to depend on boost" being a negative thing somehow, well, that's pretty stupid in my opinion. – ildjarn Oct 10 '19 at 01:41
3

In principle, reading a file sequentially is not sped up by memory-mapping it or by reading it into memory first. Memory-mapped files make sense when reading the file sequentially is not feasible. Pre-caching the file as in the other answer, or simply copying it into one large string to process by other means, likewise only makes sense if a single sequential pass is not feasible and you have the RAM for it. The reason is that the slowest part of the operation is getting the data off the disk, and that has to happen regardless: whether you copy the file to RAM yourself, let the operating system map the data before you access it, or let std::ifstream read it line by line while buffering just enough of the file to make this work smoothly.

In practice you could eliminate some RAM-to-RAM copying with the mapped or cached versions by making shallow copies of the buffer ranges. This still will not change much, because RAM->RAM copying is negligible compared to disk->RAM.
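One way to picture such "shallow copies" is to hand out views into the buffer instead of copying each line into its own string. This is a sketch only, assuming C++17 for `std::string_view`. The function name `split_lines` is hypothetical, and the views are valid only while the underlying buffer (for example the mapping) stays alive.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: return one view per line; no character data is copied.
std::vector<std::string_view> split_lines(std::string_view buffer) {
    std::vector<std::string_view> lines;
    while (!buffer.empty()) {
        std::size_t const eol = buffer.find('\n');
        // substr clamps the count, so npos means "everything that is left".
        std::string_view line = buffer.substr(0, eol);
        if (!line.empty() && line.back() == '\r') line.remove_suffix(1);
        lines.push_back(line);
        if (eol == std::string_view::npos) break;  // last line had no newline
        buffer.remove_prefix(eol + 1);
    }
    return lines;
}
```

Each returned view points straight into the original memory, so the per-line cost is a pointer and a length rather than an allocation.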

The best advice in a situation like yours is therefore not to worry too much and just use std::ifstream.
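For reference, the plain streaming approach looks roughly like this. It is a minimal sketch with a hypothetical helper name, `count_lines`. A real parser would do its work inside the loop instead of just counting.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: stream a text file line by line and count the lines.
// Returns 0 if the file cannot be opened.
std::size_t count_lines(std::string const& path) {
    std::ifstream in(path);
    std::size_t count = 0;
    std::string line;
    while (std::getline(in, line)) {
        ++count;  // parse `line` here
    }
    return count;
}
```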

[This answer is for archival purposes, because the correct answer is buried in the comments]

Wolfgang Brehm
1

Is istream the same as mapping a file to memory on Windows?

Not exactly. They are not the same, in the same sense that a "stream" is not a "file".

Think of a file as a stored sequence, and of a stream as the interface to the "channel" (a stream buffer) through which that sequence flows on its way from storage to the receiving variables.

Think of a memory-mapped file as a "file" that, instead of being stored outside the processing unit, is kept in sync in memory. It has the advantage of being visible as a raw memory buffer while still being a file. If you want to read it as a stream, the simplest way is probably to use an istringstream that takes that raw buffer as its source.
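A minimal sketch of that idea follows, with the hypothetical helper name `parse_ints`. Note that `std::istringstream` works on its own copy of the characters, so the buffer is duplicated in memory. That is an assumption you accept in exchange for simplicity.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parse whitespace-separated integers from a raw buffer.
std::vector<int> parse_ints(char const* data, std::size_t size) {
    // The string constructor copies the bytes; the stream then reads the copy.
    std::istringstream in{std::string(data, size)};
    std::vector<int> values;
    int value;
    while (in >> value) values.push_back(value);
    return values;
}
```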

Emilio Garavaglia
  • To be honest I would rather read the whole file in at once, as opposed to a stream – user997112 May 31 '12 at 19:39
  • @user997112: It depends on what you're gonna do with its content. If the file is text and you want to read numbers, you have to parse it somehow. The std::istream (and derived) are just that parser. – Emilio Garavaglia Jun 01 '12 at 06:13