parsing a mmap()-ed file

Question

What would be the best (fastest) way to parse a mmap-ed file? It contains pairs of data (string int), but I cannot presume the number of whitespace characters (spaces, tabs, newlines) between them.

what do you mean by "parse" - what is stored in the file and what are you trying to do? This question is too vague to answer as it stands... — Nim, Mar 01 '11 at 11:33
@Martin You know that strtok modifies the "source", right? So if you use it on your mmap, you are modifying the file! Is your file text only or binary or what? What type of parsing do you want? — xanatos, Mar 01 '11 at 11:33
Voting to close. It is impossible to understand what is being asked here — Armen Tsirunyan, Mar 01 '11 at 11:35

Nim · Accepted Answer · 2012-10-25T13:05:03.780

3

Assuming you've mmaped the whole file (rather than chunks, as that would make life awfully complicated), I'd do something like the following...

#include <sstream>
#include <string>

// Effectively this wraps the mmaped block
std::istringstream str;
str.rdbuf()->pubsetbuf(<pointer to start of mmaped block>, <size of mmaped block>);

std::string sv; // string half of each pair
int iv;         // integer half of each pair

while (str >> sv >> iv)
{
  // do stuff...
}

I think that should work...

WARNING: This is implementation-defined behaviour (the standard does not specify what pubsetbuf does with a non-null buffer on a std::stringbuf); see this answer for an altogether better approach.
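One portable way to get the same effect without relying on pubsetbuf is a small streambuf subclass that points its get area at the mapped memory. The following is only a sketch and not something taken from the answers here: MemoryBuf and read_pairs are hypothetical names, and the code assumes the mapping is only read and outlives the stream.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <streambuf>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: a read-only streambuf whose get area is the given
// memory range. Nothing is copied; the range must outlive the streambuf.
class MemoryBuf : public std::streambuf {
public:
    MemoryBuf(const char* data, std::size_t size) {
        assert(data != nullptr || size == 0);
        // setg() takes non-const pointers; this class never writes through them.
        char* p = const_cast<char*>(data);
        setg(p, p, p + size);
    }
};

// Reads whitespace-separated (string, int) pairs until extraction fails.
std::vector<std::pair<std::string, int>> read_pairs(const char* data, std::size_t size) {
    MemoryBuf buf(data, size);
    std::istream in(&buf);
    std::vector<std::pair<std::string, int>> out;
    std::string name;
    int value;
    while (in >> name >> value) {
        out.emplace_back(name, value);
    }
    return out;
}
```

Because operator>> skips any run of whitespace, the number of spaces, tabs or newlines between the fields does not matter.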

edited Oct 25 '12 at 13:05

answered Mar 01 '11 at 12:00

Nim


As I recall, `std::basic_streambuf::setbuf` is protected. – ildjarn Mar 01 '11 at 12:06
As it turns out, this is implementation-defined: http://stackoverflow.com/a/13059195/636019 – ildjarn Oct 24 '12 at 23:19

ildjarn · Answer 2 · 2012-10-25T01:39:02.233

2

If by best/fastest you mean easiest to code, then this is one of those rare occasions where the deprecated std::istrstream fits the bill perfectly: call the istrstream::istrstream(char const*, std::streamsize) constructor overload, then extract the data from the stream as you would from any other std::istream. (This won't duplicate the underlying memory like std::istringstream will.)
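A minimal sketch of the istrstream route, assuming the mapping is available as a pointer and a length. read_pairs_strstream is a hypothetical name, and since <strstream> is deprecated, expect a compiler warning when including it.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <strstream>  // deprecated header; compilers typically warn about it
#include <utility>
#include <vector>

// Reads whitespace-separated (string, int) pairs straight from the buffer.
std::vector<std::pair<std::string, int>> read_pairs_strstream(const char* data, std::size_t size) {
    assert(size <= static_cast<std::size_t>(std::numeric_limits<std::streamsize>::max()));
    std::vector<std::pair<std::string, int>> out;
    // A length of 0 makes istrstream call strlen() on the buffer, which is
    // unsafe for memory that is not NUL-terminated, so bail out early.
    if (data == nullptr || size == 0) return out;
    std::istrstream in(data, static_cast<std::streamsize>(size));
    std::string name;
    int value;
    while (in >> name >> value) {
        out.emplace_back(name, value);
    }
    return out;
}
```

The explicit length matters here: a mapped file is not guaranteed to end in a NUL byte, so the single-argument constructor (which relies on strlen) is not a safe choice.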

If by best/fastest you mean best/fastest runtime performance, I don't think you'll be able to beat boost.spirit.qi or a handwritten parser, though the former would be much easier to write and maintain in my opinion (library learning curve aside, if you've never used boost.spirit before).

edited Oct 25 '12 at 01:39

answered Mar 01 '11 at 11:47

ildjarn


@Nim When it fits, it fits. :-] – ildjarn Mar 01 '11 at 12:01
@Nim Better, yes, absolutely; "easiest to code", as was my qualification, no. But then I'd regard using boost.spirit.qi "better" than either stream-based route, personally. – ildjarn Mar 01 '11 at 12:12
I'd go with spirit too, but one of my favourite sayings applies here, sometimes all you need is a fly-swatter.. ;) – Nim Mar 01 '11 at 12:28

score 2 · Answer 3 · answered Mar 01 '11 at 13:05

Parsing string/integer pairs (e.g. foo 50 bar 20 baz 123) separated by whitespace should be lightning fast either way. Far more important is whether
a) the pages are actually in RAM, which mmap alone does not guarantee
b) the cache lines are in the L1 cache

While mmap does already read ahead by default on sequential access, disk access takes tens of milliseconds, whereas parsing a 4k page of memory takes (ideally) tens of microseconds.
So you cannot expect the prefetcher to keep pace, especially since it will only prefetch whenever it looks like you will need more (which, even assuming seek time is zero, practically guarantees an upfront cost due to rotational delay on a mechanical disk).
Therefore, unless your total data is only a dozen kilobytes (in which case asking how to do it as fast as possible would be pointless anyway), it makes sense to call madvise(MADV_WILLNEED) before you start your scan, so the operating system doesn't wait for its heuristics to be triggered by your access pattern but sequentially reads in what it can without pause. Sequential disk bandwidth is huge once you're past the access time. You will probably still catch up with the disk, but much later. If your dataset is so large that it probably won't fit into RAM, calling madvise(MADV_DONTNEED) every now and then on data you've already seen is a good idea.
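The madvise calls described above might look roughly like this on Linux or a similar Unix. This is a sketch under assumptions: Mapping, map_with_readahead, release_consumed and unmap are hypothetical names, the whole file is mapped read-only in one piece, and error handling is kept to a minimum.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct Mapping {
    const char* data = nullptr;
    std::size_t size = 0;
};

// Maps `path` read-only and asks the kernel to start reading it in right away.
// Returns an empty Mapping on failure or for an empty file.
Mapping map_with_readahead(const char* path) {
    Mapping m;
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return m;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        const std::size_t len = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            m.data = static_cast<const char*>(p);
            m.size = len;
            // Advisory only: if this fails we lose speed, not correctness.
            madvise(p, len, MADV_WILLNEED);
        }
    }
    close(fd);  // the mapping stays valid after the descriptor is closed
    return m;
}

// Tells the kernel the first `bytes_done` bytes are no longer needed.
// The length is rounded down to whole pages so a partly read page is kept.
void release_consumed(const Mapping& m, std::size_t bytes_done) {
    assert(bytes_done <= m.size);
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t len = (bytes_done / page) * page;
    if (len > 0) madvise(const_cast<char*>(m.data), len, MADV_DONTNEED);
}

void unmap(Mapping& m) {
    if (m.data != nullptr) munmap(const_cast<char*>(m.data), m.size);
    m = Mapping{};
}
```

A parser would call map_with_readahead once, scan m.data, call release_consumed occasionally on very large inputs, and finish with unmap.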

What is true for page faults is also true for cache misses. A load from L1 cache takes a few cycles; a load from memory takes something around 200-500 cycles.
CPUs have automatic prefetching for sequential access patterns, but it is limited.
First, prefetching generally does not occur across a page boundary. If it did, automatic prefetching would regularly trigger page faults, which would be highly undesirable.
Second, prefetching happens only after two consecutive misses, which ensures that it only kicks in when it probably makes sense. Prefetching the adjacent cache lines for every random read would be stupid, as it would needlessly evict valuable cache lines.
Third, prefetching takes time, and once the heuristics in the CPU trigger, you're already racing it for the data, so sooner is better than later.
Luckily, you know what data you will want, and you know it a long time ahead. You can therefore give prefetch hints, which give the CPU a valuable head start (prefetch e.g. half a kilobyte ahead).
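As an illustration of such a prefetch hint, here is a sketch that scans a buffer while prefetching half a kilobyte ahead. It assumes GCC or Clang (for __builtin_prefetch) and a 64-byte cache line; count_tokens is a hypothetical stand-in for the real parsing loop, and whether the hint helps at all has to be measured on the target machine.

```cpp
#include <cassert>
#include <cstddef>

// Counts whitespace-separated tokens, issuing one prefetch hint per cache line.
std::size_t count_tokens(const char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    constexpr std::size_t kLine = 64;    // assumed cache-line size
    constexpr std::size_t kAhead = 512;  // prefetch distance: half a kilobyte
    std::size_t tokens = 0;
    bool in_token = false;
    for (std::size_t i = 0; i < size; ++i) {
        if (i % kLine == 0 && i + kAhead < size) {
            // Arguments: address, 0 = read access, 0 = data is used only once.
            __builtin_prefetch(data + i + kAhead, 0, 0);
        }
        const char c = data[i];
        const bool space = (c == ' ' || c == '\t' || c == '\n' || c == '\r');
        if (!space && !in_token) ++tokens;  // first character of a new token
        in_token = !space;
    }
    return tokens;
}
```

The bounds check keeps the hinted address inside the buffer, so the loop never forms a pointer past the end of the mapping.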

NPE · Answer 4 · 2011-03-01T11:44:17.760

As it stands, your question is too vague to be answered.

Nonetheless, if all you need to do is to get some data out of the file, what you don't want to do is to use a method that would modify memory in the mmaped region.

Edit: It's much clearer now that you've edited the question. As a starting point, I'd use a single char pointer to iterate over the entire mmaped file. Extracting the strings is very straightforward (the exact method depends on what you need to do with the result), and the integers can be extracted with atoi et al.
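A sketch of that single-pointer scan, assuming C++17. It swaps atoi for std::from_chars, because atoi expects a NUL-terminated string and a mapped file is not guaranteed to end in one; parse_pairs and is_space are hypothetical names.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>
#include <utility>
#include <vector>

namespace {
bool is_space(char c) { return c == ' ' || c == '\t' || c == '\n' || c == '\r'; }
}

// Walks [data, data + size) once. Each key is a string_view into the mapping
// (no copy), so the mapping must outlive the result. Stops at the first
// missing or malformed integer.
std::vector<std::pair<std::string_view, int>> parse_pairs(const char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    std::vector<std::pair<std::string_view, int>> out;
    const char* p = data;
    const char* const end = data + size;
    while (true) {
        while (p != end && is_space(*p)) ++p;   // skip separators
        if (p == end) break;
        const char* const key = p;
        while (p != end && !is_space(*p)) ++p;  // consume the string
        const std::string_view name(key, static_cast<std::size_t>(p - key));
        while (p != end && is_space(*p)) ++p;   // skip separators
        int value = 0;
        const auto res = std::from_chars(p, end, value);
        if (res.ec != std::errc()) break;       // no integer, or out of range
        out.emplace_back(name, value);
        p = res.ptr;
    }
    return out;
}
```

Nothing in the mapped region is modified, and std::from_chars never reads past end. It accepts a leading minus sign but not a leading plus.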

score 0 · Answer 5 · answered Mar 01 '11 at 11:37

You could access it via a std::string and use std::istringstream in order to read from it sequentially. Or use some more convenient library, e.g. in Qt you could use a QTextStream on a QByteArray constructed from the mmaped memory.

Tilman Vogel


Yes, I thought of that but wasn't able to mmap it to a std::string. How could I do that? – Marin Mar 01 '11 at 11:39
Sorry, I thought the `string::string ( const char * s, size_t n );` constructor would just reference. But obviously it does a copy (`const char *`...). `QByteArray::fromRawData()` however allows you to operate on an existing chunk of memory. However, you need to guarantee lifetime of that chunk during the lifetime of that `QByteArray`. – Tilman Vogel Mar 01 '11 at 13:22
