What would be the best (fastest) way to parse an mmap-ed file? It contains pairs of data (string int), but I cannot presume the number of spaces, tabs, or newlines between them.
- what do you mean by "parse" - what is stored in the file and what are you trying to do? This question is too vague to answer as it stands... – Nim Mar 01 '11 at 11:33
- @Martin You know that strtok modifies the "source", right? So if you use it on your mmap, you are modifying the file! Is your file text only or binary or what? What type of parsing do you want? – xanatos Mar 01 '11 at 11:33
- Voting to close. It is impossible to understand what is being asked here – Armen Tsirunyan Mar 01 '11 at 11:35
- Define _best_: is it the fastest or the easiest to code? – Maxim Egorushkin Mar 01 '11 at 11:35
5 Answers
Assuming you've mmaped the whole file in (rather than in chunks, as that would make life awfully complicated), I'd do something like the following...
// Effectively this wraps the mmaped block
std::istringstream str;
str.rdbuf()->pubsetbuf(<pointer to start of mmaped block>, <size of mmaped block>);
std::string sv;
int iv; // the second field of each pair is an integer
while(str >> sv >> iv)
{
// do stuff...
}
I think that should work...
WARNING This is implementation defined behaviour, see this answer for an altogether better approach.

- As it turns out, this is implementation-defined: http://stackoverflow.com/a/13059195/636019 – ildjarn Oct 24 '12 at 23:19
If by best/fastest you mean easiest to code, then this is one of those rare occasions where the deprecated std::istrstream fits the bill perfectly; call the istrstream::istrstream(char const*, std::streamsize) constructor overload then extract the data from the stream as you would from any other std::istream. (This won't duplicate the underlying memory like std::istringstream will.)
If by best/fastest you mean best/fastest runtime performance, I don't think you'll be able to beat boost.spirit.qi or a handwritten parser, though the former would be much easier to write and maintain in my opinion (library learning curve aside, if you've never used boost.spirit before).

- @Nim Better, yes, absolutely; "easiest to code", as was my qualification, no. But then I'd regard using boost.spirit.qi "better" than either stream-based route, personally. – ildjarn Mar 01 '11 at 12:12
- I'd go with spirit too, but one of my favourite sayings applies here, sometimes all you need is a fly-swatter.. ;) – Nim Mar 01 '11 at 12:28
Parsing string/integer pairs (i.e. foo 50 bar 20 baz 123) separated by whitespace should be lightning fast either way. By far the more important factors will be whether
a) the pages are actually in RAM, which mmap alone does not guarantee
b) cache lines are in the L1 cache
While mmap does already read ahead by default on sequential access, disk access is in the tens of milliseconds, and parsing over a 4k page of memory is (ideally) in the tens of microseconds.
So, you cannot expect the prefetcher to keep pace, especially since it will only prefetch whenever it looks like you will need more (which, even assuming seek time is zero, practically guarantees an upfront cost due to rotational delay on a mechanical disk).
Therefore, unless your total data is only a dozen kilobytes (in which case the question about how to do it as fast as possible would be pointless, anyway), it makes sense to call madvise(MADV_WILLNEED) before you start your scan, so the operating system won't wait to see its heuristics triggered by your access pattern, but reads in sequentially what it can without pause. Sequential disk bandwidth is huge once you're past the access time. You will still probably catch up with the disk, but much later. If your dataset is so large that it will probably not fit into RAM, calling madvise(MADV_DONTNEED) every now and then on data you've already seen is a good idea.
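To make the madvise advice concrete, here is a minimal sketch for a POSIX system. The names `MappedFile`, `map_for_sequential_scan`, `release_consumed`, and `unmap` are made up for illustration; only `open`, `fstat`, `mmap`, `madvise`, `munmap`, and `sysconf` are real calls. The hints are advisory, so their return values are deliberately ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper type: a read-only mapping of a whole file.
struct MappedFile {
    const char* data = nullptr;
    std::size_t size = 0;
};

// Maps `path` read-only and asks the kernel to start reading it in right away.
// Returns an empty MappedFile if the file cannot be opened, is empty, or cannot be mapped.
MappedFile map_for_sequential_scan(const char* path) {
    MappedFile result;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return result;

    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        const std::size_t len = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            // Advisory only: a failing hint does not make the mapping unusable.
            madvise(p, len, MADV_WILLNEED);
            result.data = static_cast<const char*>(p);
            result.size = len;
        }
    }
    close(fd);  // the mapping remains valid after the descriptor is closed
    return result;
}

// Tells the kernel that the whole pages within the first `consumed` bytes
// are no longer needed. Useful when the file is larger than RAM.
void release_consumed(const MappedFile& f, std::size_t consumed) {
    assert(consumed <= f.size);
    const long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return;
    const std::size_t whole = consumed - consumed % static_cast<std::size_t>(page);
    if (whole > 0) madvise(const_cast<char*>(f.data), whole, MADV_DONTNEED);
}

// Releases the mapping and resets the handle.
void unmap(MappedFile& f) {
    if (f.data != nullptr) munmap(const_cast<char*>(f.data), f.size);
    f = MappedFile{};
}
```

You would call `release_consumed` only occasionally (say, every few megabytes), since each call is a system call.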
The same that is true for page faults is true for cache misses. A load from cache is 1-2 cycles, a load from memory is something around 200-500 cycles.
CPUs have automatic prefetching for sequential access patterns, however they are limited.
First, prefetching never occurs across a page boundary. If it did, automatic prefetching would regularly trigger page faults, which would be very unpleasant.
Second, prefetching happens only after two consecutive misses. This ensures that prefetching only kicks in when it probably makes sense. Prefetching the adjacent cache lines for every random read would be stupid, as it would needlessly trash valuable cache lines.
Third, prefetching takes time, and once the heuristics in the CPU trigger, you're already racing it for the data, so sooner is better than later.
Luckily, you know what data you will be wanting, and you know it a long time ahead. Therefore, you can give prefetch hints, which will give the CPU a valuable head start (prefetch e.g. half a kilobyte ahead).
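A prefetch hint of that kind might look like the sketch below. It assumes GCC or Clang (`__builtin_prefetch` is a compiler builtin, not standard C++), a 64-byte cache line, and the half-kilobyte distance mentioned above; all three are assumptions to tune by measurement. The function name and the newline-counting workload are made up and merely stand in for the real parsing loop.

```cpp
#include <cassert>
#include <cstddef>

// Counts newline characters in [data, data + size), issuing one prefetch hint
// per cache line for the line that lies kAhead bytes in front of the cursor.
std::size_t count_newlines_prefetched(const char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    constexpr std::size_t kCacheLine = 64;  // assumed cache line size
    constexpr std::size_t kAhead = 512;     // assumed prefetch distance

    std::size_t count = 0;
    for (std::size_t base = 0; base < size; base += kCacheLine) {
        if (base + kAhead < size) {
            // Second argument 0 = prefetch for reading,
            // third argument 0 = data has little temporal locality.
            __builtin_prefetch(data + base + kAhead, 0, 0);
        }
        const std::size_t end = (base + kCacheLine < size) ? base + kCacheLine : size;
        for (std::size_t i = base; i < end; ++i) {
            count += (data[i] == '\n');
        }
    }
    return count;
}
```

The hint never changes the result, only (possibly) the timing, so it is easy to benchmark with and without it.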

As it stands, your question is too vague to be answered.
Nonetheless, if all you need to do is to get some data out of the file, what you don't want to do is to use a method that would modify memory in the mmaped region.
Edit It's much clearer now that you've edited the question. As a starting point, I'd use a single char pointer to iterate over the entire mmaped file. Extracting strings is very straightforward (the exact method depends on what you need to do with the result) and the integers can be extracted with atoi et al.
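The single-pointer scan could be sketched as follows. Note one substitution: instead of atoi, this uses C++17 `std::from_chars`, because it takes an explicit end pointer and a mapped file is not guaranteed to be NUL-terminated. The names `Pair`, `is_space`, and `parse_pairs` are made up for the example, and stopping at the first malformed pair is just one possible error policy.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>
#include <utility>
#include <vector>

using Pair = std::pair<std::string_view, int>;

// Locale-independent whitespace test covering spaces, tabs, and line breaks.
inline bool is_space(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == '\v';
}

// Parses whitespace-separated "string int" pairs from [data, data + size).
// The returned string_views point into the input, so nothing is copied or
// modified. Parsing stops at the first malformed or incomplete pair.
std::vector<Pair> parse_pairs(const char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    std::vector<Pair> out;
    const char* p = data;
    const char* const end = data + size;

    while (true) {
        while (p != end && is_space(*p)) ++p;   // skip separators before the key
        if (p == end) break;

        const char* word = p;
        while (p != end && !is_space(*p)) ++p;  // consume the key
        const std::string_view key(word, static_cast<std::size_t>(p - word));

        while (p != end && is_space(*p)) ++p;   // skip separators before the value

        int value = 0;
        const auto res = std::from_chars(p, end, value);
        if (res.ec != std::errc()) break;       // missing or out-of-range integer
        p = res.ptr;

        out.emplace_back(key, value);
    }
    return out;
}
```

Because the keys are views into the mapping, they are only valid for as long as the file stays mapped; copy them into std::string if they need to outlive it.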

You could access it via a std::string and use std::istringstream in order to read from it sequentially. Or use some more convenient library, e.g. in Qt you could use a QTextStream on a QByteArray constructed from the mmaped memory.
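For the std::string route, a minimal sketch might look like this. Be aware that constructing the std::string copies the entire mapped region, so this is the easy option rather than the fast one. The function name `parse_pairs_via_copy` is made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Copies [data, data + size) into a std::string and extracts
// whitespace-separated "string int" pairs with ordinary stream operators.
// Extraction stops at the first pair that fails to parse.
std::vector<std::pair<std::string, int>> parse_pairs_via_copy(const char* data,
                                                               std::size_t size) {
    assert(data != nullptr || size == 0);
    std::string copy;
    if (size > 0) copy.assign(data, size);  // this is the copy of the mapped bytes

    std::istringstream in(copy);
    std::vector<std::pair<std::string, int>> out;
    std::string key;
    int value = 0;
    while (in >> key >> value) {  // operator>> skips any run of whitespace
        out.emplace_back(key, value);
    }
    return out;
}
```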

- Yes, I thought of that but wasn't able to mmap it to a std::string. How could I do that? – Marin Mar 01 '11 at 11:39
- Sorry, I thought the `string::string ( const char * s, size_t n );` constructor would just reference. But obviously it does a copy (`const char *`...). `QByteArray::fromRawData()` however allows you to operate on an existing chunk of memory. However, you need to guarantee lifetime of that chunk during the lifetime of that `QByteArray`. – Tilman Vogel Mar 01 '11 at 13:22