
Possible Duplicate:
What is the Fastest Method for High Performance Sequential File I/O in C++?

I have looked around a little bit and I am still not sure of the answer to this question.

When reading from a text file with an arbitrary word on every line, what would be the absolute fastest way of reading the words from that file? The scope of the project requires the fastest possible file read.

Using Visual Studio on Windows 7. No cross platform consideration.

Edit: Keep in mind, this file read is a one time thing, it will not be read from again and it will not be written to. The program starts, reads from the file, pushes it into a data structure and the loadFile() function is never called again.
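For reference, the naive baseline I'd be starting from is something like this (the function name matches the loadFile() mentioned above; the rest is illustrative):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Straightforward baseline: read one whitespace-delimited word at a time
// and push each into the container. Called exactly once at startup.
std::vector<std::string> loadFile(const char* path) {
    std::vector<std::string> words;
    std::ifstream in(path);
    std::string word;
    while (in >> word)
        words.push_back(word);
    return words;
}
```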

that_guy

5 Answers


The fact that you have this tagged "multithreading" makes me think that you're considering a threaded read on the file. I'd really really recommend you reconsider, as this will cause very hairy concurrency issues to rear their ugly heads. You'll have to delve deep into the rabbit hole of mutexes, semaphores and inter-process communication, which can make even the best developers weep for the good old days before threads.

You have a .txt file, and you have words in that file to read. You have to open the file, and you have to read every word. There's just no getting around it. Unless you're willing to process the text file into a data structure made for concurrent access (intel TBB has some good ones) your best bet might be to just do a single-threaded read and pass data to other threads after everything is local.
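A sketch of that single-threaded pattern (the function name is mine): slurp the whole file into memory in one shot, then split it into words that worker threads can consume read-only once loading is done.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// One bulk read into memory, then an in-memory word split. After this
// returns, other threads can safely read `words` concurrently because
// nothing mutates it anymore.
std::vector<std::string> loadWordsSingleThreaded(const char* path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();                  // single bulk read
    std::istringstream text(buf.str());
    std::vector<std::string> words;
    std::string w;
    while (text >> w)
        words.push_back(w);
    return words;
}
```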

pg1989
  • Ok, thanks for the suggestion, " You have to open the file, and you have to read every word. There's just no getting around it." now my question is, what is the fastest way of doing that? I already have a data structure that I am pushing these words into, so that is not a concern at this point in time. – that_guy Feb 20 '12 at 05:16

Either memory-map the file or read it in large fixed-sized chunks and process the data in memory.

David Schwartz
  • Which is suggested in great detail in the linked question. – Ben Voigt Feb 20 '12 at 04:42
  • 1
    ...and use `FILE_FLAG_SEQUENTIAL_SCAN` when opening the file, thereby giving the OS hints as to how and when it can evict the file's contents from its caches. – reuben Feb 20 '12 at 04:43
  • @Reuben As it happens, Windows assumes sequential access until it sees non-sequential accesses. – David Schwartz Feb 20 '12 at 04:46
  • @DavidSchwartz Are you suggesting that this flag is a no-op? Windows documentation seems to imply otherwise. – reuben Feb 20 '12 at 04:47
  • Why would memory mapping the file have the fastest performance? This is a one-time-only read, on a fresh program execution. From my understanding of memory mapping, this would only be beneficial if I were to go back and use the file again (after it's been mapped). – that_guy Feb 20 '12 at 05:14
  • @Reuben No, it's not a no-op. If you set sequential access, Windows will assume sequential access no matter what. If you don't set it, Windows will assume sequential access until it sees a non-sequential access. If you only access the file sequentially, then there's no difference between setting the flag and not setting it. – David Schwartz Feb 20 '12 at 05:14
  • @that_guy It's not obvious, but if you look very closely at what your code actually does, you'll see it winds up reading more than once. For example, you typically have to read once to find the end of line characters and then read again to actually process the data. But the main point is that memory mapping avoids having to copy the data. The `read` function will typically copy the file data from the kernel's buffer (it has already read ahead) to the application's buffer. Memory mapping allows the application to read out of the kernel's buffer without that copy. – David Schwartz Feb 20 '12 at 05:16
  • Ok cool, could you possibly point me in the direction of some code, memory mapping and possibly an accompanying file read to go with it? – that_guy Feb 20 '12 at 05:24
  • Start with [MSDN](http://msdn.microsoft.com/en-us/library/ms810613.aspx). – David Schwartz Feb 20 '12 at 05:27
  • Totally unrelated, but I just saw you [here](https://bitcointalk.org/index.php?topic=30823.0). ...And [here](http://bitcoin.stackexchange.com/questions/2072/its-impossible-to-gpu-mine-without-opencl). Weird. :) – Mateen Ulhaq Feb 20 '12 at 07:35
  • 2
    See also [Raymond Chen's inside explanation](http://blogs.msdn.com/b/oldnewthing/archive/2012/01/20/10258690.aspx) – MSalters Feb 20 '12 at 08:51
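To make the MSDN pointer above concrete, a minimal Win32 memory-mapping sketch might look like the following. It is Windows-only, most error handling is omitted, and the function name is mine; treat it as an outline, not production code.

```cpp
#include <windows.h>
#include <string>
#include <vector>

// Map the whole file read-only and scan the mapped bytes directly.
// There are no read() copies: we parse straight out of the page cache.
std::vector<std::string> loadWordsMapped(const wchar_t* path) {
    std::vector<std::string> words;
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return words;

    LARGE_INTEGER size;
    GetFileSizeEx(file, &size);
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    const char* data = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));

    std::string word;
    for (LONGLONG i = 0; i < size.QuadPart; ++i) {
        char c = data[i];
        if (c == '\n' || c == '\r' || c == ' ' || c == '\t') {
            if (!word.empty()) { words.push_back(word); word.clear(); }
        } else {
            word += c;
        }
    }
    if (!word.empty()) words.push_back(word);

    UnmapViewOfFile(data);
    CloseHandle(mapping);
    CloseHandle(file);
    return words;
}
```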

As I understand your question, your objective is to read a file of words and insert each word into some data structure, and you want this read+insertion to be as fast as possible. (I won't debate the rationale for or the wisdom of this; I'll just accept it as a requirement. :-) ) If my understanding is correct, then perhaps an alternative approach would be to write a utility program that reads the file of words, inserts them into the data structure, and then serializes that data structure to a file (say BLOB.dat, for example). Your main program then deserializes BLOB.dat into the data structure you require. Essentially, you pre-process the words file into an intermediate binary format that can be loaded into your data structure most efficiently. Or would this be cheating in your scenario?
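That preprocessing idea might be sketched like this for a flat word list (the BLOB.dat layout here, a count followed by length-prefixed strings, is my invention):

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Utility program side: serialize as <count>, then <length><bytes> per word.
void saveBlob(const char* path, const std::vector<std::string>& words) {
    std::ofstream out(path, std::ios::binary);
    uint32_t n = static_cast<uint32_t>(words.size());
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    for (const auto& w : words) {
        uint32_t len = static_cast<uint32_t>(w.size());
        out.write(reinterpret_cast<const char*>(&len), sizeof len);
        out.write(w.data(), len);
    }
}

// Main program side: two sized reads per word, no text parsing at all.
std::vector<std::string> loadBlob(const char* path) {
    std::ifstream in(path, std::ios::binary);
    uint32_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    std::vector<std::string> words(n);
    for (auto& w : words) {
        uint32_t len = 0;
        in.read(reinterpret_cast<char*>(&len), sizeof len);
        w.resize(len);
        in.read(&w[0], len);
    }
    return words;
}
```

The same idea extends to serializing the final data structure itself (e.g. a sorted array or hash table image) so the main program skips even the insertion work.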

Wilf Rosenbaum

Do not memory map the file. As Raymond Chen explains, that kills the sequential access optimization. Since disks are slow, prefetching will keep the disk busy and therefore the throughput higher.

MSalters
  • Raymond doesn't say that memory mapping is slower, he just says it doesn't go through the cache manager. – Ben Voigt Feb 23 '12 at 21:22

Your file will probably load about as fast as it is able to; after all, most file-reading approaches end up making the same system calls. IOStreams is said to be slower than cstdio, but I suggest you use a profiling tool to find the best set of options. Tweak the buffer size to match your needs. Unfortunately, with large files most of your time will be spent waiting for I/O, and only a minuscule fraction on processing, so tweaking how you load won't buy you much.

But since you are going to wait make sure that you use your time wisely.

Spawn a thread to load the file immediately when the application starts, and use that time to do anything else. If you need the data for anything, pass chunks of the read file to the other thread to process.
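That startup pattern might look something like this with std::async (a sketch; the function name is mine):

```cpp
#include <fstream>
#include <future>
#include <sstream>
#include <string>
#include <vector>

// Kick off the file load on a background thread at startup. The main
// thread keeps doing other initialization and calls .get() on the
// returned future only when it actually needs the words.
std::future<std::vector<std::string>> startLoad(const std::string& path) {
    return std::async(std::launch::async, [path] {
        std::ifstream in(path, std::ios::binary);
        std::ostringstream buf;
        buf << in.rdbuf();
        std::istringstream text(buf.str());
        std::vector<std::string> words;
        std::string w;
        while (text >> w) words.push_back(w);
        return words;
    });
}
```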

daramarak