
By "read files" I mean I will read every document (doc, docx, xls, xml, txt, ...) on my hard disk.

Most of my files will be about 10 KB to 1 MB, I think.

I'll read each file and filter its text for certain specific words.

So my guess is I should have a thread pool, with one thread reading files and the other threads doing the filtering (roughly the structure sketched below).

I've heard of MMF (memory-mapped files), CreateFile/ReadFile, and I/O completion ports for reading the files.

Which function should I use?
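
Here's the producer/consumer structure I have in mind, as a rough sketch only; `load_file` is just a stand-in for whichever read API turns out to be fastest:

```cpp
#include <condition_variable>
#include <fstream>
#include <iterator>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Placeholder loader -- this is the part I'm asking about
// (plain ifstream here; could be ReadFile, MMF, etc.).
std::string load_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

std::queue<std::string> work;   // file contents waiting to be filtered
std::mutex m;
std::condition_variable cv;
bool done = false;

// One thread reads files and pushes their contents onto the queue.
void reader(const std::vector<std::string>& paths) {
    for (const auto& p : paths) {
        std::string content = load_file(p);
        {
            std::lock_guard<std::mutex> lock(m);
            work.push(std::move(content));
        }
        cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_all();
}

// The remaining threads pop contents and scan for the word.
void filter_worker(const std::string& word) {
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !work.empty() || done; });
        if (work.empty()) return;   // queue drained and reader finished
        std::string content = std::move(work.front());
        work.pop();
        lock.unlock();
        if (content.find(word) != std::string::npos) {
            // record a match (omitted)
        }
    }
}

int main() {
    std::vector<std::string> paths = {"a.txt", "b.txt"};  // stand-in list
    std::thread r(reader, paths);
    std::vector<std::thread> pool;
    for (int i = 0; i < 3; ++i)
        pool.emplace_back(filter_worker, std::string("keyword"));
    r.join();
    for (auto& t : pool) t.join();
}
```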

– Young Hyun Yoo

3 Answers


In my tests, memory mapping the file is the fastest way to load the content into memory, by a small margin.

The tests I performed were on Linux, but the mechanism should carry over: loading a file through a memory-mapped region copies the data a page at a time into memory that is owned by the OS (the backing memory of a memory-mapped file is owned and managed entirely by the OS, so the OS can "lock" those pages in place, and so on). This is quicker than reading a piece of the file into a kernel buffer and then copying that content into the buffer provided by the application, since it avoids one copy. However, for large files (or many small files), the main limiting factor is still how quickly the hard disk can deliver data, which for my system is around 60 MB/s. You can make it slower than what the system produces, but not faster.
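
As a rough illustration (a sketch, not my actual benchmark code), a memory-mapped read-and-scan on Linux looks like this:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <algorithm>
#include <cstring>

// Sketch: map a file read-only and search the mapping in place.
bool file_contains(const char* path, const char* word) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return false;
    }

    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   // the mapping keeps the file referenced
    if (p == MAP_FAILED) return false;

    const char* begin = static_cast<const char*>(p);
    const char* end = begin + st.st_size;
    bool found = std::search(begin, end, word, word + std::strlen(word)) != end;

    munmap(p, st.st_size);
    return found;
}
```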

– Mats Petersson

For pure IO speed, you might want to try CreateFileMapping and MapViewOfFile. I've not measured this under Windows, but using similar techniques under Linux can result in a significant speed up.
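
A rough sketch of the Windows sequence being suggested (untested, error handling trimmed):

```cpp
#include <windows.h>

#include <algorithm>
#include <cstring>

// Sketch: CreateFile -> CreateFileMapping -> MapViewOfFile, then
// search the mapped view in place.
bool FileContains(const wchar_t* path, const char* word) {
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return false;

    LARGE_INTEGER size;
    if (!GetFileSizeEx(file, &size) || size.QuadPart == 0) {
        CloseHandle(file);
        return false;
    }

    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0,
                                        nullptr);
    CloseHandle(file);     // the mapping holds its own reference to the file
    if (!mapping) return false;

    const char* base = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));  // 0 = map whole file
    CloseHandle(mapping);  // the view keeps the mapping alive
    if (!base) return false;

    const char* end = base + size.QuadPart;
    bool found = std::search(base, end, word, word + std::strlen(word)) != end;

    UnmapViewOfFile(const_cast<char*>(base));
    return found;
}
```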

– James Kanze

There is no "fastest" method for reading I/O. You can't get any faster than fread or equivalents. Using threads will not help you, because hard drive I/O will be the main bottleneck anyway.

When bulk reading all the files in your harddrive, your speed will ultimately depend on the speed of your harddrive. It is likely that 95% of the time will be spent waiting on I/O so multi-threading will at most improve speed by 5-6%, but will do nothing like make your program run twice as fast.
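
In other words, something as plain as reading each whole file with a single `fread` is already about as fast as you'll get (a minimal sketch):

```cpp
#include <cstdio>
#include <string>

// Sketch: read an entire file with a single fread call.
std::string read_whole_file(const char* path) {
    std::string content;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return content;

    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);

    if (size > 0) {
        content.resize(static_cast<std::size_t>(size));
        std::size_t got = std::fread(&content[0], 1, content.size(), f);
        content.resize(got);   // trim if the read came up short
    }
    std::fclose(f);
    return content;
}
```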

– sashoalm
  • That doesn't correspond to my measurements (done on Linux). You can definitely beat `fread`. And even if hard drive I/O is the main bottleneck, analysing some of the formats he mentions will require some CPU as well, so there will be some gain from using threads. – James Kanze May 08 '13 at 10:14
  • Will it make his program twice as fast? Or will he optimize it to run 5% faster at great effort? The way I see it, he is (wrongly) thinking that he can significantly speed up his program by using esoteric functions, when in fact reading all the files on your hard drive can't be optimized beyond `fread`. – sashoalm May 08 '13 at 10:17
  • It depends on the implementation of `fread`, but generally, speedups of 50% or more aren't unusual. – James Kanze May 08 '13 at 10:31
  • Does this depend on lazy reading? In his use case he will parse the entire file contents anyway, searching for words, if I understand correctly. I assumed that memory mapping will not be faster if you will read the entire contents anyway. – sashoalm May 08 '13 at 10:33
  • I'm not sure what you mean by lazy reading, but... I'll admit that the results surprised me some myself. Under most Unix variants, when you open a file, the system will read ahead as long as you don't seek, so you should get some implicit parallelization (at the cost of extra copies between buffers). When a file is mmapped, I don't think there is a read until you get a page fault, which stalls the thread in question (but I've not looked at any recent Unix sources to be sure). Measurements are measurements, however, and using `mmap` is significantly faster than any of `read`, `fread` or `std::ifstream`. – James Kanze May 08 '13 at 10:47
  • I did some tests on `fread` versus `read` a while ago, and asked this question: http://stackoverflow.com/questions/13171052/what-goes-on-behind-the-curtains-during-disk-i-o – paddy May 10 '13 at 00:23
  • I looked into this over a decade ago (on Windows NT 3.5), and found that (for software source files at least) my speed seemed to be mostly determined by how many I/Os I performed. In other words, allocating enough memory and then reading in the entire file at once turned out to be faster than reading it in fixed-sized batches, and far, far faster than naively calling the built-in language I/O routines every time I wanted a new line. – T.E.D. Dec 30 '13 at 22:31