0

In the first prototype of my application, I have to read around 400,000 files (each about 4 KB, roughly 1.5 GB in total) sequentially from the hard disk, perform some operation on the data read from each file, and store the results in RAM. With this approach, I was first doing I/O for a file and then using the CPU for the operation, then moving on to the next file, but it was a very slow process.

As a workaround, I now first read all the files and store their data in RAM, and only then perform the operation (using the CPU). This gave a significant improvement.

But in the second phase of development, I have to read 20 GB of data, which I cannot store in RAM. And reading each file and processing it one at a time is very time-consuming.

Can someone please suggest a way to work around this problem?

I am developing this application on Windows in C, with the Visual Studio compiler.

JJJ
  • 32,902
  • 20
  • 89
  • 102
hari
  • 41
  • 1
  • 7
  • 1
    It all depends on what exactly you are doing and the correlations you are making. Do you need *all* the data in memory at the same time? – cegfault Dec 17 '12 at 18:56
  • It would be better if I could have all the data in memory. But as 20 GB can't be stored in RAM, I was mainly optimizing the file access time. Will a database like SQLite help for this purpose? – hari Dec 17 '12 at 19:12

3 Answers

4

There's a technique called Asynchronous I/O (AIO) that lets you keep doing some processing with the CPU while a file is read in the background. You can use this to read the next few files at the same time as you're processing a file.

The various AIO calls are OS-specific. On Windows, Microsoft calls it "Overlapped I/O". See this Wikipedia page or this MSDN page for more info.
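For illustration, a minimal sketch of a single overlapped read might look like this (the file name "input.dat", the buffer size, and the error handling are placeholders, not anything from the question):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE h;
        OVERLAPPED ov;
        DWORD bytesRead = 0;
        static char buffer[4096];

        /* Open the file for overlapped (asynchronous) I/O. */
        h = CreateFileA("input.dat", GENERIC_READ, FILE_SHARE_READ, NULL,
                        OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        if (h == INVALID_HANDLE_VALUE) {
            fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
            return 1;
        }

        ZeroMemory(&ov, sizeof ov);
        ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL); /* signalled on completion */

        /* Start the read; it may finish immediately or return ERROR_IO_PENDING. */
        if (!ReadFile(h, buffer, sizeof buffer, NULL, &ov) &&
            GetLastError() != ERROR_IO_PENDING) {
            fprintf(stderr, "ReadFile failed: %lu\n", GetLastError());
            return 1;
        }

        /* ...do CPU work on previously loaded data here... */

        /* Block only when the data is actually needed. */
        if (GetOverlappedResult(h, &ov, &bytesRead, TRUE))
            printf("read %lu bytes\n", bytesRead);

        CloseHandle(ov.hEvent);
        CloseHandle(h);
        return 0;
    }

In your case you would keep several such reads in flight (one OVERLAPPED structure per file) and process each buffer once GetOverlappedResult reports it complete.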

user9876
  • 10,954
  • 6
  • 44
  • 66
  • Apparently not brilliantly. See http://stackoverflow.com/questions/7430959/how-to-make-createfile-as-fast-as-possible Of course, OP needs to try it in his specific case. – user9876 Dec 18 '12 at 10:42
1

As a workaround, I now first read all the files and store their data in RAM, and only then perform the operation (using the CPU).

(Assuming files can be processed independently...)

You are half-way there. Instead of waiting until all files have been loaded into RAM, start processing as soon as any file is loaded. That would be a form of pipelining.

You'll need three components:

  1. A thread¹ that reads files ("producer").
  2. A thread² that processes the files ("consumer").
  3. A message queue³ between them.

The producer reads the files the way you are already doing it, but instead of processing them, it just enqueues them on the message queue. The consumer thread waits until it can dequeue a file from the queue, processes it, immediately frees the memory occupied by the file, and resumes waiting on the queue.
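As a rough sketch (the names and the fixed capacity of 64 are illustrative, not prescriptive), such an in-memory blocking queue can be built from a critical section and two condition variables; the producer calls queue_put with each loaded file, and the consumer calls queue_get, processes the buffer, and frees it:

    #include <windows.h>

    #define QUEUE_CAP 64   /* illustrative fixed capacity */

    /* A minimal bounded blocking queue holding pointers to loaded
       file buffers; names here are illustrative, not from the answer. */
    typedef struct {
        void *items[QUEUE_CAP];
        int head, tail, count;
        CRITICAL_SECTION lock;
        CONDITION_VARIABLE not_empty;
        CONDITION_VARIABLE not_full;
    } queue_t;

    void queue_init(queue_t *q)
    {
        q->head = q->tail = q->count = 0;
        InitializeCriticalSection(&q->lock);
        InitializeConditionVariable(&q->not_empty);
        InitializeConditionVariable(&q->not_full);
    }

    /* Producer: block while the queue is full, then enqueue one loaded file. */
    void queue_put(queue_t *q, void *item)
    {
        EnterCriticalSection(&q->lock);
        while (q->count == QUEUE_CAP)
            SleepConditionVariableCS(&q->not_full, &q->lock, INFINITE);
        q->items[q->tail] = item;
        q->tail = (q->tail + 1) % QUEUE_CAP;
        q->count++;
        LeaveCriticalSection(&q->lock);
        WakeConditionVariable(&q->not_empty);
    }

    /* Consumer: block while the queue is empty, then dequeue one file. */
    void *queue_get(queue_t *q)
    {
        void *item;
        EnterCriticalSection(&q->lock);
        while (q->count == 0)
            SleepConditionVariableCS(&q->not_empty, &q->lock, INFINITE);
        item = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        LeaveCriticalSection(&q->lock);
        WakeConditionVariable(&q->not_full);
        return item;
    }

The bounded capacity also caps memory use: the producer automatically stalls when it gets too far ahead of the consumer.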

If you can process files by sequentially traversing them start-to-finish, you could even devise a more fine-grained "streaming", where files would be both read and processed in chunks. That could lower peak memory consumption even more (e.g. if you have some extra-large files that would no longer need to be kept whole in memory).


¹ Or a set of threads to parallelize the I/O, if you anticipate reading from multiple physical disks.

² Or a set of threads to saturate the CPU cores, if processing a file is not cheaper than reading it.

³ You don't need a fancy persistent distributed message queue for that. Just a plain in-memory queue, à la BlockingCollection in .NET (I'm sure you'll find something similar for pure C).

Branko Dimitrijevic
  • 50,809
  • 10
  • 93
  • 167
0
  1. Create threads (in a loop) that read files into RAM.
  2. Work with the data in RAM in separate thread[s] and free the RAM after processing.
  3. Keep the limits and a pool of records about the files (read and processed) in a shared object protected by a mutex.
  4. Use a semaphore for producer/consumer synchronisation of the resources (files in RAM); see the sketch below.
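A minimal sketch of point 4, assuming a pair of counting semaphores and an arbitrary cap of 256 files held in RAM at once (the names and the limit are illustrative, not from the answer):

    #include <windows.h>

    /* Illustrative only: two counting semaphores cap how many file
       buffers may sit in RAM at once. */
    #define MAX_BUFFERED_FILES 256

    HANDLE slots;   /* free slots available to the reader threads      */
    HANDLE ready;   /* loaded files waiting to be processed by workers */

    void init_sync(void)
    {
        slots = CreateSemaphore(NULL, MAX_BUFFERED_FILES, MAX_BUFFERED_FILES, NULL);
        ready = CreateSemaphore(NULL, 0, MAX_BUFFERED_FILES, NULL);
    }

    /* Reader thread: acquire a free slot before loading the next file. */
    void producer_step(void)
    {
        WaitForSingleObject(slots, INFINITE);
        /* ...read one file into RAM and add its record to the
           mutex-protected shared pool... */
        ReleaseSemaphore(ready, 1, NULL);
    }

    /* Worker thread: wait for a loaded file, process it, then free the slot. */
    void consumer_step(void)
    {
        WaitForSingleObject(ready, INFINITE);
        /* ...take one record from the shared pool, process it,
           and free its memory... */
        ReleaseSemaphore(slots, 1, NULL);
    }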
oleg_g
  • 512
  • 3
  • 7