
I have a relatively simple question. There has been an ongoing discussion in many programming languages about which method provides the fastest file read, mostly debating read() vs. mmap(). As someone who has also participated in these debates, I failed to find an answer to my current problem, because most answers help in the situation where the file to read is huge (for example, how to read a 10 TB text file...).

But my problem is a bit different: I have lots of files, let's say 100 million, and I want to read the first 1-2 lines from each of them. Whether a file is 10 KB or 100 TB is irrelevant; I just want the first one or two lines from every file, so I want to avoid reading or buffering the unnecessary parts of the files. My knowledge was not enough to thoroughly test which method is faster, or to discover what all my options are in the first place.

What I am doing right now (I am doing this multithreaded for the moment):

std::string line;
for (const auto& p : std::filesystem::recursive_directory_iterator(path)) {
    if (!std::filesystem::is_directory(p)) {
        std::ifstream read_file(p.path().string());
        if (read_file.is_open()) {
            for (int i = 0; i < 2 && std::getline(read_file, line); ++i) {
                // Process the first two lines here.
            }
        }
    }
}

What do C++ or the Linux environment provide me with in this situation? Is there a faster or more efficient way to read small portions of millions of files?

Thank you for your time.

Info: I have access to C++20 and Ubuntu 18.04

Rockybilly
  • You could get rid of the `if (!std::filesystem::is_directory(p)) {` check and let the `ifstream` constructor fail if it's a directory. This saves an `fstat` call – Jean-François Fabre Sep 20 '20 at 18:29
  • @Jean-FrançoisFabre: `ifstream` will happily "open" a directory.... – Tony Delroy Sep 20 '20 at 18:33
  • not on my machine, it doesn't – Jean-François Fabre Sep 20 '20 at 18:39
  • ifstream introduces too much overhead out of the starting gate. You won't be able to beat an intelligent use of `mmap`, in terms of performance. – Sam Varshavchik Sep 20 '20 at 18:40
  • Rockybilly: *"My knowledge was not enough to thoroughly test which method is faster, or to discover what are all my options in the first place."* - sounds like you know about ifstreams and memory mapping - there's lots of example code around for memory mapping (it's even easier if you use boost). There's no substitute for implementing both and taking some performance measurements. It may be the most important thing is tuning your parallelism so the filesystem(s) involved are not wasting time seeking back and forth across the same device, but are working on all devices. – Tony Delroy Sep 20 '20 at 18:40
  • The right way to solve this is to index the files when they are written. There's no way to "quickly" access millions of files in any language. – rustyx Sep 20 '20 at 18:41
  • @Jean-FrançoisFabre: that's not terribly informative. It does work on my Ubuntu 20.04 LTS machine, with GCC 10.2 and the following code: `if (std::ifstream f{"c++20"}) std::cout << "worked\n";`, where "c++20" is a subdirectory of the current directory. OP mentions Ubuntu. – Tony Delroy Sep 20 '20 at 18:42
  • on windows `is_open` returns `false` – Jean-François Fabre Sep 20 '20 at 18:43
  • @Jean-FrançoisFabre: for whatever it's worth, on Linux a `getline()` call on the open `ifstream` will fail, so if the number of subdirectories is small compared to the number of files, your idea of treating everything like a file until it fails may still have merit. – Tony Delroy Sep 20 '20 at 18:46
  • @SamVarshavchik `mmap()` doesn't help unless the first two lines of the file comprise multiple kilobytes. – EOF Sep 20 '20 at 18:56
  • @TonyDelroy Even better if `getline` is able to know the difference. We don't even need `is_open` – Jean-François Fabre Sep 20 '20 at 19:07
  • What is the workflow? Do you just come up to the computer with 100M files and want to process them? Are new files generated while you are traversing the file system? Can you pre-process your files before doing your "reading"? – Vlad Feinstein Sep 20 '20 at 19:20
  • @VladFeinstein My program starts on a computer already filled with (for example) 100M files, so indexing when a file is created (as mentioned above) is not an option, I am afraid. I have to work with existing data. Moreover, the program is in the development stage, so in every runtime test I have to read the 100M files again so the program can be tested. Which is why I want to decrease the time. – Rockybilly Sep 20 '20 at 19:28
  • The more interesting question is whether this can be parallelized profitably. – EOF Sep 20 '20 at 19:40
  • @EOF I am parallelizing it to some extent. Currently I am distributing the threads to process the folders containing the files, which results in a somewhat imbalanced distribution (files are not spread equally across folders), but iterating over all files first to distribute them equally to each thread would increase the workload. But to answer your question, 1 thread reaches about 400 Mb/s doing this work; I was able to increase that up to 1.8 Gb/s using multithreading. But I want more :) – Rockybilly Sep 20 '20 at 19:44
  • @Rockybilly did you try async I/O? You don't necessarily want more threads, you want more outstanding queued requests. Ideally you probably want each request to have a continuation that checks whether the two lines have been read, and which queues a further read on the file if not. – EOF Sep 20 '20 at 19:51
  • @EOF I don't have much knowledge in that area. I will look into that. Currently I am queuing folders to be handled by multiple threads when they are free. – Rockybilly Sep 20 '20 at 19:55
  • Then I would recommend looking into it. Disk-intensive applications like databases tend to use async-I/O, and your access pattern may not be entirely different from theirs. You'll probably want to search for Linux aio (or io_uring). – EOF Sep 20 '20 at 20:00
  • @EOF Wouldn't handling a hundred million success events be hard? – Rockybilly Sep 20 '20 at 22:37
  • @Rockybilly No. Why would it? – EOF Sep 21 '20 at 10:54
  • @EOF I was considering if an increased overhead may be the situation. – Rockybilly Sep 21 '20 at 11:22
  • @Rockybilly What *kind* of overhead? You're I/O limited (and probably I/O-*latency* limited at that). If you use threads, you add (needless) memory use (mostly for thread stacks). If you instead have an event loop, you save 1) context switches 2) stack memory 3) synchronization 4) sanity. win/win/win/win. – EOF Sep 21 '20 at 14:48
  • @EOF Thank you for your input. I should mention this. The machine I am working on has around 40 cores, 120 GB RAM, a good industrial SSD, I have all the resources. I just want to frickin' maximize these resources to the max. – Rockybilly Sep 21 '20 at 14:51
  • @Rockybilly No, you do not want to maximize resource usage (unless you are insane), you want to maximize performance. If your performance bottleneck is I/O latency, adding threads is the wrong approach, regardless of whether you have plenty of processor cores idle. – EOF Sep 21 '20 at 14:53
  • @EOF That's what I meant, man :| Of course I don't want to waste resources. What I mean is, if the SSD can read up to 3.0 Gb/s, I don't want my program to use only 1 Gb/s of that amount. So it is a matter of efficiency, not a burnout test :D I will try your suggestions, thank you. – Rockybilly Sep 21 '20 at 15:00
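
To make the async-I/O suggestion from the comments concrete, below is a rough liburing sketch (not the poster's code): it queues one small read per already-opened file and then reaps the completions. It assumes a kernel with io_uring support (5.1+, i.e. newer than the stock Ubuntu 18.04 kernel) and linking with -luring; the file list and the 512-byte buffers are placeholder choices for illustration.

#include <liburing.h>   // liburing convenience wrappers around the io_uring syscalls
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>
#include <array>
#include <cstdio>
#include <string>
#include <vector>

int main()
{
    // Placeholder file list; in practice it would come from the directory walk.
    std::vector<std::string> paths = {"a.txt", "b.txt"};

    io_uring ring;
    if (io_uring_queue_init(64, &ring, 0) < 0) return 1;

    std::vector<std::array<char, 512>> bufs(paths.size());
    std::vector<iovec> iovs(paths.size());
    std::vector<int> fds(paths.size(), -1);

    // Queue one read per file: the first 512 bytes, starting at offset 0.
    unsigned queued = 0;
    for (size_t i = 0; i < paths.size(); ++i) {
        fds[i] = open(paths[i].c_str(), O_RDONLY);
        if (fds[i] < 0) continue;
        iovs[i] = {bufs[i].data(), bufs[i].size()};
        io_uring_sqe* sqe = io_uring_get_sqe(&ring);   // assumes fewer files than queue entries
        io_uring_prep_readv(sqe, fds[i], &iovs[i], 1, 0);
        io_uring_sqe_set_data(sqe, reinterpret_cast<void*>(i));
        ++queued;
    }
    io_uring_submit(&ring);

    // Reap one completion per queued read; cqe->res is the byte count or -errno.
    for (unsigned done = 0; done < queued; ++done) {
        io_uring_cqe* cqe = nullptr;
        if (io_uring_wait_cqe(&ring, &cqe) < 0) break;
        size_t i = reinterpret_cast<size_t>(io_uring_cqe_get_data(cqe));
        if (cqe->res > 0) {
            // Scan bufs[i].data() over the first cqe->res bytes for the first two '\n' here.
            std::printf("%s: got %d bytes\n", paths[i].c_str(), cqe->res);
        }
        io_uring_cqe_seen(&ring, cqe);
    }

    for (int fd : fds) if (fd >= 0) close(fd);
    io_uring_queue_exit(&ring);
}

A real version would submit in batches sized to the queue depth and keep many reads in flight, which is the point of the suggestion above.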

3 Answers


You can save one underlying call to fstat by not testing whether the path is a directory, and relying on the is_open test instead:

#include <iostream>
#include <fstream>
#include <filesystem>
#include <string>

int main()
{
    std::string line, path = ".";
    for (const auto& p : std::filesystem::recursive_directory_iterator(path)) {
        std::ifstream read_file(p.path().string());
        if (read_file.is_open()) {
            std::cout << "opened: " << p.path().string() << '\n';
            while (getline(read_file, line)) {
                // Get two lines here.
            }
        }
    }
}

At least on Windows, this code skips the directories. And, as suggested in the comments, the is_open test can even be skipped, since getline doesn't read anything from a directory either.

Not the cleanest, but if it can save time it's worth it.

Jean-François Fabre

Any function in a program that accesses a file under Linux will end up invoking some "system calls" (for example read()).

All other functions available in a given programming language (like fread(), fgets(), std::filesystem, ...) call functions or methods that in turn call some system calls.

For this reason you can't be faster than calling the system calls directly.

I'm not 100% sure, but I think in most cases, the combination open(), read(), close() will be the fastest method for reading data from the start of a file.

(If the data is not located at the start of the file, pread() might be faster than read(); I'm not sure.)

Note that read() does not read a certain number of lines but a certain number of bytes (e.g. into an array of char), so you have to find the end(s) of the line(s) "manually" by searching for the '\n' character(s) and/or the end of the file in the array of char.

Unfortunately, a line may be much longer than you expect, so the first N bytes of the file may not contain the first M lines, and you have to call read() again.
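
For illustration, here is a rough sketch of that open()/read()/close() sequence in C++; the helper name, the caller-supplied buffer and the "one read is usually enough" assumption are choices made for this example, not a fixed recipe:

#include <fcntl.h>      // open
#include <unistd.h>     // read, close
#include <cstring>      // std::memchr
#include <string_view>

// Returns up to the first two lines of `path`, read into the caller-supplied
// buffer (which can be reused across files to avoid per-file allocations);
// returns an empty view on error.
std::string_view first_two_lines(const char* path, char* buf, size_t bufsize)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return {};
    ssize_t n = read(fd, buf, bufsize);   // read() on a directory fails with EISDIR
    close(fd);
    if (n <= 0)
        return {};
    // Look for the second '\n'; if the two lines don't fit into `bufsize`
    // bytes, a second read() (or pread() at offset n) would be needed.
    const char* first = static_cast<const char*>(std::memchr(buf, '\n', n));
    if (!first)
        return {buf, static_cast<size_t>(n)};
    const char* second = static_cast<const char*>(
        std::memchr(first + 1, '\n', static_cast<size_t>(buf + n - (first + 1))));
    return {buf, second ? static_cast<size_t>(second - buf) : static_cast<size_t>(n)};
}

The caller would keep one buffer per thread (512 or 1024 bytes, see the trade-off below) and call this for every path produced by the directory walk.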

In this case it depends on your system (e.g. file system or even hard disks) how many bytes you should read in each call to read() to get the maximum performance.

Example: Let's say that in 75% of all files, the first N lines are found within the first 512 bytes of the file; in the other 25% of all files, the first N lines are longer than 512 bytes in total.

On some computers, reading 1024 bytes at once might require nearly the same time as reading 512 bytes, but reading 512 bytes twice will be much slower than reading 1024 bytes at once; on such computers it makes sense to read() 1024 bytes at once: You save a lot of time for 25% of the files and you lose only very little time for the other 75%.

On other computers, reading 512 bytes is significantly faster than reading 1024 bytes; on such computers it would be better to read() 512 bytes: Reading 1024 bytes would save you only little time when processing the 25% of files but cost you very much time when processing the other 75%.

I think in most cases this "optimal value" will be a multiple of 512 bytes, because most modern file systems organize files in units whose size is a multiple of 512 bytes.

Martin Rosenau
  • Currently I am doing `getline(read_file, line)`. But I can say that the data I want is at the beginning of the file, in the first line, and fortunately the first line is indeed short (it contains only the data I need). That's why I wanted to devise a strategy to handle my situation better than simple file reading. – Rockybilly Sep 20 '20 at 19:52
  • @Rockybilly If you are sure that the information is always found in the first `N` bytes of the file and `N<=512`, the sequence `open()`, `read(...,N)`, `close()` seems to be the fastest possibility you have. If speed is important, you should also consider not using the data type `string` but working on the array of `char`. You should avoid using dynamic memory allocation (`new`, `malloc()`...) when processing each file, and instead allocate all the resources you need once at the beginning. – Martin Rosenau Sep 21 '20 at 05:26

I was just typing something similar to Martin Rosenau's answer (when his popped up): do an unstructured read of the maximum length of two lines. But I would go further: queue that text buffer together with the corresponding file name and let another thread parse/analyze it. If parsing takes about the same time as reading, you can save half of the time. If it takes longer (unlikely), you can use multiple parsing threads and save even more.
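
A rough sketch of that split, assuming one reader thread feeding one parser thread through a simple mutex-protected queue (the Item type and the parsing step are placeholders; a real version would tune the queue and the number of parser threads):

#include <condition_variable>
#include <filesystem>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

struct Item { std::string path, text; };   // file name plus its first lines

std::queue<Item> work;
std::mutex m;
std::condition_variable cv;
bool reading_done = false;

void reader(const std::string& root)
{
    for (const auto& p : std::filesystem::recursive_directory_iterator(root)) {
        if (std::filesystem::is_directory(p)) continue;
        std::ifstream in(p.path());
        std::string text, line;
        for (int i = 0; i < 2 && std::getline(in, line); ++i)
            text += line + '\n';
        {
            std::lock_guard lk(m);
            work.push({p.path().string(), std::move(text)});
        }
        cv.notify_one();
    }
    { std::lock_guard lk(m); reading_done = true; }
    cv.notify_all();
}

void parser()
{
    for (;;) {
        std::unique_lock lk(m);
        cv.wait(lk, [] { return !work.empty() || reading_done; });
        if (work.empty()) return;              // reader finished and queue drained
        Item item = std::move(work.front());
        work.pop();
        lk.unlock();
        // ... parse/analyze item.text here, off the reading thread ...
    }
}

int main()
{
    std::thread producer(reader, std::string("."));
    std::thread consumer(parser);
    producer.join();
    consumer.join();
}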

Side note - you should not parallelize reading (tried that).

It may be worth experimenting: can you open one file, read it asynchronously while proceeding to open the next one? I don't know if any OS can overlap those things.

Vlad Feinstein
  • I was able to parallelize this because the files are distributed among some folders (which removes the need for knowing the files first to share them between threads), and I was able to increase disk read speed up to 6x. But the reason I opened this question is that I could be reading more than necessary from the files, so my SSD could be spending some resources on unnecessary operations. I wanted to know if something like a magical low-level system call existed for this (or something more logical, of course) :) – Rockybilly Sep 20 '20 at 19:48
  • *How* have you "tried that"? Lots of systems today are optimized for many concurrent accesses to storage (modern ssds can have tens of thousands of concurrent accesses). – EOF Sep 20 '20 at 19:49
  • @EOF - I timed a multi-threaded read on my iMac (flash storage), and didn't measure any benefits. If that matters, I've done it using Go's concurrency. – Vlad Feinstein Sep 20 '20 at 20:31
  • How does a single failed test of unspecified read concurrency on MacOS in Go relate to hundreds of thousands of files to be read on Linux in C++? – EOF Sep 20 '20 at 20:34
  • @EOF - withdrawn. But are my other suggestions valid? – Vlad Feinstein Sep 20 '20 at 21:13
  • I'm not a great fan of throwing threads at problems, especially I/O-bound problems. I consider it a terrible idea, in fact. I'd recommend async I/O (which may involve handing tasks to a threadpool internally, but not spawning raw threads willy-nilly) instead. – EOF Sep 20 '20 at 21:17
  • On a computer with multiple CPUs or multiple CPU cores this would make sense: One core would call the system calls while the other core processes the data read... – Martin Rosenau Sep 21 '20 at 05:31
  • @Rockybilly Most (hardware) devices store data in the form of "blocks" or "sectors", and you can only read whole blocks. If the block size is 512 bytes (which was the normal case for a long time), Linux can only read a multiple of 512 bytes. More modern hardware uses 4096 bytes instead of 512 bytes. However, Linux will typically read multiple blocks and write the "rest" to the cache. I doubt that you have a possibility to change this. However, if the files that you process are smaller than 512 bytes and they are located next to each other on the disk, this "unnecessary step" will even save time! – Martin Rosenau Sep 21 '20 at 05:37
  • @Rockybilly In [another question](https://stackoverflow.com/questions/15266115/read-file-without-disk-caching-in-linux) the answer says that `posix_fadvise()` might be used... – Martin Rosenau Sep 21 '20 at 05:41