
I have an application that loads files and processes data. Let's assume I have roughly 10–20 files to process.

Some requirements, to make the question clearer:

  • files are small, maybe a few MB max
  • there might be a dozen files, maybe a hundred
  • examples include parsing CSV or JSON data, or loading 3D game models

One idea is to use some thread pool and process files in parallel. Is this efficient? Can my operating system handle file access from multiple threads?

I found this question: Accessing a single file with multiple threads

But in my application one thread would access its "own" file, so there wouldn't be any collisions.

In my application, I'm using C++/STL, but I'd like to know the general opinion about filesystems on Linux and Windows.
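Something like this minimal sketch (C++17, with placeholder file names; the actual parsing/processing is omitted) is roughly what I have in mind:

```cpp
#include <filesystem>
#include <fstream>
#include <future>
#include <sstream>
#include <string>
#include <vector>

// Read one file completely into a string (placeholder for real loading/parsing).
std::string load_file(const std::filesystem::path& path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

int main() {
    // Placeholder file names.
    std::vector<std::filesystem::path> files = {"a.csv", "b.csv", "c.csv"};

    // One async task per file; each task touches only its "own" file.
    std::vector<std::future<std::string>> tasks;
    for (const auto& f : files)
        tasks.push_back(std::async(std::launch::async, load_file, f));

    for (auto& t : tasks) {
        std::string contents = t.get();  // wait for the load, then process "contents" here
    }
}
```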

fen
  • Are you sure that in your particular case the loading takes a lot of time? I'm guessing that it is so quick that your average user doesn't even notice it. – Basile Starynkevitch Jan 04 '19 at 08:04
  • I'm asking a somewhat general question here, so in my simple application the file access is fast (especially if the files are in the system cache). But I was thinking about the general mechanisms here. – fen Jan 04 '19 at 08:07

2 Answers


You need to benchmark. (Probably in your case it could be worth using several threads; however, the loading should be so quick, even done sequentially, that your average user won't notice.)
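For example, a minimal timing harness (just a sketch; the two lambdas stand in for your real sequential and threaded loading code) could look like this:

```cpp
#include <chrono>
#include <cstdio>

// Run a callable once and return the elapsed wall-clock time in seconds.
template <class Fn>
double time_seconds(Fn&& fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    double sequential = time_seconds([] { /* load + parse all files sequentially */ });
    double threaded   = time_seconds([] { /* load + parse all files with a thread pool */ });
    std::printf("sequential: %.3f s, threaded: %.3f s\n", sequential, threaded);
}
```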

In many cases, when you deal with medium-sized files (e.g. less than a dozen megabytes each, or perhaps even half a gigabyte each) which have been accessed recently, these files practically sit in the page cache. So you won't access the disk itself, and your program practically works in RAM (and then multithreading should be effective).

BTW, Linux has readahead(2), posix_fadvise(2), and madvise(2) to give hints to the kernel's virtual memory subsystem (that is, to the page cache).
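For instance (Linux-specific, and just a sketch with a made-up file name), a program can tell the page cache ahead of time how it will read a file:

```cpp
#include <fcntl.h>   // open, posix_fadvise
#include <unistd.h>  // close

int main() {
    // Hypothetical file name; error handling kept minimal.
    int fd = open("data/level1.obj", O_RDONLY);
    if (fd >= 0) {
        // Hint: the whole file (offset 0, length 0) will be read sequentially.
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        // Stronger hint: start bringing it into the page cache now.
        posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
        // ... read and parse the file ...
        close(fd);
    }
    return 0;
}
```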

If your common use case is accessing the disk itself (e.g. because the files are quite big, or because you have not accessed them recently before, so they are not in the page cache), then multi-threading won't help, because the bottleneck becomes the hardware disk.

Remember that a disk (even an SSD one) is many thousands of times slower than RAM, and it does I/O operations sequentially.

Also, you may spend some amount of CPU time parsing the files. If that takes a significant amount of CPU, it is worth running the parsing in several independent threads.

Basile Starynkevitch
  • Thanks! The files I'm considering are small, a few MB maybe. Another example is parallel compilation: builds can do that and basically invoke a compiler for each file. So it should be possible – fen Jan 04 '19 at 07:52
  • Parallel compilation (e.g. `make -j` with C++ & GCC) involves independent processes (GCC itself is not multithreaded) and most of the time the C++ source files are in the page cache (e.g. because you have saved them recently in your editor) – Basile Starynkevitch Jan 04 '19 at 07:53
  • Also, GCC doesn't spend much time in the I/O proper. Run it with `g++ -ftime-report` – Basile Starynkevitch Jan 04 '19 at 08:01

In my experience you get more of a performance gain when the processing of the data is heavy, because in that case you really parallelize the execution of your program. You also need to know how many cores your CPU has; it is not worth having more threads than CPU cores. If your processing is "light", your threads will probably spend most of their time waiting for the disk to finish reading, with little, if any, gain in performance.
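For example, a minimal sketch (the file count is just a placeholder) of choosing the worker count from the number of cores:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <thread>

int main() {
    // hardware_concurrency() may return 0 if it cannot be determined; fall back to 1.
    unsigned cores = std::max(1u, std::thread::hardware_concurrency());

    std::size_t files = 100;  // placeholder: number of files to process
    // Don't spawn more workers than cores, and never more workers than files.
    std::size_t workers = std::min<std::size_t>(cores, files);

    std::printf("using %zu worker threads for %zu files\n", workers, files);
}
```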

alangab
  • Yes, thread pools (system thread pools) usually allocate worker threads to match the core count (plus or minus a few extra threads). So if I have 6 cores and 100 files, I'll probably still be able to process at most 6 files at a time, not 100. – fen Jan 04 '19 at 08:03