
I have a list of files: List<string> Files in my C#-based WPF application.

Files contains ~1,000,000 unique file paths.

I ran a profiler on my application. When I try to do parallel operations, it's REALLY laggy because it's IO bound. It even lags my UI thread, despite my not dispatching anything to it (note the two lines I've marked below):

Files.AsParallel().ForAll(x =>
{
    char[] buffer = new char[0x100000];
    using (FileStream stream = new FileStream(x, FileMode.Open, FileAccess.Read)) // EXTREMELY SLOW
    using (StreamReader reader = new StreamReader(stream, true))
    {
        while (true)
        {
            int bytesRead = reader.Read(buffer, 0, buffer.Length); // EXTREMELY SLOW
            if (bytesRead <= 0)
            {
                break;
            }
        }
    }
});

These two lines of code take up ~70% of my entire profile test runs. I want to achieve maximum parallelization for IO while keeping performance such that it doesn't cripple my app's UI. There is nothing else affecting my performance. Proof: using Files.ForEach doesn't cripple my UI, and WithDegreeOfParallelism helps too (but I am writing an application that is supposed to run on any PC, so I cannot assume a specific degree of parallelism for this computation); also, the PC I am on has a solid-state drive.

I have searched StackOverflow and found links that talk about using asynchronous IO read methods, but I'm not sure how they apply in this case. Perhaps someone can shed some light? Also, is there a way to cut down the constructor time of a new FileStream? Is that even possible?

Edit: Well, here's something strange that I've noticed... the UI doesn't get crushed so badly when I swap Read for ReadAsync while still using AsParallel. Simply awaiting the task created by ReadAsync causes my UI thread to maintain some degree of usability. I think there is some sort of asynchronous scheduling done in this method to maintain optimal disk usage while not crushing existing threads. On that note, is there ever a chance that the operating system contends for existing threads to do IO, such as my application's UI thread? I seriously don't understand why it's slowing my UI thread. Is the OS scheduling IO work onto my thread or something? Did they do something to the CLR so that it eats threads that haven't been explicitly affinitized using Thread.BeginThreadAffinity or something? Memory is not an issue; I am looking at Task Manager and there is plenty.
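
Roughly, the swapped loop looks like this (a sketch; I just block on the task that ReadAsync returns inside the AsParallel lambda):

Files.AsParallel().ForAll(x =>
{
    char[] buffer = new char[0x100000];
    using (FileStream stream = new FileStream(x, FileMode.Open, FileAccess.Read))
    using (StreamReader reader = new StreamReader(stream, true))
    {
        while (true)
        {
            // Swapped Read for ReadAsync; the returned task is waited on here.
            int charsRead = reader.ReadAsync(buffer, 0, buffer.Length).GetAwaiter().GetResult();
            if (charsRead <= 0)
            {
                break;
            }
        }
    }
});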

Alexandru
  • Define "extremely" slow. You know that disk reads are about 100,000x slower than reads from RAM right? – aquinas Nov 08 '14 at 07:32
  • Are you only checking if the files exist ? If you are I would write my own search. Put all fileNames in a list. Then start at the base directory and do a recursive search through all directories. When a file is found remove it from the list. You can then return a list of unfound files with the files that have not been removed from the list. If you are not trying this you should explain more on what you are trying to accomplish. – deathismyfriend Nov 08 '14 at 07:33
  • @deathismyfriend No; I am reading contents. I have a post on the "existential" problem: http://stackoverflow.com/questions/26321366/fastest-way-to-get-directory-data-in-net That's easy, though, and very fast, but no, that's not all I want/need here. – Alexandru Nov 08 '14 at 07:39
  • @aquinas Extremely slow: These two lines of code take up ~70% of my entire profile test runs. Cripples my UI. Yes man, everybody knows that disk reads are slower than RAM reads. I find that a little bit insulting LOL – Alexandru Nov 08 '14 at 07:41
  • Are these files all on the same disk? Or even on some relatively small number of disks? A million files is a _lot_ of files, and it's going to take a fair amount of time to process that no matter what. Even on an SSD. But you can actually make things _worse_ by attempting to parallelize it too much, by causing contention on the one shared resource: the disk. What kind of performance do you get if you do the work single-threaded? And specifically, how does that performance compare to the rated speed of the disk and controller? – Peter Duniho Nov 08 '14 at 07:44
  • @PeterDuniho Performance is MUCH worse single-threaded. Parallelizing makes it 10x faster, at the very least. All files are on the same disk, yes. And yes, contention is for sure happening. Isn't there a way to properly optimize disk contention in .NET? – Alexandru Nov 08 '14 at 07:46
  • @Alexandru: you've already gotten a 10x speedup and you're looking for more? Frankly, I'm surprised that with an unsophisticated parallelization approach you managed that kind of improvement. But given that you did, I'd take that and be happy with it. :) (And the question still remains: how does the current throughput compare to the theoretical maximum bandwidth of the disk and controller?) – Peter Duniho Nov 08 '14 at 07:49
  • @PeterDuniho The problem is that there is so much disk contention that it lags the UI (and while not even dispatching to it!). – Alexandru Nov 08 '14 at 07:51
  • @Alexandru: well, the title of your question doesn't mention the UI (though you briefly touch on it in your post). That said, disk contention doesn't lag the UI; CPU and memory contention does. Anyway, you can limit the degree of concurrency in the `Parallel.ForEach()` by passing a `ParallelOptions` instance where you've set the `MaxDegreeOfParallelism` property. Using a low enough value, you _might_ be able to constrain the I/O side of things enough to leave some CPU left over for your UI. What's "low enough"? You'll have to experiment a bit. – Peter Duniho Nov 08 '14 at 07:58
  • Not constraining parallelism for IO-bound tasks is a big mistake. You'll end up with 100 threads all competing for IO and very low CPU utilisation. And then the thread pool will spawn more threads. Now you have 101 threads all competing for IO and very low CPU utilisation. Open Windows' Resource Monitor and watch your disk queue. Your UI is probably lagging because accessing the page file is now an exercise in contending with your million threads. – ta.speot.is Nov 08 '14 at 08:10
  • @ta.speot.is AsParallel constrains the number of workers acting at a given time by default. It's the minimum of the number of cores you have and another value I can't recall at the moment. ;) – Alexandru Nov 08 '14 at 08:13
  • 64 http://msdn.microsoft.com/en-us/library/vstudio/dd383719(v=vs.100).aspx At any rate, the 100 threads in my comment is incorrect, but the point stands. Making threads compete for IO when you're IO bound is bad. Much easier to buy an SSD. http://ark.intel.com/products/75331/Intel-SSD-530-Series-240GB-2_5in-SATA-6Gbs-20nm-MLC *Latency - Read 80 µs* – ta.speot.is Nov 08 '14 at 08:15
  • @ta.speot.is Well, here's something I noticed...actually, UI doesn't get crushed so bad when I use ReadAsync instead of Read. Simply awaiting the task to finish causes my UI thread to maintain some degree of usability. I think this does some sort of asynchronous scheduling to maintain optimal disk usage while not crushing existing threads. – Alexandru Nov 08 '14 at 08:17
  • I got it working; part of the problem was me not using ReadAsync. The other part was me trying to use an ExtendedObservableCollection with AddRange instead of calling Add multiple times in every UI dispatch...for some reason, the performance of the methods people list in here is actually SLOWER: http://stackoverflow.com/questions/670577/observablecollection-doesnt-support-addrange-method-so-i-get-notified-for-each – Alexandru Nov 08 '14 at 08:39
  • Man, I hope this helps out some other people; spent a few days on this trying different things. Somehow it just came to a close tonight. Hallelujah. – Alexandru Nov 08 '14 at 08:54
  • "These two lines of code take up ~70% of my entire profile test runs." @Alexandru, that doesn't tell me much though. What's the wall clock time? If I have a a while loop that doesn't do anything, and I say: "Man, this loop takes 100% of the time according to the profiler! How do I speed it up!" that doesn't tell me much does it? :) As other's have said: you should post your theoretical times and the actual time. Anyway, sounds like you solved whatever the issue was. – aquinas Nov 08 '14 at 16:03

4 Answers


I don't agree with your assertion that you can't use WithDegreeOfParallelism because the app will be used on any PC. You can base it on the number of CPUs. By not using WithDegreeOfParallelism you are going to get crushed on some PCs.

You optimized for a solid-state disk, where heads don't have to move. I don't think this unrestricted parallel design will hold up on a regular disk (any PC).

I would try a BlockingCollection with 3 queues: FileStream, StreamReader, and ObservableCollection. Limit the FileStream queue to something like 4 - it just has to stay ahead of the StreamReader. And no parallelism.

A single head is a single head. It cannot read from 5 or 5,000 files faster than it can read from 1. On solid state there is no penalty for switching from file to file - on a regular disk there is a significant penalty. If your files are fragmented, there is a significant penalty (on a regular disk).

You don't show what you do with the data, but the next step would be to put the writes in another queue, backed by another BlockingCollection - e.g. do the sb.Append(text); in a separate queue. But that may be more overhead than it is worth. Keeping that head as close to 100% busy on a single contiguous file is the best you are going to do. (A rough sketch of this kind of pipeline follows the code below.)

private async Task<string> ReadTextAsync(string filePath)
{
    using (FileStream sourceStream = new FileStream(filePath,
        FileMode.Open, FileAccess.Read, FileShare.Read,
        bufferSize: 4096, useAsync: true))
    {
        StringBuilder sb = new StringBuilder();

        byte[] buffer = new byte[0x1000];
        int numRead;
        while ((numRead = await sourceStream.ReadAsync(buffer, 0, buffer.Length)) != 0)
        {
            string text = Encoding.Unicode.GetString(buffer, 0, numRead);
            sb.Append(text);
        }

        return sb.ToString();
    }
}
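
For the queue idea above, a rough sketch of a bounded pipeline might look like the following (not a drop-in design; File.ReadAllText stands in for a reader such as the ReadTextAsync method above, and the capacities and reader count are only starting points to tune):

BlockingCollection<string> paths = new BlockingCollection<string>(boundedCapacity: 1000);
BlockingCollection<string> contents = new BlockingCollection<string>(boundedCapacity: 1000);

// Producer: enqueue every file path, then mark the queue complete.
Task.Run(() =>
{
    foreach (string file in Files) paths.Add(file);
    paths.CompleteAdding();
});

// A small, fixed number of readers - just enough to stay ahead of the consumer.
Task[] readers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
{
    foreach (string file in paths.GetConsumingEnumerable())
    {
        contents.Add(File.ReadAllText(file));
    }
})).ToArray();

// Close the results queue once all readers are done.
Task.WhenAll(readers).ContinueWith(_ => contents.CompleteAdding());

// Consumer (run this off the UI thread): drain results, e.g. batch them for the dispatcher.
foreach (string text in contents.GetConsumingEnumerable())
{
    // ... process text ...
}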
paparazzo
  • Very good point. ReadAsync performance remains to be tested on non-SSD drives in my application. For everyone else: He is saying that, basically, on RPM drives, the penalty for context switching scheduled threads with ReadAsync would be detrimental because these drives require mechanical head movements to read sectors. A note on this though, how detrimental *should* depend on the number of cores you have on that PC because of the functionality of workers within AsParallel, so it may not end up being as bad as we might think since the degree of parallelism is limited. – Alexandru Nov 08 '14 at 17:19
  • @Alexandru It is not about ReadAsync. It is parallel period. – paparazzo Nov 08 '14 at 18:54
  • Yeah, I realize that now from `usr`'s comments on my solution. Whether you Read or ReadAsync, it's the same thing if you're doing it through an AsParallel iterator. – Alexandru Nov 08 '14 at 18:57

File access is inherently not parallel. You can only benefit from parallelism if you process some files while reading others; it makes no sense to wait for the disk in parallel.

Instead of waiting 100,000 times for 1 ms of disk access, your program waits once for 100,000 ms = 100 s - the total time spent waiting on the disk is the same.

Lorenz Meyer
  • But is there no scheduling method that optimizes these read times? I am going to try some tricky stuff with async methods. I mean, ideally, there must be some way to optimize read times. There must be, it must be built in. – Alexandru Nov 08 '14 at 07:44
  • @Alexandru: there is some theoretically possible benefit to having multiple read operations running concurrently, as it allows Windows and the disk controller to schedule reads in a more efficient way. But a) the improvement is not going to be terribly dramatic, and b) if you go too far, you'll wind up with too much contention and less performance. You should be starting with a baseline measurement that you can compare to the theoretical maximum performance given the hardware. Otherwise, there's no way to answer whether what you're seeing is "too slow". – Peter Duniho Nov 08 '14 at 07:47
  • I agree with Peter. This is really what I meant : There is an incompressible time with disk access, that is due to physical limitations and the way the OS and the controller work. – Lorenz Meyer Nov 08 '14 at 07:53
  • In your other question you state that you get the data of 500 000 files and 80 000 directories in 3.5 s, that is 7 ms per file and 37 ms per directory. Compare this to the theoretical random read time of your hard disk. I think you're pretty close to touching the ground. – Lorenz Meyer Nov 08 '14 at 08:01

Unfortunately, it's a vague question without a reproducible code example. So it's impossible to offer specific advice. But my two recommendations are:

  • Pass a ParallelOptions instance where you've set the MaxDegreeOfParallelism property to something reasonably low - something like the number of cores in your system, or even that number minus one (see the sketch after this list).

  • Make sure you aren't expecting too much from the disk. You should start with the known speed of the disk and controller, and compare that with the data throughput you're getting. Adjust the degree of parallelism even lower if it looks like you're already at or near the maximum theoretical throughput.
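
For the first point, a minimal sketch, assuming the loop is switched from AsParallel().ForAll to Parallel.ForEach (the PLINQ equivalent would be WithDegreeOfParallelism):

ParallelOptions options = new ParallelOptions
{
    // Leave roughly one core free for the UI thread; tune this value by experiment.
    MaxDegreeOfParallelism = Math.Max(1, Environment.ProcessorCount - 1)
};

Parallel.ForEach(Files, options, path =>
{
    // the per-file read work from the question goes here
});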

Performance optimization is all about setting realistic goals based on known limitations of the hardware, measuring your actual performance, and then looking at how you can improve the costliest elements of your algorithm. If you haven't done the first two steps yet, you really should start there. :)

Peter Duniho
  • *Something like the number of cores in your system, or even that number minus one.* The CPU will not be the bottleneck. Setting MaxDOP based on that is a bad idea. – ta.speot.is Nov 08 '14 at 08:12
  • The OP is complaining that his UI is being blocked. So something is consuming the available CPU resources. Assuming he's implemented the parallel code correctly (and I grant that without a good code example, this is not guaranteed), that suggests that while the I/O _should_ be the bottleneck, he's somehow managed to keep the CPU busy anyway. And a way around that is to specifically reduce the number of concurrent threads. – Peter Duniho Nov 08 '14 at 08:42

I got it working; the problem was me trying to use an ExtendedObservableCollection with AddRange instead of calling Add multiple times in every UI dispatch. For some reason, the performance of the methods people list here is actually slower in my situation: ObservableCollection doesn't support AddRange method, so I get notified for each item added, besides what about INotifyCollectionChanging?

I think it is because AddRange forces you to raise change notifications with .Reset (reload) instead of .Add (a diff), and there is some sort of logic in place there that causes bottlenecks.

I apologize for not posting the rest of the code; I was really thrown off by this, and I'll explain why in a moment. Also, a note that might help others who come across the same issue: profiling tools don't help much in this scenario, because most of your app's time will be spent reading files regardless, so you have to unit test all of your dispatchers separately.
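
For reference, each UI dispatch now looks roughly like this (a sketch; Items stands in for my bound ObservableCollection, and batch is whatever chunk of results the background work just produced):

// Add items one at a time inside a single dispatcher callback, so the collection
// raises Add (diff) notifications instead of a Reset (reload).
Application.Current.Dispatcher.BeginInvoke(new Action(() =>
{
    foreach (string item in batch)
    {
        Items.Add(item);
    }
}));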

Alexandru
  • Async IO gives you zero perf advantage in this scenario (why would it?). If your code got faster you either changed something else or your measurement is wrong. – usr Nov 08 '14 at 13:40
  • @usr You're right, actually. Performance is the same whether I use ReadAsync or not. Just tried it. I thought for some reason ReadAsync uses some sort of "smart scheduling" for this kind of IO-bound stuff, but I guess not. Will edit the answer, thanks! It was all in the way I was dispatching. – Alexandru Nov 08 '14 at 17:34