Given:
- Hundreds of .NET projects
- Thousands of C# files across all the projects
- A string literal
I want to output all the matches of the given literal in all the files across all the projects. I would like to use this example to understand how to optimize the performance of a simple TPL Dataflow pipeline.
The complete code is committed on GitHub: https://github.com/MarkKharitonov/LearningTPLDataFlow/blob/master/FindStringCmd.cs
The pipeline itself is:
private void Run(string workspaceRoot, string literal, int maxDOP1 = 1, int maxDOP2 = 1)
{
    var projects = (workspaceRoot + "build\\projects.yml").YieldAllProjects();

    var produceCSFiles = new TransformManyBlock<ProjectEx, CSFile>(YieldCSFiles,
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxDOP1 });
    var produceMatchingLines = new TransformManyBlock<CSFile, MatchingLine>(
        csFile => csFile.YieldMatchingLines(literal),
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxDOP2 });
    var getMatchingLines = new ActionBlock<MatchingLine>(o => Console.WriteLine(o.ToString(workspaceRoot)));

    var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
    produceCSFiles.LinkTo(produceMatchingLines, linkOptions);
    produceMatchingLines.LinkTo(getMatchingLines, linkOptions);

    Console.WriteLine($"Locating all the instances of {literal} in the C# code ...");
    var sw = Stopwatch.StartNew();

    projects.ForEach(p => produceCSFiles.Post(p));
    produceCSFiles.Complete();
    getMatchingLines.Completion.Wait();

    sw.Stop();
    Console.WriteLine(sw.Elapsed);
}
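For readers without the rest of the repository, the two-stage shape above can be reproduced in a minimal, self-contained sketch. The ProjectEx, CSFile and MatchingLine types are replaced by plain strings here (an assumption for illustration, not the repository's actual code); it only needs the System.Threading.Tasks.Dataflow namespace, which is included in modern .NET and otherwise available as a NuGet package:

```csharp
using System;
using System.Threading.Tasks.Dataflow;

class PipelineSketch
{
    static void Main()
    {
        // Stage 1: one "project" fans out into several "files".
        var produceFiles = new TransformManyBlock<string, string>(
            project => new[] { project + "/a.cs", project + "/b.cs" },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 2 });

        // Stage 2: one "file" fans out into its matching "lines".
        var produceLines = new TransformManyBlock<string, string>(
            file => new[] { file + ": line 1", file + ": line 2" },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 2 });

        // Terminal stage: print every match.
        var print = new ActionBlock<string>(Console.WriteLine);

        // With PropagateCompletion, completing the head block eventually
        // completes the tail block once all in-flight items drain.
        var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
        produceFiles.LinkTo(produceLines, linkOptions);
        produceLines.LinkTo(print, linkOptions);

        produceFiles.Post("proj1");
        produceFiles.Post("proj2");
        produceFiles.Complete();
        print.Completion.Wait();  // 2 projects x 2 files x 2 lines = 8 lines printed
    }
}
```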
Here are some notes:
- It is very cheap to obtain ProjectEx objects.
- Accessing the ProjectEx.MSBuildProject property for the very first time is quite expensive. This is where the Microsoft Build API evaluates the respective .csproj file.
- After the evaluation, getting the list of C# files is very cheap, but processing them all is quite expensive, because there are so many of them.
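The "expensive on first access, cheap afterwards" behavior described above is the classic lazy-initialization pattern. A hypothetical sketch of how such a ProjectEx could be structured (the type and its evaluation step are stand-ins, not the repository's actual implementation):

```csharp
using System;

// Hypothetical stand-in for the real ProjectEx: construction is cheap,
// the first access to MSBuildProject triggers the expensive evaluation,
// and every later access returns the cached result.
class ProjectEx
{
    private readonly Lazy<string> _evaluated;

    public ProjectEx(string csprojPath)
    {
        // Cheap: nothing is evaluated yet.
        _evaluated = new Lazy<string>(() =>
        {
            Console.WriteLine($"Evaluating {csprojPath} ...");  // the expensive step
            return $"<evaluated {csprojPath}>";
        });
    }

    public string MSBuildProject => _evaluated.Value;
}

class Demo
{
    static void Main()
    {
        var p = new ProjectEx("Foo.csproj");  // cheap
        Console.WriteLine(p.MSBuildProject);  // expensive: runs the evaluation
        Console.WriteLine(p.MSBuildProject);  // cheap: returns the cached value
    }
}
```

Lazy&lt;T&gt; is thread-safe by default, which matters once multiple Dataflow workers may touch the same project concurrently.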
I am not sure how to depict the pipeline graphically here, but:
- produceCSFiles is fed cheap ProjectEx objects and outputs a lot of CSFile objects, which is expensive due to project evaluation.
- produceMatchingLines is fed CSFile objects and outputs the matching lines, which is expensive due to the sheer quantity of CSFile objects and the number of lines to process.
My question - is my implementation optimal? I have doubts, because increasing maxDOP1 and maxDOP2 does not yield much of an improvement:
C:\work\TPLDataFlow [master ≡ +0 ~2 -0 !]> 1..4 |% { $MaxDOP1 = $_ ; 1..4 } |% { $MaxDOP2 = $_ ; $res = .\bin\Debug\net5.0\TPLDataFlow.exe find-string -d C:\dayforce\tip -l GetClientLegalPromptFlag --maxDOP1 $MaxDOP1 --maxDOP2 $MaxDOP2 -q ; "$MaxDOP1 x $MaxDOP2 --> $res" }
1 x 1 --> Elapsed: 00:00:21.1683002
1 x 2 --> Elapsed: 00:00:19.8194133
1 x 3 --> Elapsed: 00:00:20.2626202
1 x 4 --> Elapsed: 00:00:20.4339065
2 x 1 --> Elapsed: 00:00:17.6475658
2 x 2 --> Elapsed: 00:00:15.4889941
2 x 3 --> Elapsed: 00:00:14.9014116
2 x 4 --> Elapsed: 00:00:14.9254166
3 x 1 --> Elapsed: 00:00:17.6474953
3 x 2 --> Elapsed: 00:00:14.4933295
3 x 3 --> Elapsed: 00:00:14.2419329
3 x 4 --> Elapsed: 00:00:14.1185203
4 x 1 --> Elapsed: 00:00:19.0717189
4 x 2 --> Elapsed: 00:00:15.9069517
4 x 3 --> Elapsed: 00:00:16.3267676
4 x 4 --> Elapsed: 00:00:17.0876474
C:\work\TPLDataFlow [master ≡ +0 ~2 -0 !]>
What I see is:
- Max improvement is with maxDOP1 == 3 and maxDOP2 == 4 - 14.12 sec vs 21.17 sec
- Max ROI is with maxDOP1 == 2 and maxDOP2 == 3 - 15 sec vs 21.17 sec
All in all, that is just a ~30% improvement over the single-threaded version. This is a bit disappointing, because all the files are on an SSD and I have 12 logical processors. And, of course, the code is much more complicated.
Am I missing anything? Maybe I am not doing it in an optimal fashion?