3

I'm running a CPU with 24 threads (a 5900X) and spinning up 20 tasks to do work that should be entirely CPU bound, yet the CPU load peaks at 10%. I'm trying to see if someone can shed some light on whether this is me misunderstanding how tasks are scheduled onto threads, or whether the library doing the processing (HtmlAgilityPack) is at fault.

Here is a somewhat convoluted example:

public async static Task TestHtmlAgilityPack(bool loadHtml = true)
{
    // "basePath" is a folder has approx 20 folders each containing approx 3000 files (20 tasks * 3,000 files = 60k overall)
    var dirs = Directory.GetDirectories(basePath);
    List<Task> tasks = new();
    var strs = new ConcurrentBag<string>();
    foreach (var dir in dirs)
    {
        tasks.Add(Task.Run(() =>
        {
            foreach (var file in Directory.GetFiles(dir, "*.html")) // Each of the 20 tasks processes approx 3000 files
            {
                var html = File.ReadAllText(file);
                strs.Add(html.Substring(1, 1000));
                if (loadHtml)
                {
                    var doc = new HtmlDocument();
                    doc.LoadHtml(html);
                }
            }
        }));
    }
    await Task.WhenAll(tasks);
    Console.WriteLine(strs.Last());
}

If I run it without LoadHtml it completes in 15 seconds, so the IO access time is insignificant. With LoadHtml it takes 20 minutes. I understand parsing HTML into a queryable form takes time; that's fine and expected. What's confusing is that it should be a purely CPU-intensive operation; it isn't waiting for anything. So why does the CPU peak at 10% rather than using closer to the ~80% you'd expect from loading up 20 threads with a CPU-intensive operation on a 24-thread CPU?

Would this indicate an inefficiency in the LoadHtml method or something else?

Zak123
  • You are not using "ASYNC", so each task is blocking until the task is completed. – jdweng Jul 31 '22 at 09:46
  • @jdweng Not using async where? Each of the 20 tasks is supposed to be doing one long CPU-intensive operation (processing 3000 files) on its own thread. I don't see why they would block each other during the run, only at the end when I wait for them all to finish, which is what I want? – Zak123 Jul 31 '22 at 09:59
  • Tasks do not run async automatically. The Tasks are separate threads but run to completion before the next thread is started. See the following: https://stackoverflow.com/questions/20304258/run-async-method-on-a-background-thread – jdweng Jul 31 '22 at 10:02
  • @jdweng The accepted answer for the link you gave specifically says `Task.Run` does run the operation in a separate thread? It's not running the tasks one after the other; I can see which order it processes the files in, and all tasks are running at once. Any chance you could give an example of what you think should be done, because I still don't see why these should block each other? – Zak123 Jul 31 '22 at 10:11
  • You need to use WaitAll. See: https://stackoverflow.com/questions/54333684/how-to-make-use-task-waitall-with-async-method-using-c-sharp – jdweng Jul 31 '22 at 10:13
  • `Task.WaitAll` takes the same time (20 mins) as `await Task.WhenAll` in my example above. I'm already awaiting the `TestHtmlAgilityPack` when I call it for the test so whether it uses `await Task.WhenAll` or `Task.WaitAll` shouldn't matter – Zak123 Jul 31 '22 at 10:21
  • https://github.com/zzzprojects/html-agility-pack/issues/191 – Hans Passant Jul 31 '22 at 11:24
  • @jdweng comments are intended for asking for any missing information in the question, [not for answering it](https://prnt.sc/RfK7kkKxohGr "Use comments to ask for more information or suggest improvements. Avoid answering questions in comments."). If you have an answer for this question, you could post it as an answer, so that it can be evaluated (voted) according to its merits. – Theodor Zoulias Jul 31 '22 at 12:41
  • @HansPassant Thank you!! My google-fu must have been weak; I didn't find that issue. Not sure what the full consequences of this change are, but I switched the garbage collector to server from the default and now it's 10-15x faster (see the config sketch after these comments). The threaded HTML analysis task I run daily that used to take 37 minutes now takes 3 minutes. I was hoping to get a bit of a speed increase when asking this question but didn't expect anywhere near this! Thanks again – Zak123 Jul 31 '22 at 13:38
  • @Zak123 the example isn't convoluted but does demonstrate what's wrong here. You're combining both IO and CPU operations in the same task. Reading files doesn't require a lot of IO but *does* block the thread. If you used asynchronous operations you'd avoid this. As a general rule you need to separate IO from CPU workloads. You can't execute more CPU-bound tasks at the same time than you have cores, but you *can* easily do that with IO-bound *async* tasks. – Panagiotis Kanavos Aug 06 '22 at 09:20
  • @Zak123 another problem is the high RAM usage of your code, not the garbage collector. `File.ReadAllText(file)` reads everything into memory (in a blocking way). Then `html.Substring(1, 1000)` generates a *new* string for every file, when all you needed was to skip 1 character and read the next 1000. Which is a rather weird thing to do, and it produces invalid HTML from valid HTML files – Panagiotis Kanavos Aug 06 '22 at 09:23
  • @Zak123 what's the *actual* problem you want to solve? I can point to several duplicates that show how to fix the question's problem but the use of eg `strs.Last()` or `Substring(1,1000)` suggest the real problem is more complex. You could use a `ConcurrentStack` instead of a `ConcurrentQueue` to get only the last page, but why parse all of them in that case? – Panagiotis Kanavos Aug 06 '22 at 09:34
  • @PanagiotisKanavos Sorry, only just saw this. The actual code I'm using is nothing like the example I gave above; that was just a barebones example of how I could reproduce the bottleneck, so perhaps 'convoluted' was the wrong word. The real code is caching 1mil+ pages into SQL through the night and then performing analysis on them later in the day. Just HtmlAgilityPack xpath queries, no substrings / concurrent storage, etc. The IO is fast; the bottleneck purely seems to be coming from the initial memory-inefficient HtmlAgilityPack tree parse. Thanks for your reply though, still good information – Zak123 Aug 15 '22 at 16:57
  • The workaround of switching to the server GC is a good enough band-aid for now (it immediately gave a 15x boost in performance, which brought it in line with where I thought it should be). Eventually I'll look into the HtmlAgilityPack library and see if I can make the parse more memory efficient; at that point I expect your info about ArrayPool / efficient buffers will come in very handy – Zak123 Aug 15 '22 at 17:07
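
For reference, a minimal sketch of the Server GC switch mentioned in the comments above, assuming an SDK-style .csproj (on .NET Framework the equivalent is <gcServer enabled="true"/> under <runtime> in app.config):

<PropertyGroup>
  <!-- Opt in to the server garbage collector instead of the default workstation GC -->
  <ServerGarbageCollection>true</ServerGarbageCollection>
</PropertyGroup>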

1 Answer

There are several issues with this code that limit its scalability.

  • It performs both IO and CPU work in the same task without using asynchronous methods. The number of CPU-bound tasks you can run at the same time is limited by the number of cores; you can run far more async tasks than that.
  • The IO is blocking, not asynchronous. The task (or rather its thread) sits idle while waiting for the OS to retrieve the data (see the sketch after this list).
  • The code reads too much data and generates too many temporary objects. ReadAllText reads the entire file when only 1000 characters are needed, and since strings are immutable, html.Substring(1,1000) generates yet another string. All of these take up memory and have to be garbage-collected at some point.
  • ConcurrentBag isn't a general-use concurrent collection like ConcurrentQueue or ConcurrentDictionary. It uses thread-local storage so that the thread that created an item can retrieve it faster than other threads, which isn't what this code needs.
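
As a minimal sketch of the first two points (an assumption-laden simplification: it uses .NET 6's Parallel.ForEachAsync and the question's basePath, and it still reads whole files, so the allocation issues above remain), the loop could read asynchronously with a capped degree of parallelism instead of blocking 20 thread-pool threads:

var strs = new ConcurrentQueue<string>();
await Parallel.ForEachAsync(
    Directory.EnumerateFiles(basePath, "*.html", SearchOption.AllDirectories),
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    async (file, ct) =>
    {
        var html = await File.ReadAllTextAsync(file, ct); // asynchronous IO, no blocked thread
        strs.Enqueue(html.Substring(1, 1000));
        var doc = new HtmlDocument();
        doc.LoadHtml(html);                               // CPU-bound parse
    });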

.NET offers several high-level classes for building processing pipelines far more complex than this load-parse-import scenario: Dataflow blocks, Channels and async streams (IAsyncEnumerable).
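
For example, the same idea can be expressed with a bounded Channel in a few lines. This is only a sketch: the capacity of 16 and the 4 consumer tasks are illustrative assumptions, and basePath is the question's folder:

// using System.Threading.Channels; requires .NET Core 3.0+
var channel = Channel.CreateBounded<string>(16);   // bounded capacity provides back-pressure

// Producer: enumerate the HTML files under basePath
var producer = Task.Run(async () =>
{
    foreach (var file in Directory.EnumerateFiles(basePath, "*.html", SearchOption.AllDirectories))
        await channel.Writer.WriteAsync(file);
    channel.Writer.Complete();
});

// Consumers: a fixed number of CPU-bound parsers reading from the channel
var consumers = Enumerable.Range(0, 4).Select(_ => Task.Run(async () =>
{
    await foreach (var file in channel.Reader.ReadAllAsync())
    {
        var doc = new HtmlDocument();
        doc.Load(file);   // parse the file from disk
    }
})).ToArray();

await producer;
await Task.WhenAll(consumers);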

One way to improve the question's code is to use Dataflow blocks to enumerate the root folder, load the file contents and parse them in separate blocks, each with its own degree of parallelism (DOP).

First, the crawling, loading and parsing code can be extracted into separate methods:

record Search(DirectoryInfo root,string pattern);
record Sample(FileInfo file,string sample);
record SampleHtml(FileInfo file,HtmlDocument html);

IEnumerable<FileInfo> Crawl(Search search)
{
    var (root,pattern)=search;
    var searchOptions=new EnumerationOptions { 
        RecurseSubdirectories=true,
        IgnoreInaccessible=true
    };
    return root.EnumerateFiles(pattern,searchOptions);
}

async Task<Sample> ReadSample(FileInfo file,int length)
{
    var buffer=new char[length+1];
    using var reader=file.OpenText();
    var count=await reader.ReadBlockAsync(buffer,0,length+1);
    var html= new String(buffer,1,count-1);
    return new Sample(file,html);
}

SampleHtml ParseSample(Sample sample)
{
    var html=new HtmlDocument();
    html.LoadHtml(sample.sample);
    return new SampleHtml(sample.file,html);
}

Dataflow blocks can be used to create a pipeline of:

  1. A single-threaded file search block
  2. A loader block loading 2 files at a time
  3. A parser block parsing 4 samples at a time
  4. A result BufferBlock collecting the output of the parser

var loadOptions=new ExecutionDataflowBlockOptions{ 
    MaxDegreeOfParallelism=2,
    BoundedCapacity=1
};
var parseOptions=new ExecutionDataflowBlockOptions{ 
    MaxDegreeOfParallelism=4,
    BoundedCapacity=1
};

var crawler=new TransformManyBlock<Search,FileInfo>(search=>
    Crawl(search));

var loader=new TransformBlock<FileInfo,Sample>(file=>
    ReadSample(file,1000),loadOptions);

var parser=new TransformBlock<Sample,SampleHtml>(sample=>
    ParseSample(sample),parseOptions);

var results=new BufferBlock<SampleHtml>();

var linkOptions=new DataflowLinkOptions {
    PropagateCompletion = true
};
crawler.LinkTo(loader,linkOptions);
loader.LinkTo(parser,linkOptions);
//Don't propagate completion, we just cache results here
parser.LinkTo(results);

To use the pipeline, we post a search specification to the head block, crawler, and await until the final block, parser, has completed processing everything:

var root=new DirectoryInfo(path_to_root);
var pattern="*.html";
await crawler.SendAsync(new Search(root,pattern));
crawler.Complete();
await parser.Completion;

At this point results contains all the results. We can use TryReceive to pop items one by one, or TryReceiveAll to read everything into a container:

if(results.TryReceiveAll(out var docs))
{
    var last=docs[^1];
}

The loader and parser blocks have a BoundedCapacity of 1. This means their input buffers will only accept a single item beyond those being processed. Any upstream blocks will have to wait before posting new items, all the way up to the crawler. This prevents filling the memory with objects that can't be processed fast enough.

Reusing buffers

The ArrayPool class can provide reusable buffers and thus avoid creating a new char[1001] buffer for each file. With a loader DOP of 2, this means only a couple of buffers are needed instead of 3000:

async Task<Sample> ReadSample(FileInfo file,int length)
{
    var buffer=ArrayPool<char>.Shared.Rent(length+1);
    try
    {
        using var reader=file.OpenText();
        var count=await reader.ReadBlockAsync(buffer,0,length+1);
        var html= new String(buffer,1,count-1);
        return new Sample(file,html);
    }
    finally
    {
        ArrayPool<char>.Shared.Return(buffer);
    }
}

This leaves the 3000 1000-char string objects. These can be eliminated if the loader and parser are modified to pass byte[] buffers instead of strings. HtmlAgilityPack's Load can read from any Stream or StreamReader which means it can also load from a MemoryStream wrapping a buffer.

The only problem is that UTF8 uses a variable number of bytes per character, so it's not possible to know in advance how many bytes are needed to read exactly 1000 characters. If the HTML files are expected to contain a lot of non-English text, ReadSample's length will have to be increased.

record Sample(FileInfo file,byte[] buffer,int count);

async Task<Sample> ReadSample(FileInfo file,int length)
{
    var buffer=ArrayPool<byte>.Shared.Rent(length+1);
    using var stream=file.OpenRead();
    var count=await stream.ReadAsync(buffer,0,length+1);
    return new Sample(file,buffer,count);
}

SampleHtml ParseSample(Sample sample)
{
    try
    {
        var html=new HtmlDocument();
        using var ms=new MemoryStream(sample.buffer,1,sample.count-1);
        html.Load(ms);
        return new SampleHtml(sample.file,html);
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(sample.buffer);
    }
}

Panagiotis Kanavos