I am processing PDFs of vastly varying sizes (from simple 2 MB files to high-DPI scans of a few hundred MB) via a Parallel.ForEach and occasionally hit an OutOfMemoryException. This is understandable: the process is 32-bit, and the threads spawned by the Parallel.ForEach each consume an unpredictable amount of memory while working.
Restricting MaxDegreeOfParallelism does work, but the throughput is then insufficient when there is a large batch (10k+) of small PDFs, since more threads could be working given the small memory footprint of each. This is a CPU-heavy process, with Parallel.ForEach easily reaching 100% CPU before hitting the occasional group of large PDFs and throwing an OutOfMemoryException. Running the Performance Profiler backs this up.
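A minimal sketch of the static cap described above, for reference. The names StaticCapDemo, Run and ProcessPdf are hypothetical placeholders (the real processing step is not shown in this post), and the peak counter exists only to make the effect of the cap observable.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class StaticCapDemo
{
    // Hypothetical stand-in for the real, memory-hungry PDF processing step.
    private static void ProcessPdf(string path) => Thread.Sleep(10);

    // Processes the batch with a fixed worker cap and returns the peak
    // number of loop bodies that were running at the same time.
    public static int Run(IEnumerable<string> paths, int cap)
    {
        int current = 0, peak = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = cap };

        Parallel.ForEach(paths, options, path =>
        {
            int now = Interlocked.Increment(ref current);

            // Lock-free "peak = max(peak, now)".
            int seen;
            while (now > (seen = Volatile.Read(ref peak)) &&
                   Interlocked.CompareExchange(ref peak, now, seen) != seen) { }

            try { ProcessPdf(path); }
            finally { Interlocked.Decrement(ref current); }
        });

        return peak;
    }
}
```

The cap is the same for every item, which is exactly the limitation: a value low enough to survive a run of large scans is lower than a batch of small PDFs needs.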
From my understanding, having a partitioner for my Parallel.ForEach won't improve my performance.
This leads me to using a custom TaskScheduler passed to my Parallel.ForEach with a MemoryFailPoint check. Searching around, there seems to be scarce information on creating custom TaskScheduler objects.
Looking between Specialized Task Schedulers in .NET 4 Parallel Extensions Extras, A custom TaskScheduler in C# and various answers here on Stack Overflow, I've created my own TaskScheduler and have my QueueTask method as such:
    protected override void QueueTask(Task task)
    {
        lock (tasks) tasks.AddLast(task);
        try
        {
            using (MemoryFailPoint memFailPoint = new MemoryFailPoint(600))
            {
                if (runningOrQueuedCount < maxDegreeOfParallelism)
                {
                    runningOrQueuedCount++;
                    RunTasks();
                }
            }
        }
        catch (InsufficientMemoryException e)
        {
            // somehow return thread to pool?
            Console.WriteLine("InsufficientMemoryException");
        }
    }
While the try/catch is a little expensive, my goal here is to detect when a PDF of the probable maximum size (600 MB, including a little extra memory overhead) would throw an OutOfMemoryException. This solution, though, seems to kill off the thread attempting to do the work when I catch the InsufficientMemoryException. With enough large PDFs, my code ends up being a single-threaded Parallel.ForEach.
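To make the MemoryFailPoint check concrete in isolation, here is a minimal sketch. It assumes the reservation should be held for as long as the work runs and that a failed check should report back to the caller rather than drop the work. MemoryGate and TryRun are hypothetical names, not part of the scheduler above.

```csharp
using System;
using System.Runtime;

public static class MemoryGate
{
    // Runs `work` only if roughly `megabytes` MB appear to be available.
    // Returns false, without running the work, when the check fails.
    public static bool TryRun(int megabytes, Action work)
    {
        MemoryFailPoint gate;
        try
        {
            // The constructor throws InsufficientMemoryException
            // when the requested amount cannot be reserved.
            gate = new MemoryFailPoint(megabytes);
        }
        catch (InsufficientMemoryException)
        {
            return false; // caller decides whether to wait, retry or re-queue
        }

        using (gate)
        {
            work(); // the reservation is released only after the work finishes
        }
        return true;
    }
}
```

A call such as `MemoryGate.TryRun(600, () => Console.WriteLine("processing"))` either runs the delegate and returns true or returns false with the work still pending. What to do with the pending work in that case is the open part of this question.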
Other questions found on Stack Overflow about Parallel.ForEach and OutOfMemoryExceptions don't appear to suit my use case of maximum throughput with dynamic per-thread memory usage, and often just use MaxDegreeOfParallelism as a static solution, e.g.:
- Parallel.For System.OutOfMemoryException
- Parallel.ForEach can cause a “Out Of Memory” exception if working with a enumerable with a large object
So to have maximum throughput for variable working memory sizes, either:
- How do I return a thread back into the threadpool when it has been denied work via the MemoryFailPoint check?
- How/where do I safely spawn new threads to pick up work again when there is free memory?
Edit: The PDF size on disk may not correlate linearly with its size in memory, because of the rasterization and rasterized-image manipulation steps, whose memory use depends on the PDF content.