How to use multithreading or any other .NET technology to scale a program performing network, disk and processor intensive jobs?

Question

The Problem:

1. Download a batch of PDF files from pickup.fileserver (SFTP or Windows share) to the local hard drive. (Polling is involved here to check whether files are available to download.)
2. Process (resize, apply barcodes, etc.) the PDF files, create some metadata files, update the database, etc.
3. Upload this batch to dropoff.fileserver (SFTP).
4. Await a response from dropoff.fileserver (again, polling is the only option). Once the batch response is available, download it to the local hard drive.
5. Parse the batch response, update the database and finally upload a report to pickup.fileserver.
6. Archive all batch files to a SAN location and go back to step 1.

The Current Solution

We are expecting many such batches, so we have created a Windows service that polls at fixed time intervals and performs the steps above. It handles one batch at a time.

The Concern

The current solution works fine. However, I'm concerned that it is NOT making the best use of available resources; there is certainly a lot of room for improvement. I have very little idea of how to scale this Windows service so that it processes as many batches simultaneously as it can, and then, if required, how to involve multiple instances of this Windows service hosted on different servers to scale further.

I have read some MSDN articles and some SO answers on similar topics. There are suggestions about using producer-consumer patterns (BlockingCollection<T> etc.). Some say that it wouldn't make sense to create a multi-threaded app for IO-intensive tasks. What we have here is a mixture of disk-, network- and processor-intensive tasks. I need to understand how best to use threading or any other technology to make the best use of available resources on one server, and how to go beyond one server (if required) to scale further.

Typical Batch Size

We regularly get batches of about 200 files, about 300 MB in total. The number of batches can grow to about 50 to 100 in the next year or two. A couple of times a year, we get batches of 5k to 10k files.

You should use `Task`. Tasks are more efficient than threads at using multiple processors. – MKR, Aug 08 '17 at 04:26

score 2 · Answer 1 · answered Aug 08 '17 at 04:23

As you say, what you have is a mixture of tasks, and it's probably going to be hard to implement a single pipeline that optimizes all your resources. I would look at breaking this down into 6 services (one per step) that can then be tuned, multiplied or multi-threaded to provide the throughput you need.

Your sources are probably correct that you're not going to improve performance of your network tasks much by multithreading them. By breaking your application into several services, your resizing and barcoding service can start processing a file as soon as it's done downloading, while the download service moves on to downloading the next file.
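
Within a single process, the hand-off described above (processing one file while the next is still downloading) can be sketched with a bounded `BlockingCollection<T>`. This only illustrates the idea and is not the separate-services design itself. `DownloadFile` and `ProcessFile` are hypothetical stand-ins for the real SFTP and PDF work, and the capacity of 4 is an arbitrary assumption.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class HandOffSketch
{
    // Hypothetical stand-in for the SFTP download step.
    static string DownloadFile(string name) => name + ".downloaded";

    // Hypothetical stand-in for the resize/barcode step.
    static string ProcessFile(string path) => path + ".processed";

    public static List<string> Run(IEnumerable<string> fileNames)
    {
        var results = new List<string>();

        // Bounded (assumed capacity: 4) so the downloader cannot run far ahead of the processor.
        using (var downloaded = new BlockingCollection<string>(4))
        {
            // Producer: downloads files and hands each one over as soon as it is ready.
            var producer = Task.Run(() =>
            {
                try
                {
                    foreach (var name in fileNames)
                        downloaded.Add(DownloadFile(name));
                }
                finally
                {
                    // Always signal completion so the consumer loop can end.
                    downloaded.CompleteAdding();
                }
            });

            // Consumer: processes files while later downloads are still in progress.
            var consumer = Task.Run(() =>
            {
                foreach (var path in downloaded.GetConsumingEnumerable())
                    results.Add(ProcessFile(path));
            });

            Task.WaitAll(producer, consumer);
        }

        return results;
    }
}
```

Calling `HandOffSketch.Run(new[] { "a.pdf", "b.pdf" })` returns the processed names in download order, because there is a single consumer.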

score 1 · Accepted Answer · answered Aug 09 '17 at 19:36

> The current solution works fine

Then keep it. That's my $0.02. Who cares if it's not terribly efficient? As long as it is efficient enough, then why change it?

That said...

> I need to understand how best to use threading or any other technology to make best use of available resources on one server

If you want a new toy, I'd recommend using TPL Dataflow. It is designed specifically for wiring up pipelines that contain a mixture of I/O-bound and CPU-bound steps. Each step can be independently parallelized, and TPL Dataflow blocks understand asynchronous code, so they also work well with I/O.
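
As a rough sketch of what such a pipeline might look like, the following wires three blocks together: an asynchronous download step, a CPU-bound processing step, and an asynchronous upload step. `DownloadAsync`, `Process` and `UploadAsync` are hypothetical stand-ins for the real work, and the parallelism and capacity values are arbitrary assumptions that would need tuning. On .NET Framework the blocks come from the `System.Threading.Tasks.Dataflow` NuGet package.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BatchPipelineSketch
{
    // Hypothetical stand-in for the SFTP download; the delay simulates network I/O.
    static async Task<string> DownloadAsync(string name)
    {
        await Task.Delay(10);
        return name + ".downloaded";
    }

    // Hypothetical stand-in for the CPU-bound resize/barcode work.
    static string Process(string path) => path + ".processed";

    // Hypothetical stand-in for the SFTP upload; records what was "uploaded".
    static async Task UploadAsync(string path, ConcurrentBag<string> uploaded)
    {
        await Task.Delay(10);
        uploaded.Add(path);
    }

    public static async Task<ConcurrentBag<string>> RunAsync(string[] fileNames)
    {
        var uploaded = new ConcurrentBag<string>();

        // I/O-bound step: a few concurrent downloads (4 is an assumed value).
        var download = new TransformBlock<string, string>(
            async name => await DownloadAsync(name),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4, BoundedCapacity = 16 });

        // CPU-bound step: up to one worker per core.
        var process = new TransformBlock<string, string>(
            path => Process(path),
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = Environment.ProcessorCount,
                BoundedCapacity = 16
            });

        // I/O-bound step: a couple of concurrent uploads (2 is an assumed value).
        var upload = new ActionBlock<string>(
            async path => await UploadAsync(path, uploaded),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 2, BoundedCapacity = 16 });

        // Completion flows down the pipeline so the final block knows when to finish.
        var link = new DataflowLinkOptions { PropagateCompletion = true };
        download.LinkTo(process, link);
        process.LinkTo(upload, link);

        foreach (var name in fileNames)
            await download.SendAsync(name); // waits while the bounded input buffer is full

        download.Complete();
        await upload.Completion;
        return uploaded;
    }
}
```

The bounded capacities provide back-pressure: if uploads are slow, processing and then downloading pause rather than piling files up in memory.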

> and go beyond one server (if required) to scale further.

That's a totally different question. You'd need to use reliable queues and break the different steps into different processes, which can then run anywhere. This is a good place to start.

Stephen Cleary


Thanks. Would it help if I split it into two services, where one takes care of downloading/uploading and the other takes care of the CPU-intensive work? And in future, if required, scale it using the producer-consumer pattern? – Ravi M Patel Aug 09 '17 at 19:49
I would say just use TPL Dataflow in a single service for now. Then use queues when you're ready to scale out. – Stephen Cleary Aug 09 '17 at 19:56
OK. Would you please look at this if you have time? Maybe I don't need it now, but I'd still like to know out of curiosity. https://stackoverflow.com/q/45585585/3317709 – Ravi M Patel Aug 09 '17 at 20:01

score 1 · Answer 3 · answered Mar 06 '20 at 20:22

According to this article, you could implement background worker jobs (preferably with Hangfire) in your application layer. This reduces the code and deployment overhead of managing multiple Windows services while possibly achieving the same result.

You also won't need to handle multiple Windows services yourself. Additionally, jobs can be restored after an application-level failure or a restart.

score 0 · Answer 4 · answered Aug 08 '17 at 04:53

There is no magic technology that will solve your problem; you need to analyse each part of it step by step.

You will need to profile the application, determine which areas perform slowly, and refactor the code to resolve the problem.

This might mean increasing the demand on one resource to decrease demand on another. For example, you might find that you are doing a database lookup 10 times for each file you process, and that caching the data before you start processing files is quicker, though maybe only if the batch is larger than xx files.
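
The caching idea can be sketched as follows. This is a lazy variant that caches a value on first use rather than preloading everything before the batch starts, and `queryDatabase` is a hypothetical delegate standing in for the real lookup. It is not thread-safe as written; if files are processed in parallel, a `ConcurrentDictionary` would be the safer choice.

```csharp
using System;
using System.Collections.Generic;

public sealed class CachedLookup
{
    // Hypothetical per-key database query supplied by the caller.
    private readonly Func<string, string> _queryDatabase;
    private readonly Dictionary<string, string> _cache = new Dictionary<string, string>();

    // Counts real database hits so the saving can be measured.
    public int DatabaseCalls { get; private set; }

    public CachedLookup(Func<string, string> queryDatabase)
    {
        _queryDatabase = queryDatabase;
    }

    // Returns the cached value, querying the database only on the first request for a key.
    public string Get(string key)
    {
        if (!_cache.TryGetValue(key, out var value))
        {
            DatabaseCalls++;
            value = _queryDatabase(key);
            _cache[key] = value;
        }
        return value;
    }
}
```

Comparing `DatabaseCalls` with the number of `Get` calls for a typical batch gives a quick indication of whether the cache is worth keeping.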

You might find that what increases the processing speed of the whole batch is not the optimal method for a single file.

As your program has multiple steps, you can look at each of these in turn, and at the process as a whole.

My guess would be that the SFTP download and upload take the most time, so you can look at running these in parallel. Whether this means running xx threads at once, each processing a file, or having a separate task/thread for each stage in your process, you can only determine by testing.
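
One way to try the "xx transfers at once" option is to throttle asynchronous transfers with a `SemaphoreSlim`. This is a minimal sketch under stated assumptions: `transferAsync` is a hypothetical delegate for the real download or upload call, and `maxConcurrent` is the value you would vary while testing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledTransfers
{
    // Runs transferAsync for every file, with at most maxConcurrent transfers in flight.
    public static async Task<string[]> RunAsync(
        IEnumerable<string> fileNames,
        Func<string, Task<string>> transferAsync, // hypothetical download or upload call
        int maxConcurrent)
    {
        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            var tasks = fileNames.Select(async name =>
            {
                await gate.WaitAsync(); // wait for a free slot
                try
                {
                    return await transferAsync(name);
                }
                finally
                {
                    gate.Release(); // free the slot even if the transfer fails
                }
            }).ToList(); // ToList starts every task; the semaphore limits how many proceed

            // Results come back in the same order as the input file names.
            return await Task.WhenAll(tasks);
        }
    }
}
```

Timing a full batch with a few different `maxConcurrent` values is a simple way to find the point where extra parallel transfers stop helping.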

A good design is critical for performance. But there are limits and sometimes it just takes time to do some tasks.

Don’t forget that you must weigh the benefit against the time and effort needed to implement this. If the service runs overnight and takes 6 hours, is it really a benefit if it takes 4 hours instead, when the people who need to work on the result will not be in the office until much later anyway?

score 0 · Answer 5 · edited Aug 08 '17 at 06:39

For this kind of problem, do you have any specific file types that you download from the SFTP server? I have a similar problem downloading large files, but in my case it is not a Windows service; it is an EXE that runs on System.Timers.

Try creating a thread for each file type that is large in size, e.g. PDFs.
You can check for these file types while reading the SFTP file path and assign them to a dedicated thread for download.

You also need to do the same in reverse when uploading the files.

In my case, all I was able to do was tweak the existing program and create a separate thread for the large file types. That solved my problem, as flat files and large PDF files are now downloaded on parallel threads.
