The Problem:
- Download a batch of PDF files from pickup.fileserver (SFTP or windows share) to local hard drive (Polling is involved here to check if files are available to download)
- Process (resize, apply barcodes etc) the PDF files, create some metadata files, update database etc
- Upload this batch to dropoff.fileserver (SFTP)
- Await response from dropoff.fileserver (Again polling is the only option). Once the batch response is available, download it local HD.
- Parse the batch response, update database and finally upload report to pickup.fileserver
- Archive all batch files to a SAN location and go back to step 1.
The Current Solution
We are expecting many such batches so we have created a windows service which can keep polling at certain time intervals and perform the steps mentioned above. It takes care of one batch at a time.
The Concern
The current solution works file, however, I'm concerned that it is NOT making best use of available resources, there is certainly a lot of room for improvement. I have very little idea about how I can scale this windows service to be able to process as many batches simultaneously as it can. And then if required, how to involve multiple instances of this windows service hosted on different servers to scale further.
I have read some MSDN articles and some SO answers on similar topics. There are suggestions about using producer-consumer patterns (BlockingCollectiong<T>
etc.) Some say that it wouldn't make sense to create multi-threaded app for IO intensive tasks. What we have here is a mixture of disk + network + processor intensive tasks. I need to understand how best to use threading or any other technology to make best use of available resources on one server and go beyond one server (if required) to scale further.
Typical Batch Size
We regularly get batches of 200~ files, 300 MB~ total size. # of batches can grow to about 50 to 100, in next year or two. A couple of times in a year, we get batches of 5k to 10k files.