I need to process a large number of files overnight, with a defined start and end time so the work doesn't disrupt users. I've been investigating, but there are so many ways of handling threading now that I'm not sure which way to go. The files arrive in an Exchange inbox as attachments.
My current attempt, based on some examples from here and a bit of experimentation, is:
while (DateTime.Now < dtEndTime.Value)
{
    // Each pass queues a batch of work items, then waits for all of them.
    var finished = new CountdownEvent(1);
    for (int i = 0; i < numThreads; i++)
    {
        object state = offset; // capture this work item's offset
        finished.AddCount();
        ThreadPool.QueueUserWorkItem(delegate
        {
            try
            {
                StartProcessing(state);
            }
            finally
            {
                finished.Signal();
            }
        });
        offset += numberOfFilesPerPoll;
    }
    finished.Signal(); // release the initial count
    finished.Wait();   // block until every queued item has signalled
}
It's running in a WinForms app at the moment for ease of testing, but the core processing lives in a DLL, so I can spawn the class I need from a Windows service, from a console app running under a scheduler, or whatever is easiest. I do have a Windows service set up with a Timer object that kicks off the processing at a time set in the config file.
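The scheduling part of the service looks roughly like this (simplified; "StartTime" is just the name of my config key):

    // Read the nightly start time from the config file, e.g. "22:00".
    TimeSpan startTime = TimeSpan.Parse(
        ConfigurationManager.AppSettings["StartTime"]);

    DateTime now = DateTime.Now;
    DateTime firstRun = now.Date + startTime;
    if (firstRun <= now)
        firstRun = firstRun.AddDays(1); // already past today's slot, go tomorrow

    // Fire once at the configured time, then repeat every 24 hours.
    var timer = new System.Threading.Timer(
        _ => StartProcessing(),
        null,
        firstRun - now,          // due time
        TimeSpan.FromDays(1));   // period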
So my question is: in the above code I initialise a bunch of threads (currently 10), then wait for them all to finish. My ideal would be a fixed number of threads where, as one finishes, I fire off another, and then when I reach the end time I just wait for any remaining threads to complete. The reason is that the files I'm processing vary in size: some take seconds to process and some take hours, so I don't want the whole application to sit waiting on one slow thread when the rest could be ticking along in the background.

Edit: As it stands, each thread instantiates a class and passes it an offset. The class then gets the next x emails from the inbox, starting at that offset (using the Exchange Web Services paging functionality). As each file is processed, it's moved to a separate folder. From some of the replies so far, I'm wondering whether I should instead grab the e-mails in the outer loop and spawn threads as needed (see the sketch below). To cloud the issue, I currently have a backlog of e-mails that I'm trying to work through; once the backlog is cleared, the nightly run is likely to have a significantly lower load.
On average there are around 1000 files to process each night.
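What I'm picturing is a producer/consumer queue: one thread pages e-mails out of the inbox while a fixed pool of workers drains it, so one slow file never blocks a refill. A rough sketch of that shape (GetNextBatch and ProcessEmail stand in for my real paging and processing calls):

    // Bounded queue so the producer can't race miles ahead of the workers.
    var queue = new BlockingCollection<Item>(boundedCapacity: 100);

    // Producer: pages e-mails from the inbox until the end time.
    var producer = Task.Run(() =>
    {
        while (DateTime.Now < dtEndTime.Value)
        {
            foreach (var email in GetNextBatch()) // EWS paged search
                queue.Add(email);                 // blocks while the queue is full
        }
        queue.CompleteAdding(); // lets the workers drain and exit
    });

    // Consumers: a fixed pool; each worker takes the next e-mail as soon
    // as it finishes the last one, so one huge file doesn't stall the rest.
    var workers = new Task[numThreads];
    for (int i = 0; i < numThreads; i++)
    {
        workers[i] = Task.Run(() =>
        {
            foreach (var email in queue.GetConsumingEnumerable())
                ProcessEmail(email);
        });
    }

    Task.WaitAll(workers);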
Update
I've rewritten large chunks of my code so that I can use Parallel.ForEach, and I've come up against an issue with thread safety. The calling code now looks like this:
public bool StartProcessing()
{
    FindItemsResults<Item> emails = GetEmails();

    // Give up after ten hours, regardless of progress.
    var source = new CancellationTokenSource(TimeSpan.FromHours(10));

    // Process files in parallel, with a maximum thread count.
    var opts = new ParallelOptions { MaxDegreeOfParallelism = 8, CancellationToken = source.Token };
    try
    {
        Parallel.ForEach(emails, opts, processAttachment);
    }
    catch (OperationCanceledException)
    {
        Console.WriteLine("Loop was cancelled.");
    }
    catch (Exception err)
    {
        WriteToLogFile(err.Message + "\r\n");
        WriteToLogFile(err.StackTrace + "\r\n");
    }
    return true;
}
So far so good (excuse the temporary error handling). My new issue is that the properties of the Item object, which is an e-mail, are not thread-safe. For example, when I start processing an e-mail I move it to a "processing" folder so that another process can't grab it, but it turns out that several of the threads can end up trying to process the same e-mail at the same time. How do I guarantee that this doesn't happen? I know I need to add a lock; can I add it in the ForEach, or should it be in the processAttachment method?
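For now I'm experimenting with a test-and-claim lock inside processAttachment, along these lines (a rough sketch; claimLock and claimedIds are names I've made up, and I'm assuming Item.Id.UniqueId uniquely identifies a message):

    // Claim each e-mail exactly once before doing any work on it.
    private static readonly object claimLock = new object();
    private static readonly HashSet<string> claimedIds = new HashSet<string>();

    private void processAttachment(Item email)
    {
        // Test-and-claim under the lock so two threads can't both win.
        lock (claimLock)
        {
            if (!claimedIds.Add(email.Id.UniqueId))
                return; // another thread already owns this e-mail
        }

        // Only the claiming thread gets here: safe to move the e-mail
        // to the "processing" folder and work through its attachments.
    }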