1

I have a WPF app that reads an Outlook .pst file, extracts each message, and saves both it and any attachments as .pdf files. After that's all done, it does some other processing on the files.

I'm currently using a plain old foreach loop for the first part. Here is a rather simplified version of the code...

// These two are used by the WPF UI to display progress
string BusyContent;
ObservableCollection<string> Msgs = new();
// See note lower down about the quick-and-dirty logging
string _logFile = @"C:\Path\To\LogFile.log";
// _allFiles is used to keep a record of all the files we generate. Used after the loop ends
List<string> _allFiles = new();
// nCurr is used to update BusyContent, which is bound to the UI to show progress
int nCurr = 0;
// The messages would really be extracted from the .pst file. Empty list used for simplicity
List<Message> messages = new();

async Task ProcessMessages() {
  using StreamWriter logFile = new(_logFile, true);
  foreach (Message msg in messages) {
    nCurr++;
    string fileName = GenerateFileName(msg);
    // We log a lot more, but only one shown for simplicity
    Log(logFile, $"File: {fileName}");
    _allFiles.Add(fileName);
    // Let the user know where we are up to
    BusyContent = $"Processing message {nCurr}";
    // Msgs is bound to a WPF grid, so we need to use Dispatcher to update
    Application.Current.Dispatcher.Invoke(() => Msgs.Add(fileName));
    // Finally we write out the .pdf files
    await ProcessMessage(msg);
  }
}

async Task ProcessMessage(Message msg) {
  // The methods called here are omitted as they aren't relevant to my questions
  await GenerateMessagePdf(msg);
  foreach(Attachment a in msg.Attachments) {
    string fileName = GenerateFileName(a);
    // Note that we update _allFiles here as well as in the main loop
    _allFiles.Add(fileName);
    await GenerateAttachmentPdf(a);
  }
}

static void Log(StreamWriter logFile, string msg) =>
  logFile.WriteLine(DateTime.Now.ToString("yyMMdd-HHmmss.fff") + " - " + msg);

This all works fine, but can take quite some time on a large .pst file. I'm wondering if converting this to use Parallel.ForEach would speed things up. I can see the basic usage of this method, but have a few questions, mainly concerned with the class-level variables that are used within the loop...

  1. The logFile variable is passed around. Will this cause issues? This isn't a major problem, as this logging was added as a quick-and-dirty debugging device, and really should be replaced with a proper logging framework, but I'd still like to know if what I'm dong would be an issue in the parallel version

  2. nCurr is updated inside the loop. Is this safe, or is there a better way to do this?

  3. _allFiles is also updated inside the main loop. I'm only adding entries, not reading or removing, but is this safe?

  4. Similarly, _allFiles is updated inside the ProcessMessage method. I guess the answer to this question depends on the previous one.

  5. Is there a problem updating BusyContent and calling Application.Current.Dispatcher.Invoke inside the loop?

Thanks for any help you can give.

Charlieface
  • 52,284
  • 6
  • 19
  • 43
Avrohom Yisroel
  • 8,555
  • 8
  • 50
  • 106
  • 2
    [`List` is not threadsafe so, don't do `_allFiles.Add(fileName);`](https://stackoverflow.com/questions/5589315/list-add-thread-safety). You should use a [threadsafe collection](https://learn.microsoft.com/en-us/dotnet/standard/collections/thread-safe/) object – Liam Nov 02 '21 at 17:05
  • 2
    None of this is thread-safe. Use `Interlocked` for global variables, use concurrent collections instead of `List`. So in answer: 1. No, you need to use `lock` 2. not safe, use `Interlocked` 3. ditto 4. ditto 5. No issue. – Charlieface Nov 02 '21 at 17:16
  • @Charlieface Thanks to both of you for your comments, pretty much what I expected. Is there a more idiomatic way of doing this, or are `lock`, `Interlocked` and so on the right way to do it? Thanks again – Avrohom Yisroel Nov 02 '21 at 18:09
  • 2
    My suggestion is to get rid of all the side-effecting code, because it makes parallelization challenging, fragile and inefficient. Try to make your methods `static`, accept all the input through their arguments, and propagate their output through their return value. For example don't return `Task`, return `Task>` instead. – Theodor Zoulias Nov 02 '21 at 20:21

1 Answers1

2

At first, it is necessary to use thread safe collections:

ObservableConcurrentCollection<string> Msgs = new();
ConcurrentQueue<string> _allFiles = new();

ObservableConcurrentCollection can be installed through NuGet. ConcurrentQueue is located in using System.Collections.Concurrent;. Special thanks to Theodor Zoulias for the pointing out that there is better option for ConcurentBag.

And then it is possible to use Parallel.ForEachor Task.

Parallel.ForEach uses Partitioner which allows to avoid creation more tasks than necessary. So it tries to run each method in parallel. So it is better to exclude async and await keywords of methods which participate in Parallel.ForEach.

    async Task  ProcessMessages()
    {
        using StreamWriter logFile = new(_logFile, true);

        await Task.Run(() => {
            Parallel.ForEach(messages, msg =>
            {
                var currentCount = Interlocked.Increment(ref nCurr);
                string fileName = GenerateFileName(msg);
                Log(logFile, $"File: {fileName}");
                _allFiles.Enqueue(fileName);
                BusyContent = $"Processing message {currentCount}";
                ProcessMessage(msg);
            });
        });
    }
    
    
    int ProcessMessage(Message msg)
    {
        // The methods called here are omitted as they aren't relevant to my questions
        var message = GenerateMessagePdf(msg);
        foreach (Attachment a in msg.Attachments)
        {
            string fileName = GenerateFileName(a);                
            _allFiles.Enqueue(fileName);
            GenerateAttachmentPdf(a);
        }
        return msg.Id;
    }


    private string GenerateAttachmentPdf(Attachment a) => string.Empty;


    private string GenerateMessagePdf(Message message) => string.Empty;


    string GenerateFileName(Attachment attachment) => string.Empty;


    string GenerateFileName(Message message) => string.Empty;


    void Log(StreamWriter logFile, string msg) =>
      logFile.WriteLine(DateTime.Now.ToString("yyMMdd-HHmmss.fff") + " - " + msg);
    

And another way is awaiting all tasks. In this case, there is no need to exclude async and await keywords.

    async Task ProcessMessages()
    {
        using StreamWriter logFile = new(_logFile, true);            
        var messageTasks = messages.Select(msg =>
        {                
            var currentCount = Interlocked.Increment(ref nCurr);
            string fileName = GenerateFileName(msg);                
            Log(logFile, $"File: {fileName}");
            _allFiles.Enqueue(fileName);                
            BusyContent = $"Processing message {currentCount}";
            return ProcessMessage(msg);
        });

        var msgs = await Task.WhenAll(messageTasks);
    }
StepUp
  • 36,391
  • 15
  • 88
  • 148
  • @downvoter, could you say the reason of downvoting? It would be really helpful to me as it is a chance to improve my answer. Thanks! – StepUp Nov 03 '21 at 09:30
  • 1
    StepUp sorry for the downvote, but I think that your answer does nothing to address the OP's considerations about the thread-safety of their code. AFAICS your suggested improvement is still mutating the `_allFiles` list from multiple threads concurrently without synchronization, with the most probable outcome being undefined behavior. – Theodor Zoulias Nov 03 '21 at 09:50
  • 1
    @TheodorZoulias It is okay. Thanks for the explained downvoting :). I forgot this point. I've updated my reply. – StepUp Nov 03 '21 at 10:14
  • 1
    Now I have no reason to downvote. :-) But I am not inclined to upvote either, because of my [deep personal distaste](https://stackoverflow.com/questions/15400133/when-to-use-blockingcollection-and-when-concurrentbag-instead-of-listt/64823123#64823123) for the `ConcurrentBag` collection! – Theodor Zoulias Nov 03 '21 at 10:18
  • 1
    @TheodorZoulias thanks that you've pointed out:). It is very nice post. I've edited to `ConcurrentQueue`. It is very good data structure. – StepUp Nov 03 '21 at 10:30
  • 1
    @StepUp Thanks very much for this. It all seems to be working, and has cut the execution time by about 40% which is pretty good! – Avrohom Yisroel Nov 03 '21 at 18:25
  • @AvrohomYisroel I am really glad that it helped to you! :) – StepUp Nov 03 '21 at 18:56