
I've been trying to get what I believe to be the simplest possible form of threading to work in my application but I just can't do it.

What I want to do: I have a main form with a status strip and a progress bar on it. I have to read between 3 and 99 files, compute their hashes, and store each file's details in a string[] which I add to a list of all files with their respective hashes. Afterwards I have to compare the items on that list against a database (which comes as text files). Once all that is done, I have to update a textbox in the main form and set the progress bar to 33%; mostly I just don't want the main form to freeze during processing.

The files I'm working with always sum up to 1.2GB (+/- a few MB), meaning I should be able to read them into byte[]s and process them from there (I have to calculate the CRC32, MD5 and SHA1 of each of those files, so that should be faster than reading all of them from an HDD 3 times).

Also I should note that some files may be 1MB while another may be 1GB. I initially wanted to create 99 threads for 99 files, but that seems unwise; I suppose it would be best to reuse the threads of small files while the threads for bigger files are still running. But that sounds pretty complicated to me, so I'm not sure that's wise either.

So far I've tried worker threads and BackgroundWorkers, but neither seems to work too well for me; at least the BackgroundWorkers worked SOME of the time, but I can't even figure out why they won't the other times... either way, the main form still froze. Now I've read about the Task Parallel Library in .NET 4.0, but I thought I'd better ask someone who knows what he's doing before wasting more time on this.

What I want to do looks something like this (without threading):

List<string[]> fileSpecifics = new List<string[]>();

int fileMaxNumber = 42; // something between 3 and 99, depending on file set

for (int i = 1; i <= fileMaxNumber; i++)
{
    string fileName = "C:\\path\\to\\file" + i.ToString("D2") + ".ext"; // file01.ext - file99.ext
    string fileSize = new FileInfo(fileName).Length.ToString();
    byte[] file = File.ReadAllBytes(fileName);
    // hash calculations (using SHA1CryptoServiceProvider() etc., no problems with that so I'll spare you that, return strings)
    file = null; // I didn't yet check if this made any actual difference but I figured it couldn't hurt
    fileSpecifics.Add(new string[] { fileName, fileSize, fileCRC, fileMD5, fileSHA1 });
}

// look for files in text database mentioned above, i.e. first check for "file bundles" with the same amount of files I have here; then compare file sizes, then hashes
// again, no problems with that so I'll spare you that; the database text files are pretty small so parsing them doesn't need to be done in an extra thread.

Would anybody be kind enough to point me in the right direction? I'm looking for the easiest way to read and hash those files quickly (I believe the hashing takes some time during which other files could already be read) and save the output to a string[], without the main form freezing, nothing more, nothing less.

I'm thankful for any input.

EDIT to clarify: by "BackgroundWorkers working some of the time" I meant that, for the very same set of files, maybe the first and fourth executions of my code produce the correct output and the UI unfreezes within 5 seconds, while on the second, third and fifth executions it freezes the form (and after 60 seconds I get an error message saying some thread didn't respond within that time frame) and I have to stop execution via VS.

Thanks for all your suggestions and pointers, as you all have correctly guessed I'm completely new to threading and will have to read up on the great links you guys posted. Then I'll give those methods a try and flag the answer that helped me the most. Thanks again!

Leniel Maccaferri
A. Dumas
    What do you mean by the BackgroundWorker working some of the time? If implemented correctly, processing done within the BackgroundWorker should not cause the form to freeze. – evasilchenko Mar 27 '12 at 18:17
  • If they are on 1 Disk, you only need 1 (extra) Thread. – H H Mar 27 '12 at 18:22
    This article may be of help to you: http://www.hanselman.com/blog/BackToParallelBasicsDontBlockYourThreadsMakeAsyncIOWorkForYou.aspx – Eric Lippert Mar 27 '12 at 20:41

6 Answers


With .NET Framework 4.X

  1. Use the Directory.EnumerateFiles method for efficient, lazy file enumeration
  2. Use Parallel.For()/Parallel.ForEach() to let the TPL partition the work, or use a single Task per pipeline stage
  3. Use the Pipeline pattern to pipeline the following stages: calculating hash codes, comparing with the pattern, updating the UI
  4. To avoid freezing the UI, use the appropriate technique: for WPF use Dispatcher.BeginInvoke(), for WinForms use Invoke(), see this SO answer
  5. Considering that all this stuff has a UI, it might be useful to add a cancellation feature to abandon a long-running operation if needed; take a look at CancellationTokenSource.CreateLinkedTokenSource(), which allows triggering a CancellationToken from the "external scope". I could add an example, but it's worth doing it yourself so you actually learn all this stuff rather than simply copy/paste -> got it working -> forgot about it.
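For point 5, a minimal console sketch of the linked-token idea (variable names are mine, not from the answer): cancelling the "external" source, e.g. from a Cancel button, also cancels the token the pipeline actually observes.

```csharp
using System;
using System.Threading;

class LinkedCancellationDemo
{
    static void Main()
    {
        // "External" token source, e.g. triggered by a Cancel button on the form.
        var externalCts = new CancellationTokenSource();

        // Token source for one pipeline run, linked to the external one:
        // cancelling either source cancels the linked token.
        using (var linkedCts =
            CancellationTokenSource.CreateLinkedTokenSource(externalCts.Token))
        {
            CancellationToken token = linkedCts.Token;
            Console.WriteLine(token.IsCancellationRequested); // False

            externalCts.Cancel(); // user cancels in the "external scope"
            Console.WriteLine(token.IsCancellationRequested); // True
        }
    }
}
```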

PS: Must read - Pipelines paper at MSDN


TPL specific pipeline implementation

  • Pipeline pattern implementation: three stages: calculate hash, match, update UI
  • Three tasks, one per stage
  • Two Blocking Queues


// 1) CalculateHashesImpl() should store all calculated hashes here
// 2) CompareMatchesImpl() should read input hashes from this queue
// Tuple.Item1 - hash, Tuple.Item2 - file path
var calculatedHashes = new BlockingCollection<Tuple<string, string>>();


// 1) CompareMatchesImpl() should store all pattern matching results here
// 2) SyncUiImpl() method should read from this collection and update 
//    UI with available results
var comparedMatches = new BlockingCollection<string>();

var factory = new TaskFactory(TaskCreationOptions.LongRunning,
                              TaskContinuationOptions.None);


var calculateHashesWorker = factory.StartNew(() => CalculateHashesImpl(...));
var comparedMatchesWorker = factory.StartNew(() => CompareMatchesImpl(...));
var syncUiWorker = factory.StartNew(() => SyncUiImpl(...));

Task.WaitAll(calculateHashesWorker, comparedMatchesWorker, syncUiWorker);

CalculateHashesImpl():

private void CalculateHashesImpl(string directoryPath)
{
   foreach (var file in Directory.EnumerateFiles(directoryPath))
   {
       var hash = CalculateHashTODO(file);
       calculatedHashes.Add(new Tuple<string, string>(hash, file)); // file is already the full path
   }
}

CompareMatchesImpl():

private void CompareMatchesImpl()
{
   foreach (var hashEntry in calculatedHashes.GetConsumingEnumerable())
   {
      // TODO: obviously return type is up to you
      string matchResult = GetMatchResultTODO(hashEntry.Item1, hashEntry.Item2);
      comparedMatches.Add(matchResult);
   }
}

SyncUiImpl():

private void SyncUiImpl()
{
    foreach (var matchResult in comparedMatches.GetConsumingEnumerable())
    {
        // TODO: track progress in the UI using UI-framework-specific features
        // (Invoke/BeginInvoke) so the UI does not freeze
    }
}

TODO: Consider passing a CancellationToken to all GetConsumingEnumerable() calls so you can easily stop the pipeline's execution when needed.
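As a sketch of that TODO (names are mine): passing the token to GetConsumingEnumerable() wakes up a blocked consumer and ends the loop even when the producer never calls CompleteAdding().

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class PipelineCancellationDemo
{
    static void Main()
    {
        var queue = new BlockingCollection<int>();
        var cts = new CancellationTokenSource();
        int consumed = 0;

        var consumer = Task.Factory.StartNew(() =>
        {
            try
            {
                // With the token, a consumer blocked on an empty queue can be
                // cancelled; without it, it would wait forever for CompleteAdding().
                foreach (var item in queue.GetConsumingEnumerable(cts.Token))
                    Interlocked.Increment(ref consumed);
            }
            catch (OperationCanceledException)
            {
                // expected when the pipeline is cancelled
            }
        });

        queue.Add(1);
        queue.Add(2);
        Thread.Sleep(200);   // give the consumer time to drain the queue
        cts.Cancel();        // stop the pipeline without completing the queue
        consumer.Wait();

        Console.WriteLine(consumed); // 2
    }
}
```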

sll

First off, you should be using a higher level of abstraction to solve this problem. You have a bunch of tasks to complete, so use the "task" abstraction. You should be using the Task Parallel Library to do this sort of thing. Let the TPL deal with the question of how many worker threads to create -- the answer could be as low as one if the work is gated on I/O.

If you do want to do your own threading, some good advice:

  • Do not ever block on the UI thread. That is what is freezing your application. Come up with a protocol by which working threads can communicate with your UI thread, which then does nothing except respond to UI events. Remember that methods of user interface controls like progress bars must never be called by any thread other than the UI thread.

  • Do not create 99 threads to read 99 files. That's like getting 99 pieces of mail and hiring 99 assistants to write responses: an extraordinarily expensive solution to a simple problem. If your work is CPU intensive then there is no point in "hiring" more threads than you have CPUs to service them. (That's like hiring 99 assistants in an office that only has four desks. The assistants spend most of their time waiting for a desk to sit at instead of reading your mail.) If your work is disk-intensive then most of those threads are going to be idle most of the time waiting for the disk, which is an even bigger waste of resources.
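To illustrate the second point, a small console sketch (my own, not Eric's code) showing that the TPL services many work items with only a handful of threads when you cap the degree of parallelism:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class DegreeOfParallelismDemo
{
    static void Main()
    {
        // Record which threads actually ran work items.
        var threadIds = new ConcurrentDictionary<int, bool>();

        // 99 work items, but at most ProcessorCount run concurrently;
        // the TPL reuses a small pool of threads instead of creating 99.
        Parallel.ForEach(
            Enumerable.Range(1, 99),
            new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
            i =>
            {
                threadIds[Thread.CurrentThread.ManagedThreadId] = true;
                Thread.SpinWait(100000); // simulate hashing work
            });

        // Far fewer threads than work items were used.
        Console.WriteLine(threadIds.Count < 99); // True
    }
}
```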

Eric Lippert
  • +1. With a non-SSD/flash, disk-bound task like this it is **likely** (prototype and measure) that adding extra threads will actually slow things down due to hardware restrictions. – Alexei Levenkov Mar 27 '12 at 18:29

First, I hope you are using a built-in library for calculating hashes. It's possible to write your own, but it's far safer to use something that has been around for a while.

You may only need to create as many threads as you have CPUs if your process is CPU-bound. If it is bound by I/O, you might be able to get away with more threads.

I do not recommend loading the entire file into memory. Your hashing library should support updating a chunk at a time. Read a chunk into memory, use it to update the hashes for each algorithm, read the next chunk, and repeat until the end of the file. The chunked approach will help lower your program's memory demands.
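As a sketch of the chunked approach using the built-in System.Security.Cryptography types (the helper name is mine), feeding a stream through TransformBlock yields the same hash as hashing the whole buffer at once:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

class ChunkedHashDemo
{
    // Hash a stream in fixed-size chunks instead of loading the whole file.
    static string HashInChunks(Stream stream, HashAlgorithm algorithm, int chunkSize = 64 * 1024)
    {
        byte[] buffer = new byte[chunkSize];
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            algorithm.TransformBlock(buffer, 0, read, null, 0);
        algorithm.TransformFinalBlock(buffer, 0, 0); // finalize the hash
        return BitConverter.ToString(algorithm.Hash).Replace("-", "");
    }

    static void Main()
    {
        byte[] data = new byte[200 * 1024]; // stand-in for a big file
        using (var md5Chunked = MD5.Create())
        using (var md5Whole = MD5.Create())
        {
            string chunked = HashInChunks(new MemoryStream(data), md5Chunked);
            string whole = BitConverter.ToString(md5Whole.ComputeHash(data)).Replace("-", "");
            Console.WriteLine(chunked == whole); // True: same hash either way
        }
    }
}
```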

As others have suggested, look into the Task Parallel Library, particularly Data Parallelism. It might be as easy as this:

Parallel.ForEach(fileSpecifics, item => CalculateHashes(item));
Paul Williams
  • Your one liner solution is of course most intriguing :) But of course I'll have to read much more. For MD5 and SHA1 I use System.Security.Cryptography but I couldn't find the same thing for CRC32 so I used Damien Guard's public domain CRC32 class. I'll have to check if that supports calculation with chunks. Thanks! – A. Dumas Mar 27 '12 at 18:40

Check out TPL Dataflow. You can use a throttled ActionBlock which will manage the hard part for you.
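For illustration, a minimal sketch of a throttled ActionBlock (requires the System.Threading.Tasks.Dataflow NuGet package; the limits and names are mine): the block hashes at most two files at a time, no matter how fast you feed it.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks.Dataflow; // NuGet: System.Threading.Tasks.Dataflow

class ThrottledActionBlockDemo
{
    static void Main()
    {
        int processed = 0;

        // At most 2 files are hashed concurrently; BoundedCapacity keeps the
        // input queue from buffering every file at once.
        var hashBlock = new ActionBlock<string>(
            file =>
            {
                Thread.SpinWait(100000); // simulate hashing `file`
                Interlocked.Increment(ref processed);
            },
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = 2,
                BoundedCapacity = 4
            });

        for (int i = 1; i <= 10; i++)
            hashBlock.SendAsync("file" + i.ToString("D2") + ".ext").Wait();

        hashBlock.Complete();        // no more input
        hashBlock.Completion.Wait(); // wait for all posted items to finish

        Console.WriteLine(processed); // 10
    }
}
```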

David Peden

If my understanding is correct that you are looking to perform some tasks in the background without blocking your UI, then the BackgroundWorker would be an appropriate choice. You mentioned that you got it working some of the time, so my recommendation would be to take what you had in a semi-working state and improve upon it by tracking down the failures. If my hunch is correct, your worker was throwing an exception, which it does not appear you are handling in your code. Unhandled exceptions that bubble out of their containing threads make bad things happen.
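As a sketch of that failure mode (the exception message and handlers are mine): an exception thrown in DoWork does not crash the worker thread; it is captured and surfaces as e.Error in RunWorkerCompleted, and ignoring e.Error silently hides the failure.

```csharp
using System;
using System.ComponentModel;
using System.Threading;

class WorkerErrorDemo
{
    static void Main()
    {
        var done = new ManualResetEvent(false);
        var worker = new BackgroundWorker();

        worker.DoWork += (s, e) =>
        {
            // Any unhandled exception here is caught by the BackgroundWorker...
            throw new InvalidOperationException("file could not be read");
        };

        worker.RunWorkerCompleted += (s, e) =>
        {
            // ...and shows up as e.Error; always check it before touching e.Result.
            Console.WriteLine(e.Error != null);  // True
            Console.WriteLine(e.Error.Message);  // file could not be read
            done.Set();
        };

        worker.RunWorkerAsync();
        done.WaitOne();
    }
}
```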

Sean H

This code hashes one file (stream) using two tasks: one for reading, the second for hashing. For a more robust approach you should read several chunks ahead.

Because the processor's bandwidth is much higher than the disk's, you gain nothing from hashing more files concurrently unless you use some high-speed flash drive.

public void TransformStream(Stream a_stream, long a_length = -1)
{
    Debug.Assert((a_length == -1 || a_length > 0));

    if (a_stream.CanSeek)
    {
        if (a_length > -1)
        {
            if (a_stream.Position + a_length > a_stream.Length)
                throw new IndexOutOfRangeException();
        }

        if (a_stream.Position >= a_stream.Length)
            return;
    }

    System.Collections.Concurrent.ConcurrentQueue<byte[]> queue =
        new System.Collections.Concurrent.ConcurrentQueue<byte[]>();
    System.Threading.AutoResetEvent data_ready = new System.Threading.AutoResetEvent(false);
    System.Threading.AutoResetEvent prepare_data = new System.Threading.AutoResetEvent(false);

    Task reader = Task.Factory.StartNew(() =>
    {
        long total = 0;

        for (; ; )
        {
            byte[] data = new byte[BUFFER_SIZE];
            int readed = a_stream.Read(data, 0, data.Length);

            if ((a_length == -1) && (readed != BUFFER_SIZE))
                data = data.SubArray(0, readed);
            else if ((a_length != -1) && (total + readed >= a_length))
                data = data.SubArray(0, (int)(a_length - total));

            total += data.Length;

            queue.Enqueue(data);
            data_ready.Set();

            if (a_length == -1)
            {
                if (readed != BUFFER_SIZE)
                    break;
            }
            else if (a_length == total)
                break;
            else if (readed != BUFFER_SIZE)
                throw new EndOfStreamException();

            prepare_data.WaitOne();
        }
    });

    Task hasher = Task.Factory.StartNew((obj) =>
    {
        IHash h = (IHash)obj;
        long total = 0;

        for (; ; )
        {
            data_ready.WaitOne();

            byte[] data;
            queue.TryDequeue(out data);

            prepare_data.Set();

            total += data.Length;

            if ((a_length == -1) || (total < a_length))
            {
                h.TransformBytes(data, 0, data.Length);
            }
            else
            {
                // hash only the bytes that fall within the requested length
                int readed = data.Length - (int)(total - a_length);
                h.TransformBytes(data, 0, readed);
            }

            if (a_length == -1)
            {
                if (data.Length != BUFFER_SIZE)
                    break;
            }
            else if (a_length == total)
                break;
            else if (data.Length != BUFFER_SIZE)
                throw new EndOfStreamException();
        }
    }, this);

    reader.Wait();
    hasher.Wait();
}

Rest of code here: http://hashlib.codeplex.com/SourceControl/changeset/view/71730#514336

tomanu