
I'm working on a console application written in C#.

The purpose of this app is to go through all drives and files and do something with them. But going through every file on a single thread is a time-consuming process, which is exactly what I want to avoid.

So I decided to use the ThreadPool to handle it, like so:

using System;
using System.IO;
using System.Threading;

class Program {
    static void Main(string[] args) {
        foreach (var d in DriveInfo.GetDrives()) {
            ThreadPool.QueueUserWorkItem(x => Search(d.RootDirectory.GetDirectories()));
        }

        Console.WriteLine("Job is done.");
        Console.ReadKey();
    }

    private static void Search(DirectoryInfo[] dirs) {
        foreach (var dir in dirs) {
            try {
                foreach (var f in dir.GetFiles()) {
                    ThreadPool.QueueUserWorkItem(x => DoTheJob(f));
                }

                ThreadPool.QueueUserWorkItem(x => Search(dir.GetDirectories()));
            } catch (Exception) {
                // skip directories we can't access
                continue;
            }
        }
    }
}

The problem is that `Console.WriteLine("Job is done.")` executes before all the queued work items have finished. I've read some related questions and answers, but none of them addressed my problem.

How can I call a method after all the threads in the ThreadPool have finished their work?

Note: I have no idea how many work items will be queued, because I don't know how many files there are. And setting a timeout is not an option.

Hooman Limouee

2 Answers


Using `QueueUserWorkItem()` is the low-level, bare-bones approach. It gives you no control over your jobs; it's fire-and-forget.

Tasks run on top of the ThreadPool, and async/await can solve your problem here.

The top level:

var tasks = new List<Task>();
foreach (var d in DriveInfo.GetDrives())
{
    tasks.Add( Search(d.RootDirectory.GetDirectories()));
}
Task.WaitAll(tasks.ToArray());
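
`Task.WaitAll` blocks the calling thread until every task in the array has completed, so execution only continues once all drives have been fully searched.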

and then your Search() becomes

private static async Task Search(DirectoryInfo[] dirs)
{
    ... 
    foreach(...)
    {
        await Task.Run(...);
    } 
    await Search(dir.GetDirectories());
}

That DoTheJob() thing should ideally use async I/O, but otherwise you can `await Task.Run(() => DoTheJob(f))`.
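
Fleshed out, and assuming the question's synchronous DoTheJob, that sketch could look like this (files on a drive are then processed one at a time, as the comment discussion below notes):

private static async Task Search(DirectoryInfo[] dirs)
{
    foreach (var dir in dirs)
    {
        try
        {
            foreach (var f in dir.GetFiles())
            {
                // off-load the synchronous work to the pool and wait for it
                await Task.Run(() => DoTheJob(f));
            }

            await Search(dir.GetDirectories());
        }
        catch (Exception)
        {
            // skip directories we aren't allowed to read
        }
    }
}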

H H
  • So if `DoTheJob` is not an async method, can this become a bottleneck? – Hooman Limouee May 08 '19 at 16:15
  • Using your provided solution takes a long time; it seems like a single-threaded job. Without the help of tasks and threads it took approximately 8 minutes to traverse all files, but using ThreadPool it took about 1 minute. – Hooman Limouee May 08 '19 at 16:19
  • But how much of it was really done after that minute? – H H May 08 '19 at 16:20
  • The job was just printing to the console, actually. I can't prove that all files were printed, though. – Hooman Limouee May 08 '19 at 16:22
  • There is no async API for getting the list of file system objects in a directory. On the other hand, `Task.Run` has no upper-limit control, which can easily lead to thread pool pressure with a large number of file system objects. My suggestion would be to use either the `Parallel` API, which is data-oriented and has control over parallelism, or TPL Dataflow, which is more suitable for non-trivial flows that may involve recursion (see the sketch after these comments). – Dmytro Mukalov May 08 '19 at 16:42
  • @DmytroMukalov - the async API is EnumerateFiles, but it isn't Task-based. The current approach will use too few Tasks sooner than too many; there will only be one DoTheJob active per drive. But for Console.WriteLine that's okay. – H H May 08 '19 at 16:51
  • Using `Parallel.ForEach` was a great fit for my problem. Is there a smart way to limit the threads? The CPU is being used at 100% (and it might be for a long time). @DmytroMukalov – Hooman Limouee May 08 '19 at 16:52
  • When your CPU is at 100% you do have enough parallelism already. It does surprise me a little, though. – H H May 08 '19 at 16:54
  • @HenkHolterman - When the application is running with `Parallel.ForEach` it's at 100%, though not all the time. – Hooman Limouee May 08 '19 at 17:00
  • @HenkHolterman, the problem isn't that you don't have enough parallelism; the problem is that you have more than enough, which may at some point cause the thread pool to spawn new threads. As far as I can see, the current approach produces a task for every file object it encounters, and that can become a problem. – Dmytro Mukalov May 08 '19 at 17:01
  • @Hooman, in order to effectively utilize `Parallel.ForEach`, the first task to solve is somehow getting rid of the recursion. – Dmytro Mukalov May 08 '19 at 17:06
  • @DmytroMukalov - I even limited `MaxDegreeOfParallelism` to use only 2 of the 4 processors, but still 100% CPU usage. You're right, getting rid of the recursion should be effective. – Hooman Limouee May 08 '19 at 17:11
  • You might start by replacing GetFiles with EnumerateFiles, and the same for the directories. – H H May 08 '19 at 17:13
  • @DmytroMukalov - yes there will be many tasks, but not at the same time. – H H May 08 '19 at 17:14
  • @HenkHolterman, I don't see a complete implementation, so I can only guess, but from what I can see of the original implementation it can be. The Tasks implementation omits some details, but I presume the number of directory-search tasks depends on how deep the file system structure is; that's only for the directories, though. Files are also supposed to be handled in parallel with `DoTheJob`, and in that case the number of tasks will depend on the number of files in a directory as well. – Dmytro Mukalov May 08 '19 at 17:24
  • No, files (on the same drive) are _not_ handled in parallel with DoTheJob(). The await makes everything sequential. – H H May 08 '19 at 17:28
  • @HenkHolterman, I can see `await` only for the subsequent search; there's no mention of `DoTheJob` in the code. – Dmytro Mukalov May 08 '19 at 17:34
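
For reference, a minimal sketch of the TPL Dataflow approach mentioned in the comments above, assuming the System.Threading.Tasks.Dataflow NuGet package and the question's synchronous DoTheJob; the pending counter is one way to complete a block that posts work to itself:

using System;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks.Dataflow;

class Program
{
    static void Main()
    {
        int pending = 0;
        ActionBlock<DirectoryInfo> walker = null;
        walker = new ActionBlock<DirectoryInfo>(dir =>
        {
            try
            {
                foreach (var f in dir.GetFiles())
                    DoTheJob(f);

                foreach (var sub in dir.GetDirectories())
                {
                    Interlocked.Increment(ref pending);
                    walker.Post(sub);   // recursion becomes posting back to the same block
                }
            }
            catch (Exception) { /* skip inaccessible directories */ }
            finally
            {
                // the last outstanding directory signals completion
                if (Interlocked.Decrement(ref pending) == 0)
                    walker.Complete();
            }
        }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

        var roots = DriveInfo.GetDrives().Select(d => d.RootDirectory).ToList();
        pending = roots.Count;   // count all roots before posting, so the block can't complete early
        foreach (var root in roots)
            walker.Post(root);

        walker.Completion.Wait();
        Console.WriteLine("Job is done.");
    }

    // placeholder for the question's per-file work
    static void DoTheJob(FileInfo f) => Console.WriteLine(f.FullName);
}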

Here is an example of how you can use `Parallel.ForEach` to produce a fair load:

static IEnumerable<FileSystemInfo> GetFileSystemObjects(DirectoryInfo dirInfo)
{
    FileInfo[] files;
    DirectoryInfo[] dirs;
    try
    {
        files = dirInfo.GetFiles();
        dirs = dirInfo.GetDirectories();
    }
    catch (UnauthorizedAccessException)
    {
        yield break;   // skip directories we aren't allowed to read
    }

    foreach (var file in files)
        yield return file;

    foreach (var dir in dirs)
    {
        foreach (var fso in GetFileSystemObjects(dir))
            yield return fso;
        yield return dir;
    }
}

static void Main(string[] args)
{
    var files = GetFileSystemObjects(new DirectoryInfo(<some path>)).OfType<FileInfo>();

    Parallel.ForEach(files, f =>
    {
        DoTheJob(f);
    });
}

If, however, DoTheJob contains I/O-bound operations, I'd consider handling it with `await` as Henk Holterman suggested, since `Parallel.ForEach` is agnostic to I/O load.
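
As for the CPU-usage concern raised in the comments, `Parallel.ForEach` accepts a `ParallelOptions` instance that caps the degree of parallelism. A minimal sketch, reusing the `files` query from above:

var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount / 2 };

Parallel.ForEach(files, options, f =>
{
    DoTheJob(f);   // at most ProcessorCount / 2 files are processed concurrently
});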

Dmytro Mukalov