
Currently I have a .txt file of about 170,000 .jpg file names, and I read them all into a List (fileNames).

I want to search ONE folder (this folder has sub-folders) to check if each file in fileNames exists in this folder, and if it does, copy it to a new folder.

I made a rough estimate: each search and copy for a file name in fileNames takes about 0.5 seconds, so 170,000 names at 0.5 seconds each is about 85,000 seconds, or roughly 24 hours for my app to search for every single file name using 1 thread! Obviously this is too long, so I want to speed the process up. What is the best way to go about doing this using multi-threading?

Currently I am thinking of making 20 separate threads, splitting my list (fileNames) into 20 smaller lists, and searching for the files simultaneously. For example, I would have 20 different threads executing the code below at the same time:

            foreach (string str in fileNames)
            {
                // Each GetFiles call performs a full recursive scan of the
                // folder tree for a single name, which is what makes this slow.
                foreach (var file in Directory.GetFiles(folderToCheckForFileName, str, SearchOption.AllDirectories))
                {
                    string combinedPath = Path.Combine(newTargetDirectory, Path.GetFileName(file));
                    if (!File.Exists(combinedPath))
                    {
                        File.Copy(file, combinedPath);
                    }
                }
            }

UPDATED TO SHOW MY SOLUTION BELOW:

            // Scan the folder tree once up front and keep the full paths in memory.
            string[] folderToCheckForFileNames = Directory.GetFiles("C:\\Users\\Alex\\Desktop\\ok", "*.jpg", SearchOption.AllDirectories);

            foreach (string str in fileNames)
            {
                // Compare the current name against every file found above, in parallel.
                Parallel.ForEach(folderToCheckForFileNames, currentFile =>
                {
                    string filename = Path.GetFileName(currentFile);
                    if (str == filename)
                    {
                        string combinedPath = Path.Combine(targetDir, filename);
                        if (!File.Exists(combinedPath))
                        {
                            File.Copy(currentFile, combinedPath);
                            Console.WriteLine("FOUND A MATCH AND COPIED " + currentFile);
                        }
                    }
                });
            }

Thank you everyone for your contributions! Greatly Appreciated!

Alex Urrutia
  • If I'm reading you right, why not read all of the file names once into memory, into something like a HashSet, and then use that to search for each file? As for speeding up disk IO with multiple threads, that only goes so far. Once disk IO is maxed out, it doesn't matter how many threads you have. – cost Jun 22 '15 at 03:37
  • It's not just disk IO; it will also heavily depend on the number of processing cores available to run the thread logic, so ultimately it's a bad solution – Mrinal Kamboj Jun 22 '15 at 03:43
  • Have you tried using TPL foreach? https://msdn.microsoft.com/en-us/library/dd460720(v=vs.110).aspx – qamar Jun 22 '15 at 03:51
  • So you guys are saying to actually read in ALL the actual .jpg files from folderToCheckForFileName into memory and search against that? Instead of checking the actual folder on my machine? – Alex Urrutia Jun 22 '15 at 04:12
  • Don't read the *files*, just read the list of filenames – Blorgbeard Jun 22 '15 at 04:34
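
For reference, a minimal sketch of the single-pass approach these comments describe: read the names into a HashSet once, walk the folder tree once, and test each file found against the set. The variable names (fileNames, folderToCheckForFileName, newTargetDirectory) are reused from the question; the case-insensitive comparer is an assumption.

var wanted = new HashSet<string>(fileNames, StringComparer.OrdinalIgnoreCase);

// Walk the folder tree once instead of once per name.
foreach (var file in Directory.EnumerateFiles(folderToCheckForFileName, "*.jpg", SearchOption.AllDirectories))
{
    string name = Path.GetFileName(file);
    if (wanted.Contains(name)) // O(1) set lookup instead of a directory search
    {
        string combinedPath = Path.Combine(newTargetDirectory, name);
        if (!File.Exists(combinedPath))
        {
            File.Copy(file, combinedPath);
        }
    }
}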

2 Answers


20 different threads won't help if your computer has fewer than 20 cores. In fact, they can make the process slower, because 1) you have to spend time context switching between threads (your CPU's way of emulating more threads than it has cores), and 2) each Thread in .NET reserves 1 MB for its stack, which is pretty hefty.

Instead, try dividing your I/O into async workloads, using Task.Run for the CPU-bound / intensive parts. Also, keep the number of Tasks to maybe 4 to 8 at most.

Sample code:

var tasks = new Task[8];
var names = fileNames.ToArray();
for (int i = 0; i < tasks.Length; i++)
{
    int index = i; // capture a copy of the loop variable for the closure
    tasks[i] = Task.Run(() =>
    {
        // Each task handles every 8th name, starting at its own offset.
        for (int current = index; current < names.Length; current += tasks.Length)
        {
            // execute the workload
            string str = names[current];
            foreach (var file in Directory.GetFiles(folderToCheckForFileName, str, SearchOption.AllDirectories))
            {
                string combinedPath = Path.Combine(newTargetDirectory, Path.GetFileName(file));
                if (!File.Exists(combinedPath))
                {
                    File.Copy(file, combinedPath);
                }
            }
        }
    });
}
Task.WaitAll(tasks);
James Ko
  • I went with Parallel foreach loop instead. I will keep this in mind to try this method in the future thank you. Any pros or cons between the two? – Alex Urrutia Jun 22 '15 at 06:24
  • Hm, it seems that you actually may have chosen a [better solution](http://stackoverflow.com/questions/5009181/parallel-foreach-vs-task-factory-startnew). The difference is that `Parallel.ForEach` is synchronous and blocks until everything is finished, but we already kind of did that since we did a `Task.WaitAll` at the end. Also, `Parallel.ForEach` uses a `Partitioner` to distribute the tasks evenly. See the link I posted for more details. – James Ko Jun 22 '15 at 14:47
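
To illustrate the comment above, a hedged sketch of the Parallel.ForEach form with a capped degree of parallelism. The cap of 8 mirrors the task count in this answer and is an assumption, as are the variable names reused from the question.

// Parallel.ForEach partitions fileNames automatically and blocks until
// every iteration has finished.
var options = new ParallelOptions { MaxDegreeOfParallelism = 8 }; // assumed cap
Parallel.ForEach(fileNames, options, str =>
{
    foreach (var file in Directory.GetFiles(folderToCheckForFileName, str, SearchOption.AllDirectories))
    {
        string combinedPath = Path.Combine(newTargetDirectory, Path.GetFileName(file));
        if (!File.Exists(combinedPath))
        {
            File.Copy(file, combinedPath);
        }
    }
});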

Instead of using an ordinary foreach statement for your search, you should use Parallel LINQ (PLINQ). PLINQ combines the simplicity and readability of LINQ syntax with the power of parallel programming, just like code that targets the Task Parallel Library. It shields you from low-level thread manipulation and hard-to-find/debug exceptions while splitting your work among many threads. So you might do something like this:

fileNames.AsParallel().ForAll(str =>
{
    var files = Directory.GetFiles(folderToCheckForFileName, str, SearchOption.AllDirectories);
    // A second level of parallelism over the matches found for this name.
    files.AsParallel().ForAll(file =>
    {
        if (!string.IsNullOrEmpty(file))
        {
            string combinedPath = Path.Combine(newTargetDirectory, Path.GetFileName(file));
            if (!File.Exists(combinedPath))
            {
                File.Copy(file, combinedPath);
            }
        }
    });
});
Cizaphil
  • I went with a parallel foreach loop thanks to qamar up above. This seems like the same concept except a little easier to read for me than my code. I will post an update of my code above. What are the differences between the solution I came up with via qamar and yours? – Alex Urrutia Jun 22 '15 at 06:25
  • There is not much difference between them. Both of them are looping constructs, though `Parallel.ForEach()` is more popular. But `ForAll()` is usually used at the end of a possibly complex PLINQ query (see the sketch below), so `Parallel.ForEach()` is the better choice in your case – Cizaphil Jun 22 '15 at 06:58
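
A small sketch of what that comment means by ForAll at the end of a PLINQ query; the Where filter here is purely illustrative and not from the answers above.

// ForAll is the terminal step of the PLINQ pipeline: the filtering runs in
// parallel and each surviving element is handed straight to the action.
fileNames.AsParallel()
         .Where(name => name.EndsWith(".jpg", StringComparison.OrdinalIgnoreCase))
         .ForAll(name => Console.WriteLine(name));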