My application indexes the contents of all hard drives on end users' computers. I am using Directory.GetFiles and Directory.GetDirectories to recursively process the whole folder structure. I am indexing only a few selected file types (up to 10 of them).

I can see in the profiler that most of the indexing time is spent enumerating files and folders - depending on the ratio of files that actually get indexed, up to 90 percent of the total time.

I would like to make the indexing as fast as possible. I have already optimized the indexing itself and the processing of the indexed files.

I was thinking of using Win32 API calls, but I can see in the profiler that most of the processing time is actually spent in the API calls that .NET already makes.

Is there a (possibly low-level) method accessible from C# that would make enumeration of files/folders at least partially faster?


As requested in the comments, my current code (just a skeleton with the irrelevant parts trimmed):

    private IEnumerable<IndexedEntity> RecurseFolder(string indexedFolder)
    {
        // for a single extension; extensionFilter is a search pattern like "*.ext"
        string[] files = Directory.GetFiles(indexedFolder, extensionFilter);
        foreach (string file in files)
        {
            yield return ProcessFile(file);
        }
        foreach (string directory in Directory.GetDirectories(indexedFolder))
        {
            //recursively process all subdirectories
            foreach (var ie in RecurseFolder(directory))
            {
                yield return ie;
            }
        }
    }
Marek
  • You mind sharing the code you have by now? – Bobby Jan 18 '10 at 11:09
  • Performance-wise, the "API depth" doesn't matter much. Most important is the recursive strategy: reading/processing all files in the current directory before going into subfolders (Marc Gravell gets that right - assuming the GetFiles()/GetDirectories() calls do read everything, and the GetDirectories() call is served from the file system cache). – peterchen Jan 18 '10 at 12:33
  • A faster method is proposed here, I will try that: http://stackoverflow.com/questions/724148/is-there-a-faster-way-to-scan-through-a-directory-recursively-in-net/724184#724184 - still interested in other options though – Marek Jan 18 '10 at 12:56
  • That implementation - while skipping the .NET wrappers - goes into subdirectories early, thus keeping more state around and potentially thrashing the cache. I am surprised at the 5-10x claim, though - please measure before you pick (and measure carefully...) – peterchen Jan 18 '10 at 13:16

2 Answers

In .NET 4.0, there are built-in enumerable file-listing methods (Directory.EnumerateFiles and friends); since .NET 4.0 isn't far away, I would try using those. This might be a factor in particular if you have any folders that are massively populated (since each GetFiles call requires a large array allocation).
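
For illustration, a minimal sketch of what that could look like, assuming .NET 4.0 and reusing indexedFolder, extensionFilter and ProcessFile from the question:

    private IEnumerable<IndexedEntity> RecurseFolder(string indexedFolder)
    {
        // EnumerateFiles streams results lazily instead of materializing an
        // array per directory; SearchOption.AllDirectories recurses for us.
        foreach (string file in Directory.EnumerateFiles(
            indexedFolder, extensionFilter, SearchOption.AllDirectories))
        {
            yield return ProcessFile(file);
        }
    }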

If depth is the issue, I would consider flattening your method to use a local stack/queue and a single iterator block. This will reduce the code path used to enumerate the deep folders:

    private static IEnumerable<string> WalkFiles(string path, string filter)
    {
        var pending = new Queue<string>();
        pending.Enqueue(path);
        string[] tmp;
        while (pending.Count > 0)
        {
            path = pending.Dequeue();
            // yield the matching files in the current directory first...
            tmp = Directory.GetFiles(path, filter);
            for (int i = 0; i < tmp.Length; i++)
            {
                yield return tmp[i];
            }
            // ...then queue the subdirectories for later processing
            tmp = Directory.GetDirectories(path);
            for (int i = 0; i < tmp.Length; i++)
            {
                pending.Enqueue(tmp[i]);
            }
        }
    }

Iterate over that, creating your ProcessFile calls from the results.
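
For example, a sketch of how this could replace the question's recursive method (ProcessFile and extensionFilter are the names from the question):

    private IEnumerable<IndexedEntity> RecurseFolder(string indexedFolder)
    {
        foreach (string file in WalkFiles(indexedFolder, extensionFilter))
        {
            yield return ProcessFile(file);
        }
    }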

Marc Gravell
  • One thing to add - watch out for reparse points. Otherwise, you might end up in an infinite recursion. For an example, see here: http://weblogs.asp.net/israelio/archive/2004/06/23/162913.aspx – peterchen Jan 18 '10 at 12:35
  • @peterchen - indeed; they're always fun. – Marc Gravell Jan 18 '10 at 12:40
  • .NET 4.0 is not an option for me, this is a .NET 2.0 application – Marek Jan 18 '10 at 12:52
  • They have been in since .NET 2.0: http://msdn.microsoft.com/de-de/library/07wt70x2(VS.80).aspx – peterchen Jan 18 '10 at 13:12
  • @peterchen: you have posted a different link - GetFiles has obviously been there for ages :). Marc refers to the Directory.EnumerateFiles method: http://msdn.microsoft.com/en-us/library/dd383571(VS.100).aspx – Marek Jan 18 '10 at 14:46

If you believe that the .NET implementation is causing the problem, then I suggest you use the Win32 API calls FindFirstFile, FindNextFile, etc. (or their CRT wrappers _findfirst, _findnext).

It seems to me that the .NET implementation requires a lot of memory because the directory listings are completely copied into arrays at each level - so if your directory structure is 10 levels deep, you have 10 versions of the files array alive at any given moment, plus an allocation/deallocation of such an array for every directory in the structure.

Using the same recursive technique with FindFirstFile etc. would only require that a handle to a position in the directory structure be kept at each level of recursion.
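
To illustrate, a minimal P/Invoke sketch of this approach that works on .NET 2.0 (NativeWalker and Walk are hypothetical names; error handling is omitted, and reparse points are skipped to avoid the infinite-recursion issue mentioned in the comments above):

    using System;
    using System.Collections.Generic;
    using System.Runtime.InteropServices;

    static class NativeWalker
    {
        [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
        struct WIN32_FIND_DATA
        {
            public uint dwFileAttributes;
            public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
            public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
            public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
            public uint nFileSizeHigh;
            public uint nFileSizeLow;
            public uint dwReserved0;
            public uint dwReserved1;
            [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
            public string cFileName;
            [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
            public string cAlternateFileName;
        }

        static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);
        const uint FILE_ATTRIBUTE_DIRECTORY = 0x10;
        const uint FILE_ATTRIBUTE_REPARSE_POINT = 0x400;

        [DllImport("kernel32.dll", CharSet = CharSet.Auto)]
        static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA data);

        [DllImport("kernel32.dll", CharSet = CharSet.Auto)]
        static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA data);

        [DllImport("kernel32.dll")]
        static extern bool FindClose(IntPtr hFindFile);

        // Enumerates full file paths under 'path'; only one WIN32_FIND_DATA
        // is alive per recursion level, instead of an array per directory.
        public static IEnumerable<string> Walk(string path)
        {
            WIN32_FIND_DATA data;
            IntPtr handle = FindFirstFile(System.IO.Path.Combine(path, "*"), out data);
            if (handle == INVALID_HANDLE_VALUE) yield break;
            try
            {
                do
                {
                    if (data.cFileName == "." || data.cFileName == "..") continue;
                    string full = System.IO.Path.Combine(path, data.cFileName);
                    if ((data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0)
                    {
                        // skip reparse points to avoid infinite cycles
                        if ((data.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) != 0) continue;
                        foreach (string child in Walk(full)) yield return child;
                    }
                    else
                    {
                        yield return full;
                    }
                } while (FindNextFile(handle, out data));
            }
            finally
            {
                FindClose(handle);
            }
        }
    }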

Elemental
  • There is no problem in the .NET implementation, at least none manifesting in my case. I simply want to make this faster. – Marek Jan 18 '10 at 14:58
  • I meant that the .NET implementation was slowing the execution; it was causing a performance problem. – Elemental Jan 18 '10 at 15:26