28

I'm trying to speed up directory enumeration in C++, where I'm recursing into subdirectories. I currently have an app which spends 95% of its time in the FindFirstFile/FindNextFile APIs, and it takes several minutes to enumerate all the files on a given volume. I know it's possible to do this faster, because there is an app that does: Everything. It enumerates my entire drive in seconds.

How might I accomplish something like this?

Billy ONeal
  • Maybe I'm reading too much into it, but this line *"Everything" requires administrative privileges for low level read access to volumes* and this line *"Everything" will only locate files and folders on local NTFS volumes* would lead me to believe that it is using kernel-mode API calls to do its thing, or maybe interfacing with the file system driver directly. – slugster Mar 24 '10 at 22:08
  • @slugster: Yes, that's probably true. +1 The question is, how do I do that? – Billy ONeal Mar 25 '10 at 02:52
  • As an educated guess, I would say they may be using a filter driver. I might actually download it and have a play. I notice they don't mention Win7, which may lend further weight to my speculation, as the driver model changed somewhat with Vista, and then again with Win7. – slugster Mar 25 '10 at 07:10
  • @slugster: I can confirm it does not use a driver, and it works just fine on Win7. – Billy ONeal Mar 25 '10 at 14:17
  • @Billy: [New question on the same topic which got some good answers](http://stackoverflow.com/questions/4203573/finding-a-set-of-file-names-quickly-on-ntfs-volumes-ideally-via-its-mft) – Ben Voigt Dec 30 '10 at 06:31
  • @basj - NTFS-Search (http://sourceforge.net/projects/ntfs-search/) and NTFSParserLib (https://www.codeproject.com/Articles/81456/An-NTFS-Parser-Lib) projects are open source and still work fine today. What more do you need? – Simon Mourier Jun 24 '17 at 08:30
  • @SimonMourier: there seems to be software that uses only the $MFT (master file table), and other software that uses only the USN journal. Both seem to be able to search quickly. A canonical answer would still be useful to clarify this. – Basj Jun 24 '17 at 09:18
  • @basj - Windows Change Journal only contains changes from a certain point in time (https://msdn.microsoft.com/en-us/library/windows/desktop/aa363803.aspx). You need to initialize with $MFT before you can use the journal (NTFS-Search does that perfectly). Is this the "canonical answer" you need? – Simon Mourier Jun 24 '17 at 09:47
  • OT: I'm not sure why, but as far as I tried, `cmd /c dir /s /b` is faster than FindFirstFile. – mattn Jun 28 '17 at 13:24

6 Answers

8

I realize this is an old post, but there is a project on SourceForge that does exactly what you are asking, and the source code is available.

You can find the project here: NTFS-Search

Scott Dillman
7

"Everything" builds an index in the background, so queries are against the index not the file system itself.

There are a few improvements to be made - at least over the straightforward algorithm:

First, breadth-first search over depth-first search. That is, enumerate and process all files in a single folder before recursing into the subfolders you found. This improves locality - usually a lot.

On Windows 7 / W2K8R2, you can use FindFirstFileEx with FindExInfoBasic; the main speedup comes from omitting the short (8.3) file name on NTFS file systems where short-name generation is enabled.
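
A minimal sketch combining the two points above - a queue gives breadth-first order, and FindExInfoBasic / FIND_FIRST_EX_LARGE_FETCH (both Windows 7 / W2K8R2 and later) trim per-entry cost. Assumes a Unicode build:

```cpp
#include <windows.h>

#include <cwchar>
#include <deque>
#include <string>
#include <vector>

// Breadth-first enumeration: finish one directory completely before
// descending into its subdirectories. FindExInfoBasic skips the 8.3
// short-name lookup; FIND_FIRST_EX_LARGE_FETCH asks for larger batches.
std::vector<std::wstring> EnumerateVolume(const std::wstring& root)
{
    std::vector<std::wstring> files;
    std::deque<std::wstring> pending{ root };

    while (!pending.empty())
    {
        std::wstring dir = pending.front();
        pending.pop_front();

        WIN32_FIND_DATAW fd;
        HANDLE h = FindFirstFileExW((dir + L"\\*").c_str(),
                                    FindExInfoBasic, &fd,
                                    FindExSearchNameMatch, nullptr,
                                    FIND_FIRST_EX_LARGE_FETCH);
        if (h == INVALID_HANDLE_VALUE)
            continue;  // e.g. access denied - skip and carry on

        do
        {
            if (wcscmp(fd.cFileName, L".") == 0 ||
                wcscmp(fd.cFileName, L"..") == 0)
                continue;

            std::wstring full = dir + L"\\" + fd.cFileName;
            if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
            {
                // Skip junctions/symlinks to avoid cycles.
                if (!(fd.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT))
                    pending.push_back(full);  // defer: breadth first
            }
            else
            {
                files.push_back(full);
            }
        } while (FindNextFileW(h, &fd));

        FindClose(h);
    }
    return files;
}
```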

Separate threads help if you enumerate different physical disks (not just drives). For the same disk it only helps if it's an SSD ("zero seek time"), or you spend significant time processing a file name (compared to the time spent on disk access).
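
To illustrate the threading point, a worker per volume root - the two drive letters and the call to `EnumerateVolume` from the sketch above are assumptions:

```cpp
#include <string>
#include <thread>
#include <vector>

// One worker per volume. Only worthwhile when the volumes live on
// different physical disks (or on an SSD); on one spindle the threads
// just fight over the disk head.
void EnumerateAll()
{
    std::vector<std::wstring> roots{ L"C:", L"D:" };
    std::vector<std::thread> workers;
    for (const auto& root : roots)
        workers.emplace_back([root] {
            // EnumerateVolume(root);  // see the sketch above
        });
    for (auto& w : workers)
        w.join();
}
```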


[edit] Wikipedia actually has some comments - basically, they skip the file system abstraction layer and access NTFS directly. This way, they can batch calls and skip expensive services of the file system - such as checking ACLs.

A good starting point would be the NTFS Technical Reference on MSDN.

peterchen
  • Definitely reading from an index is quicker, but how do they build/update that index to begin with? Are they fudging their claimed times, or are they doing something special under the hood? – slugster Mar 24 '10 at 22:38
  • The NTFS change journal helps with keeping an index up to date. – Ben Voigt Mar 24 '10 at 23:44
  • @Ben Voigt: Yes, but if I erase the change journal and Everything's cache, start time is still well under a minute. @peterchen: Everything works on Windows XP, so I don't think the new Windows 7 API is the reason for its speed. – Billy ONeal Mar 25 '10 at 02:04
  • @peterchen you need to put your **edit** at the top of this answer. It is the fundamental cause of the speedup. – unixman83 Sep 17 '11 at 02:11
6

"Everything" accesses directory information at a lower level than the Win32 FindFirst/FindNext APIs.

I believe it reads and interprets the NTFS MFT structures directly, and that this is one of the main reasons for its performance. It's also why it requires admin privileges and why "Everything" only indexes local or removable NTFS volumes (not network drives, for example).

A couple of other utilities that do similar things are:

A little reverse engineering with a debugger on these tools might give you some insight into the techniques they use.

Shiro
Michael Burr
  • "A little reverse engineering with a debugger" <-- I'd prefer not to break EULAs if at all possible. Also, I do not know x86 assembler, which would make "debugging" difficult. – Billy ONeal Mar 25 '10 at 02:08
  • @BillyONeal: I can understand both points. As an aside - you don't necessarily need to delve into assembly language to get information. Tools like Dependency Walker (http://www.dependencywalker.com/) and Process Monitor (http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx) can tell you a lot about the APIs that a program uses without diving all the way into disassembly. Whether using these tools might violate a EULA is something I can't tell you. – Michael Burr Mar 25 '10 at 04:51
5

Don't recurse immediately; save a list of the directories you find, and dive into them when finished. You want linear access to each directory, to take advantage of locality of reference and any caching the OS is doing.
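
A sketch of that pattern with C++17 `std::filesystem` (which didn't exist when this was written, but shows the deferred-recursion idea compactly):

```cpp
#include <deque>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Scan one directory completely, queueing subdirectories for later,
// instead of recursing the moment one is found.
void ScanLinearly(const fs::path& root)
{
    std::deque<fs::path> pending{ root };
    while (!pending.empty())
    {
        fs::path dir = pending.front();
        pending.pop_front();

        std::error_code ec;
        for (const auto& entry : fs::directory_iterator(dir, ec))
        {
            if (entry.is_directory(ec))
                pending.push_back(entry.path());  // dive in later
            // ... process entry.path() here ...
        }
    }
}
```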

Mark Ransom
2

If you're already doing the best you can to get the maximum speed from the API, the next step is to do low-level disk accesses and bypass Windows altogether. You might get some guidance from the NTFS drivers for Linux, or perhaps you can use one directly.
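
A starting point for the low-level route - opening the raw volume handle with CreateFile. Everything beyond this handle (parsing the boot sector, locating the $MFT) you'd have to implement yourself or borrow from an NTFS library:

```cpp
#include <windows.h>

#include <cstdio>

int main()
{
    // "\\.\C:" opens the volume itself rather than the file system on it.
    // Requires administrator rights; reads must be sector-aligned.
    HANDLE volume = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                                FILE_SHARE_READ | FILE_SHARE_WRITE,
                                nullptr, OPEN_EXISTING, 0, nullptr);
    if (volume == INVALID_HANDLE_VALUE)
    {
        std::printf("open failed: %lu\n", GetLastError());
        return 1;
    }

    // The first 512 bytes are the NTFS boot sector; it records the
    // cluster size and the starting cluster of the $MFT.
    BYTE bootSector[512];
    DWORD read = 0;
    if (ReadFile(volume, bootSector, sizeof bootSector, &read, nullptr))
        std::printf("read %lu bytes of the boot sector\n", read);

    CloseHandle(volume);
    return 0;
}
```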

Mark Ransom
1

If you are doing this on NTFS, here's a library for low-level access: NTFSLib.

You can enumerate all the file records in the $MFT, each of which represents a real file on disk. You can get all the file attributes from a record, including $DATA.

This may be the fastest way to enumerate all files/directories on NTFS volumes - 200k~300k files per minute in my tests.
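
I can't speak to NTFSLib's exact API, but a sketch of the same idea through documented Win32 is the FSCTL_ENUM_USN_DATA ioctl, which hands back raw MFT records in bulk without walking the directory tree (the input struct is named MFT_ENUM_DATA on older SDKs). Requires administrator rights:

```cpp
#include <windows.h>
#include <winioctl.h>

#include <cstdio>
#include <vector>

int main()
{
    HANDLE volume = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                                FILE_SHARE_READ | FILE_SHARE_WRITE,
                                nullptr, OPEN_EXISTING, 0, nullptr);
    if (volume == INVALID_HANDLE_VALUE)
        return 1;

    // Ask for every MFT record, from the first file reference onward.
    MFT_ENUM_DATA_V0 med = { 0, 0, MAXLONGLONG };
    std::vector<BYTE> buf(64 * 1024);
    DWORD bytes = 0;

    while (DeviceIoControl(volume, FSCTL_ENUM_USN_DATA,
                           &med, sizeof med,
                           buf.data(), (DWORD)buf.size(),
                           &bytes, nullptr))
    {
        // Output layout: a 64-bit "resume from here" file reference,
        // then tightly packed USN_RECORD structures.
        auto* rec = (USN_RECORD*)(buf.data() + sizeof(DWORDLONG));
        while ((BYTE*)rec < buf.data() + bytes)
        {
            std::wprintf(L"%.*ls\n",
                         (int)(rec->FileNameLength / sizeof(WCHAR)),
                         (WCHAR*)((BYTE*)rec + rec->FileNameOffset));
            rec = (USN_RECORD*)((BYTE*)rec + rec->RecordLength);
        }
        med.StartFileReferenceNumber = *(DWORDLONG*)buf.data();
    }

    CloseHandle(volume);
    return 0;
}
```

Note that this gives you names and parent references only; to resolve full paths you'd build a map from file reference numbers to names and walk it upward.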

S. Liu
  • GPL though :( Not an option for this project as it uses a liberal license (namely, the Boost Software License; though I am considering switching to 2-clause BSD...) – Billy ONeal Sep 07 '12 at 17:42
  • @BillyONeal: apparently that library is now under BSD license. But I do remember precluding it as a choice for my own projects on the basis of being GPL some time ago. – 0xC0000022L Oct 18 '18 at 14:55