
I'm enumerating the files of an NTFS hard drive partition by looking at the NTFS MFT / USN journal with:

HANDLE hDrive = CreateFile(szVolumePath, GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING, 0, NULL);
DWORD cb = 0;

MFT_ENUM_DATA med = { 0 };
med.StartFileReferenceNumber = 0;
med.LowUsn = 0;
med.HighUsn = MAXLONGLONG;      // no change in perf if I use med.HighUsn = ujd.NextUsn; where "USN_JOURNAL_DATA ujd" is loaded beforehand

unsigned char pData[sizeof(DWORDLONG) + 0x10000] = { 0 }; // 64 kB output buffer

while (DeviceIoControl(hDrive, FSCTL_ENUM_USN_DATA, &med, sizeof(med), pData, sizeof(pData), &cb, NULL))
{
    med.StartFileReferenceNumber = *((DWORDLONG*) pData);    // pData starts with the FRN to pass to the next FSCTL_ENUM_USN_DATA call

    // here normally we would do: PUSN_RECORD pRecord = (PUSN_RECORD) (pData + sizeof(DWORDLONG));
    // and a second loop to extract the actual filenames,
    // but I removed this because the real performance bottleneck
    // is DeviceIoControl(hDrive, FSCTL_ENUM_USN_DATA, ...)
}
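
For completeness, here is a minimal sketch of the inner loop I removed (assuming the records come back in the classic USN_RECORD / V2 layout); it extracts only FileReferenceNumber, ParentFileReferenceNumber and the filename:

PUSN_RECORD pRecord = (PUSN_RECORD) (pData + sizeof(DWORDLONG));
while ((unsigned char*) pRecord < pData + cb)
{
    DWORDLONG frn       = pRecord->FileReferenceNumber;         // the file itself
    DWORDLONG parentFrn = pRecord->ParentFileReferenceNumber;   // its parent folder
    // FileName is not null-terminated; FileNameLength is a byte count
    PWCHAR pName = (PWCHAR) ((unsigned char*) pRecord + pRecord->FileNameOffset);
    int cchName  = pRecord->FileNameLength / sizeof(WCHAR);
    // ... store (frn, parentFrn, pName, cchName) somewhere ...
    pRecord = (PUSN_RECORD) ((unsigned char*) pRecord + pRecord->RecordLength);
}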

It works, and it is much faster than the usual FindFirstFile enumeration techniques, but I can see it's not optimal yet:

  • On my C:\ with ~700k files, it takes 21 seconds. (This measurement has to be done right after a reboot; otherwise it is skewed by caching.)

  • I have seen another indexing program (not Everything, another one) index C:\ in under 5 seconds (measured right after Windows startup), without reading a pre-computed database from a .db file (or other similar tricks that could speed things up!). That software does not use FSCTL_ENUM_USN_DATA, but low-level NTFS parsing instead.

What I've tried to improve performance:

  • Increasing the output buffer size: 4 kB, 64 kB, 1 MB (with `#pragma comment(linker, "/STACK:2000000")` to allow a bigger stack array), and even a 100 MB `malloc`-ed buffer: same result, still ~21 seconds (a minimal heap-buffer sketch follows this list).
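
For reference, a minimal sketch of the heap-allocated variant mentioned above (BUF_SIZE is just an illustrative name; the rest of the loop is unchanged from the code shown earlier):

const size_t BUF_SIZE = 100 * 1024 * 1024;                  // 100 MB output buffer
unsigned char* pData = (unsigned char*) malloc(BUF_SIZE);   // heap instead of stack
memset(pData, 0, BUF_SIZE);

while (DeviceIoControl(hDrive, FSCTL_ENUM_USN_DATA, &med, sizeof(med), pData, (DWORD) BUF_SIZE, &cb, NULL))
{
    med.StartFileReferenceNumber = *((DWORDLONG*) pData);
    // ... walk the returned records as usual ...
}

free(pData);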

Question:

Is it possible to improve the performance of DeviceIoControl(hDrive, FSCTL_ENUM_USN_DATA, ...)?

Or is low-level manual parsing of the NTFS structures the only way to improve performance?


Note: According to my tests, the total amount of data read during these DeviceIoControl(hDrive, FSCTL_ENUM_USN_DATA, ...) calls for my 700k files is only 84 MB. 21 seconds to read 84 MB is only 4 MB/sec (and I do have an SSD!). There is probably some room for performance improvement, don't you think?

  • The most obvious thing to try is to increase the buffer size to reduce the number of round trips. But I don't think it will boost performance much; the bottleneck is most likely in converting the records from whatever the underlying format is to the USN_RECORD_Vx structure. – Harry Johnston Jul 19 '17 at 02:13
  • Thanks @HarryJohnston. I have tried a 4 kB buffer size, 64 kB, 1 MB (with `#pragma comment(linker, "/STACK:2000000")`), and even a 100 MB `malloc`-ed array, and it's the same: ~21 seconds. – Basj Jul 19 '17 at 07:50
  • You're probably right @HarryJohnston, the bottleneck seems to be the conversion to the USN_RECORD_Vx structure. Do you think there's a way to force a "lighter" conversion (I don't care about most of the information; I just need FileReferenceNumber, the parent folder, and the filename)? I don't see how that's possible here: https://msdn.microsoft.com/en-us/library/windows/desktop/aa364563(v=vs.85).aspx – Basj Jul 19 '17 at 08:19
  • *"This software does not use `FSCTL_ENUM_USN_DATA`, but low-level NTFS parsing instead."* - Isn't that the answer to your question? – IInspectable Jul 19 '17 at 10:49
  • @IInspectable I was hoping there was some solution in the middle between 1. `FSCTL_ENUM_USN_DATA`, which has "ok performance" and is easy to code, and 2. super mega fast low-level NTFS parsing that would require one week full-time to get working... Do you think this middle ground exists somewhere? – Basj Jul 19 '17 at 11:27
  • The journal is a sparse file, and I don't know what the implications are for reading a range of rows that just don't exist. It's possible that you're spending all your time reading vast empty space...or triggering something grossly inefficient to get past the empty spaces. What I do...and I do it in c#...and it takes just a few seconds...is to call `FSCTL_QUERY_USN_JOURNAL` to get the extents of valid records...into a `USN_JOURNAL_DATA_V1` structure...and use that data to seed the first call to `FSCTL_ENUM_USN_DATA` (see the sketch after these comments). Maybe that'll help...dunno. – Clay Feb 23 '19 at 19:41
  • As Basj said, this issue (DeviceIoControl being slow) happens only the very first time you run the FSCTL_ENUM_USN_DATA scan (for example, after a reboot). I wrote code (C#) to read the USN journal to get all the files in a volume, and I noticed that this scan is very fast on Win10 even at the first launch (2 seconds to get and sort 900k files), but slow on Win7 (first run only). @Clay: if your experience was different, why don't you show the code you used to get those fast accesses? Please share with us. Thanks. – radiolondra Mar 18 '19 at 16:43
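
For illustration, a minimal C sketch of the seeding Clay describes (he does it in C#; this version reuses the question's hDrive handle and the USN_JOURNAL_DATA / MFT_ENUM_DATA structures already used above, whereas Clay mentions USN_JOURNAL_DATA_V1). Note that, per the comment on med.HighUsn in the question, this made no measurable difference on the asker's machine:

USN_JOURNAL_DATA ujd = { 0 };
DWORD cbJournal = 0;
if (DeviceIoControl(hDrive, FSCTL_QUERY_USN_JOURNAL, NULL, 0, &ujd, sizeof(ujd), &cbJournal, NULL))
{
    MFT_ENUM_DATA med = { 0 };
    med.StartFileReferenceNumber = 0;
    med.LowUsn  = 0;               // include files that never got a journal record
    med.HighUsn = ujd.NextUsn;     // don't request USNs beyond the current end of the journal
    // ... then run the FSCTL_ENUM_USN_DATA loop from the question with this med ...
}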

0 Answers