
I need to checksum every file on a given USB disk from a C# application. I suspect the bottleneck is the actual read off the disk, so I'm looking to make this as fast as possible.

I suspect this would be much quicker if I could read the files on the disk sequentially, in the actual order they appear on the disk (assuming the drive is not fragmented).

How can I find this information for each file from its standard path? I.e. given a file at "F:\MyFile.txt", how can I find the start location of this file on the disk?

I'm running a C# application in Windows.

GoldieLocks
  • You could http://stackoverflow.com/questions/11934550/get-file-offset-on-disk-cluster-number ... Note that I don't think it is a good idea... Correct that... I **do** think it is a **bad idea** – xanatos Apr 13 '15 at 12:20
  • @xanatos why is it a bad idea? – oleksii Apr 13 '15 at 12:38
  • @oleksii I'm not even sure what permissions you do need... and doing something low-level with disks only because high level methods are slow is not something I would ever do. There is a reason there are high level methods and low level methods. – xanatos Apr 13 '15 at 12:40
  • @xanatos wd-3.com is a "bad" site, I suggest to remove that link. – andreaplanet Feb 24 '19 at 16:46
  • 1
    @andreaplanet Removed the link... it expired :-( Replaced the link in response with https://web.archive.org/web/20160130161216/http://www.wd-3.com/archive/luserland.htm – xanatos Feb 24 '19 at 17:41

2 Answers


If your drives are SSDs or based on memory-stick technology, forget it.

Memory sticks and similar devices are generally built on flash (SSD-like) technology, where random read access carries no seek penalty, so the physical order of files on the device doesn't matter. You can simply enumerate the files and run your checksum.

You can try running this in several threads, but I'm not sure that would speed up the process; it's something you'd need to test, and it may vary from device to device.
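For example, here is a minimal sketch of a parallel variant (the drive letter and the choice of SHA256 are placeholders of mine, not from the question):

using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

class ParallelChecksum
{
    static void Main()
    {
        string[] files = Directory.GetFiles(@"F:\", "*", SearchOption.AllDirectories);

        // Each iteration creates its own hash object: SHA256 instances are not thread-safe.
        Parallel.ForEach(files, fileName =>
        {
            using (var sha = SHA256.Create())
            using (var stream = File.OpenRead(fileName))
            {
                byte[] hash = sha.ComputeHash(stream);
                Console.WriteLine("{0}: {1}", fileName, BitConverter.ToString(hash));
            }
        });
    }
}

Benchmark this against a plain foreach before committing to it; on a single spinning disk the extra seeks can easily make it slower.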

Bonus
@xanatos mentioned an interesting point: "I always noticed that copying thousands of files on a memory stick is much slower than copying a single big file"

It is indeed much faster to copy one big file than a pile of small files, and the reason is (usually) not that the files are located close to each other, making them easy for the hardware to read sequentially. The real cost is the bookkeeping the OS has to do for each file.

If you ever run Procmon on Windows, you will observe a huge number of FileCreate, FileRead and FileWrite operations. To copy 100 files, the OS opens each file, reads its content, writes it to another file and closes both files, plus lots of housekeeping operations sent to the file system: updating attributes for both files, updating security descriptors for both files, updating directory information, and so on. One copy operation drags along many satellite operations.
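If you want to see that overhead yourself, a rough micro-benchmark along these lines (hypothetical, with arbitrary file sizes and counts) makes the difference visible:

using System;
using System.Diagnostics;
using System.IO;

class CopyOverhead
{
    static void Main()
    {
        string src = Path.Combine(Path.GetTempPath(), "copy_src");
        string dst = Path.Combine(Path.GetTempPath(), "copy_dst");
        Directory.CreateDirectory(src);
        Directory.CreateDirectory(dst);

        // Same total payload: 1000 files of 10 KB vs one file of ~10 MB.
        byte[] small = new byte[10 * 1024];
        for (int i = 0; i < 1000; i++)
            File.WriteAllBytes(Path.Combine(src, i + ".bin"), small);
        File.WriteAllBytes(Path.Combine(src, "big.bin"), new byte[10 * 1024 * 1024]);

        var sw = Stopwatch.StartNew();
        foreach (string f in Directory.GetFiles(src, "*.bin"))
            if (Path.GetFileName(f) != "big.bin")
                File.Copy(f, Path.Combine(dst, Path.GetFileName(f)), true);
        Console.WriteLine("1000 small files: {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        File.Copy(Path.Combine(src, "big.bin"), Path.Combine(dst, "big.bin"), true);
        Console.WriteLine("One big file: {0} ms", sw.ElapsedMilliseconds);
    }
}

The per-file gap will be smaller on a warm OS cache than on a real USB stick, but the shape of the result is the same.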

oleksii
  • You are probably right, but from empirical testing, I always noticed that copying thousands of files on a memory stick is much slower than copying a single big file... – xanatos Apr 13 '15 at 12:24
  • The drives that will be used for this are all 1TB 2.5 inch standard non-SSD USB3 hard drives, if that makes a difference... – GoldieLocks Apr 13 '15 at 12:33
  • @GoldieLocks Yes it matters then. I will update my answer - it only applies to ssd and mem sticks. I don't know how to read files sequentially (on the low level) on spinning disks. – oleksii Apr 13 '15 at 12:37

Now... I don't really know if it will be useful for you:

using System;
using System.ComponentModel;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// Note: place these declarations and the methods below inside a single class.

// Maps to the native STARTING_VCN_INPUT_BUFFER structure.
[StructLayout(LayoutKind.Sequential)]
public struct StartingVcnInputBuffer
{
    public long StartingVcn;
}

public static readonly int StartingVcnInputBufferSizeOf = Marshal.SizeOf(typeof(StartingVcnInputBuffer));

// Maps to the native RETRIEVAL_POINTERS_BUFFER structure, truncated to the first extent.
[StructLayout(LayoutKind.Sequential)]
public struct RetrievalPointersBuffer
{
    public uint ExtentCount;
    public long StartingVcn;
    public long NextVcn;
    public long Lcn;
}

public static readonly int RetrievalPointersBufferSizeOf = Marshal.SizeOf(typeof(RetrievalPointersBuffer));

[DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
public static extern SafeFileHandle CreateFileW(
        [MarshalAs(UnmanagedType.LPWStr)] string filename,
        [MarshalAs(UnmanagedType.U4)] FileAccess access,
        [MarshalAs(UnmanagedType.U4)] FileShare share,
        IntPtr securityAttributes,
        [MarshalAs(UnmanagedType.U4)] FileMode creationDisposition,
        [MarshalAs(UnmanagedType.U4)] FileAttributes flagsAndAttributes,
        IntPtr templateFile);

[DllImport("kernel32.dll", ExactSpelling = true, SetLastError = true, CharSet = CharSet.Auto)]
static extern bool DeviceIoControl(IntPtr hDevice, uint dwIoControlCode,
    ref StartingVcnInputBuffer lpInBuffer, int nInBufferSize,
    out RetrievalPointersBuffer lpOutBuffer, int nOutBufferSize,
    out int lpBytesReturned, IntPtr lpOverlapped);

// Returns a FileStream that can only Read
public static void GetStartLogicalClusterNumber(string fileName, out FileStream file, out long startLogicalClusterNumber)
{
    SafeFileHandle handle = CreateFileW(fileName, FileAccess.Read | (FileAccess)0x80 /* FILE_READ_ATTRIBUTES */, FileShare.Read, IntPtr.Zero, FileMode.Open, 0, IntPtr.Zero);

    if (handle.IsInvalid)
    {
        throw new Win32Exception();
    }

    file = new FileStream(handle, FileAccess.Read);

    // StartingVcn = 0: begin at the first cluster of the file.
    var svib = new StartingVcnInputBuffer();

    RetrievalPointersBuffer rpb;

    int bytesReturned;
    bool ok = DeviceIoControl(handle.DangerousGetHandle(), (uint)589939 /* FSCTL_GET_RETRIEVAL_POINTERS */, ref svib, StartingVcnInputBufferSizeOf, out rpb, RetrievalPointersBufferSizeOf, out bytesReturned, IntPtr.Zero);

    // GetLastWin32Error is only meaningful when DeviceIoControl fails; checking the
    // bool result avoids picking up a stale error code on success.
    int error = ok ? 0 : Marshal.GetLastWin32Error();

    switch (error)
    {
        case 38: /* ERROR_HANDLE_EOF */
            startLogicalClusterNumber = -1; // empty (or MFT-resident) file. Choose how to handle
            break;

        case 0: /* NO_ERROR */
        case 234: /* ERROR_MORE_DATA */
            startLogicalClusterNumber = rpb.Lcn;
            break;

        default:
            throw new Win32Exception();
    }
}

Note that the method returns a FileStream that you can keep open and use to read the file; alternatively, you can easily modify it to not create or return one, and reopen the file when you want to hash it.

To use:

string[] fileNames = Directory.GetFiles(@"D:\");

foreach (string fileName in fileNames)
{
    try
    {
        long startLogicalClusterNumber;
        FileStream file;
        GetStartLogicalClusterNumber(fileName, out file, out startLogicalClusterNumber);
        file.Dispose(); // or keep the stream open and use it for hashing
    }
    catch (Exception e)
    {
        Console.WriteLine("Skipping: {0} for {1}", fileName, e.Message);
    }
}

I'm using the API described here: https://web.archive.org/web/20160130161216/http://www.wd-3.com/archive/luserland.htm . The program is much simpler because you only need the initial Logical Cluster Number: a first version of this code extracted all the LCN extents, but that is unnecessary, because you have to hash a file from its first byte to its last anyway. Note that empty files (files with length 0) don't have any cluster allocated; for them the call fails with ERROR_HANDLE_EOF and the method returns -1 for the cluster. You can choose how to handle that.
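To tie this back to the question, here is a minimal sketch of how you could use it: collect the starting LCN of every file, sort by it, then hash the files in on-disk order. The sorting strategy and the use of SHA256 are my assumptions, not part of the API, and it additionally needs using System.Collections.Generic and using System.Security.Cryptography:

static void ChecksumInDiskOrder(string root)
{
    var entries = new List<Tuple<string, long>>();

    foreach (string fileName in Directory.GetFiles(root, "*", SearchOption.AllDirectories))
    {
        try
        {
            long lcn;
            FileStream file;
            GetStartLogicalClusterNumber(fileName, out file, out lcn);
            file.Dispose(); // reopen later, in sorted order
            entries.Add(Tuple.Create(fileName, lcn));
        }
        catch (Exception e)
        {
            Console.WriteLine("Skipping: {0} for {1}", fileName, e.Message);
        }
    }

    // Empty/MFT-resident files (LCN -1) sort first; everything else follows disk order.
    entries.Sort((a, b) => a.Item2.CompareTo(b.Item2));

    using (var sha = SHA256.Create())
    {
        foreach (var entry in entries)
        {
            using (var stream = File.OpenRead(entry.Item1))
            {
                Console.WriteLine("{0}: {1}", entry.Item1, BitConverter.ToString(sha.ComputeHash(stream)));
            }
        }
    }
}

Whether sorting by the first extent actually helps depends on fragmentation; a heavily fragmented file still forces the head to seek mid-file.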

xanatos
  • Very, very helpful, thank you! I'd only just started reading through that article, let alone understanding all of it... – GoldieLocks Apr 14 '15 at 14:26
  • The only question I have is that it doesn't just seem to be empty files that return -1, but files which are very small... I understand the logic around empty files not having a cluster, how does this affect very small files? I'm talking 1 line of text, 96 bytes, txt file. – GoldieLocks Apr 14 '15 at 14:27
  • @GoldieLocks http://www.ntfs.com/ntfs_optimization.htm *On NTFS if file is small enough, it can be stored in MFT record itself without using additional clusters.* – xanatos Apr 14 '15 at 14:34