I need to write a Powershell snippet that finds the full path(s) for a given filename over a complete partition as fast as possible.
For the sake of better comparison, I am using this global variables for my code-samples:
$searchDir = "c:\"
$searchName = "hosts"
I started with a small snippet using Get-ChildItem to have a first baseline:
"get-ChildItem"
$timer = [System.Diagnostics.Stopwatch]::StartNew()
$result = Get-ChildItem -LiteralPath $searchDir -Filter $searchName -File -Recurse -ea 0
write-host $timer.Elapsed.TotalSeconds "sec."
The runtime on my SSD was 14,8581609 sec.
Next, I tried running the classical DIR-command to see the improvements:
"dir"
$timer = [System.Diagnostics.Stopwatch]::StartNew()
$result = &cmd /c dir "$searchDir$searchName" /b /s /a-d
$timer.Stop()
write-host $timer.Elapsed.TotalSeconds "sec."
This finished in 13,4713342 sec. - not bad, but can we get it faster?
In the third iteration I was testing the same task with ROBOCOPY. Here the code-sample:
"robocopy"
$timer = [System.Diagnostics.Stopwatch]::StartNew()
$roboDir = [System.IO.Path]::GetDirectoryName($searchDir)
if (!$roboDir) {$roboDir = $searchDir.Substring(0,2)}
$info = [System.Diagnostics.ProcessStartInfo]::new()
$info.FileName = "$env:windir\system32\robocopy.exe"
$info.RedirectStandardOutput = $true
$info.Arguments = " /l ""$roboDir"" null ""$searchName"" /bytes /njh /njs /np /nc /ndl /xjd /mt /s"
$info.UseShellExecute = $false
$info.CreateNoWindow = $true
$info.WorkingDirectory = $searchDir
$process = [System.Diagnostics.Process]::new()
$process.StartInfo = $info
[void]$process.Start()
$process.WaitForExit()
$timer.Stop()
write-host $timer.Elapsed.TotalSeconds "sec."
Or in a shorter version (based on the good comments):
"robocopy v2"
$timer = [System.Diagnostics.Stopwatch]::StartNew()
$fileList = (&cmd /c pushd $searchDir `& robocopy /l "$searchDir" null "$searchName" /ns /njh /njs /np /nc /ndl /xjd /mt /s).trim() -ne ''
$timer.Stop()
write-host $timer.Elapsed.TotalSeconds "sec."
Was it faster than DIR? Yes, absolutely! The runtime is now down to 3,2685551 sec. Main reason for this huge improvement is the fact, that ROBOCOPY runs with the /mt-swich in multitask-mode in multiple parallel instances. But even without this turbo-switch is was faster than DIR.
Mission accomplished? Not really - because my task was, to create a powershell-script searching a file as fast as possible, but calling ROBOCOPY is a bit of cheating.
Next, I want to see, how fast we will be by using [System.IO.Directory]. First try was by using getFiles and getDirectory-calls. Here my code:
"GetFiles"
$timer = [System.Diagnostics.Stopwatch]::StartNew()
$fileList = [System.Collections.Generic.List[string]]::new()
$dirList = [System.Collections.Generic.Queue[string]]::new()
$dirList.Enqueue($searchDir)
while ($dirList.Count -ne 0) {
$dir = $dirList.Dequeue()
try {
$files = [System.IO.Directory]::GetFiles($dir, $searchName)
if ($files) {$fileList.addRange($file)}
foreach($subdir in [System.IO.Directory]::GetDirectories($dir)) {
$dirList.Enqueue($subDir)
}
} catch {}
}
$timer.Stop()
write-host $timer.Elapsed.TotalSeconds "sec."
This time the runtime was 19,3393872 sec. By far the slowest code. Can we get it better? Here now a code-snippet with Enumeration-calls for comparison:
"EnumerateFiles"
$timer = [System.Diagnostics.Stopwatch]::StartNew()
$fileList = [System.Collections.Generic.List[string]]::new()
$dirList = [System.Collections.Generic.Queue[string]]::new()
$dirList.Enqueue($searchDir)
while ($dirList.Count -ne 0) {
$dir = $dirList.Dequeue()
try {
foreach($file in [System.IO.Directory]::EnumerateFiles($dir, $searchName)) {
$fileList.add($file)
}
foreach ($subdir in [System.IO.Directory]::EnumerateDirectories($dir)) {
$dirList.Enqueue($subDir)
}
} catch {}
}
$timer.Stop()
write-host $timer.Elapsed.TotalSeconds "sec."
It was only slighly faster with a runtime of 19,2068545 sec.
Now let's see if we can get it faster with direct WinAPI-calls from Kernel32. Here the code. Let's see, how fast it is this time:
"WinAPI"
add-type -Name FileSearch -Namespace Win32 -MemberDefinition @"
public struct WIN32_FIND_DATA {
public uint dwFileAttributes;
public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
public uint nFileSizeHigh;
public uint nFileSizeLow;
public uint dwReserved0;
public uint dwReserved1;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
public string cFileName;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
public string cAlternateFileName;
}
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Ansi)]
public static extern IntPtr FindFirstFile
(string lpFileName, out WIN32_FIND_DATA lpFindFileData);
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Ansi)]
public static extern bool FindNextFile
(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Ansi)]
public static extern bool FindClose(IntPtr hFindFile);
"@
$rootDir = 'c:'
$searchFile = "hosts"
$fileList = [System.Collections.Generic.List[string]]::new()
$dirList = [System.Collections.Generic.Queue[string]]::new()
$dirList.Enqueue($rootDir)
$timer = [System.Diagnostics.Stopwatch]::StartNew()
$fileData = new-object Win32.FileSearch+WIN32_FIND_DATA
while ($dirList.Count -ne 0) {
$dir = $dirList.Dequeue()
$handle = [Win32.FileSearch]::FindFirstFile("$dir\*", [ref]$fileData)
[void][Win32.FileSearch]::FindNextFile($handle, [ref]$fileData)
while ([Win32.FileSearch]::FindNextFile($handle, [ref]$fileData)) {
if ($fileData.dwFileAttributes -band 0x10) {
$fullName = [string]::Join('\', $dir, $fileData.cFileName)
$dirList.Enqueue($fullName)
} elseif ($fileData.cFileName -eq $searchFile) {
$fullName = [string]::Join('\', $dir, $fileData.cFileName)
$fileList.Add($fullName)
}
}
[void][Win32.FileSearch]::FindClose($handle)
}
$timer.Stop()
write-host $timer.Elapsed.TotalSeconds "sec."
For me, the result of this approach was quite a negative surprise. The runtime is 17,499286 sec. This is faster than the System.IO-calls but still slower than a simple Get-ChildItem.
But - there is still hope to come close to the super-fast result from ROBOCOPY! For Get-ChildItem we cannot make the call being executes in multi-tasking mode, but for e.g. the Kernel32-calls we have the option to make this a recursive function an call each iteration over all subfolders in a PARALLEL foreach-loop via embedded C#-code. But how to do that?
Does someone know how to change the last code-snippet to use parallel.foreach? Even if the result might not be that fast as ROBOCOPY I would like to post also this approach here to have a full storybook for this classic "file search" topic.
Please let me know, how to do the parallel code-part.
Update: For completeness I am adding the code and runtime of the GetFiles-code running on Powershell 7 with smarter access-handling:
"GetFiles PS7"
$timer = [System.Diagnostics.Stopwatch]::StartNew()
$fileList = [system.IO.Directory]::GetFiles(
$searchDir,
$searchFile,
[IO.EnumerationOptions] @{AttributesToSkip = 'ReparsePoint'; RecurseSubdirectories = $true; IgnoreInaccessible = $true}
)
$timer.Stop()
write-host $timer.Elapsed.TotalSeconds "sec."
The runtime on my system was 9,150673 sec. - faster than DIR, but still slower than robocopy with multi-tasking on 8 cores.
Update #2: After playing around with the new PS7-features I came up with this code-snippet which uses my first (but ugly?) parallel code-approach:
"WinAPI PS7 parallel"
$searchDir = "c:\"
$searchFile = "hosts"
add-type -Name FileSearch -Namespace Win32 -MemberDefinition @"
public struct WIN32_FIND_DATA {
public uint dwFileAttributes;
public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
public uint nFileSizeHigh;
public uint nFileSizeLow;
public uint dwReserved0;
public uint dwReserved1;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
public string cFileName;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
public string cAlternateFileName;
}
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Ansi)]
public static extern IntPtr FindFirstFile
(string lpFileName, out WIN32_FIND_DATA lpFindFileData);
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Ansi)]
public static extern bool FindNextFile
(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Ansi)]
public static extern bool FindClose(IntPtr hFindFile);
"@
$rootDir = $searchDir -replace "\\$"
$maxRunSpaces = [int]$env:NUMBER_OF_PROCESSORS
$fileList = [System.Collections.Concurrent.BlockingCollection[string]]::new()
$dirList = [System.Collections.Concurrent.BlockingCollection[string]]::new()
$dirList.Add($rootDir)
$timer = [System.Diagnostics.Stopwatch]::StartNew()
(1..$maxRunSpaces) | ForEach-Object -ThrottleLimit $maxRunSpaces -Parallel {
$dirList = $using:dirList
$fileList = $using:fileList
$fileData = new-object Win32.FileSearch+WIN32_FIND_DATA
$dir = $null
if ($_ -eq 1) {$delay = 0} else {$delay = 50}
if ($dirList.TryTake([ref]$dir, $delay)) {
do {
$handle = [Win32.FileSearch]::FindFirstFile("$dir\*", [ref]$fileData)
[void][Win32.FileSearch]::FindNextFile($handle, [ref]$fileData)
while ([Win32.FileSearch]::FindNextFile($handle, [ref]$fileData)) {
if ($fileData.dwFileAttributes -band 0x10) {
$fullName = [string]::Join('\', $dir, $fileData.cFileName)
$dirList.Add($fullName)
} elseif ($fileData.cFileName -eq $using:searchFile) {
$fullName = [string]::Join('\', $dir, $fileData.cFileName)
$fileList.Add($fullName)
}
}
[void][Win32.FileSearch]::FindClose($handle)
} until (!$dirList.TryTake([ref]$dir))
}
}
$timer.Stop()
write-host $timer.Elapsed.TotalSeconds "sec."
The runtime is now very close to the robocopy-timing. It is actually 4,0809719 sec.
Not bad, but I am still looking for a solution with a parallel.foreach-approach via embedded C# code to make it work also for Powershell v5.
Update #3: Here is now my final code for Powershell 5 running in parallel runspaces:
$searchDir = "c:\"
$searchFile = "hosts"
"WinAPI parallel"
add-type -Name FileSearch -Namespace Win32 -MemberDefinition @"
public struct WIN32_FIND_DATA {
public uint dwFileAttributes;
public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
public uint nFileSizeHigh;
public uint nFileSizeLow;
public uint dwReserved0;
public uint dwReserved1;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
public string cFileName;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
public string cAlternateFileName;
}
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Ansi)]
public static extern IntPtr FindFirstFile
(string lpFileName, out WIN32_FIND_DATA lpFindFileData);
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Ansi)]
public static extern bool FindNextFile
(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Ansi)]
public static extern bool FindClose(IntPtr hFindFile);
"@
$rootDir = $searchDir -replace "\\$"
$maxRunSpaces = [int]$env:NUMBER_OF_PROCESSORS
$fileList = [System.Collections.Concurrent.BlockingCollection[string]]::new()
$dirList = [System.Collections.Concurrent.BlockingCollection[string]]::new()
$dirList.Add($rootDir)
$timer = [System.Diagnostics.Stopwatch]::StartNew()
$runSpaceList = [System.Collections.Generic.List[PSObject]]::new()
$pool = [RunSpaceFactory]::CreateRunspacePool(1, $maxRunSpaces)
$pool.Open()
foreach ($id in 1..$maxRunSpaces) {
$runSpace = [Powershell]::Create()
$runSpace.RunspacePool = $pool
[void]$runSpace.AddScript({
Param (
[string]$searchFile,
[System.Collections.Concurrent.BlockingCollection[string]]$dirList,
[System.Collections.Concurrent.BlockingCollection[string]]$fileList
)
$fileData = new-object Win32.FileSearch+WIN32_FIND_DATA
$dir = $null
if ($id -eq 1) {$delay = 0} else {$delay = 50}
if ($dirList.TryTake([ref]$dir, $delay)) {
do {
$handle = [Win32.FileSearch]::FindFirstFile("$dir\*", [ref]$fileData)
[void][Win32.FileSearch]::FindNextFile($handle, [ref]$fileData)
while ([Win32.FileSearch]::FindNextFile($handle, [ref]$fileData)) {
if ($fileData.dwFileAttributes -band 0x10) {
$fullName = [string]::Join('\', $dir, $fileData.cFileName)
$dirList.Add($fullName)
} elseif ($fileData.cFileName -like $searchFile) {
$fullName = [string]::Join('\', $dir, $fileData.cFileName)
$fileList.Add($fullName)
}
}
[void][Win32.FileSearch]::FindClose($handle)
} until (!$dirList.TryTake([ref]$dir))
}
})
[void]$runSpace.addArgument($searchFile)
[void]$runSpace.addArgument($dirList)
[void]$runSpace.addArgument($fileList)
$status = $runSpace.BeginInvoke()
$runSpaceList.Add([PSCustomObject]@{Name = $id; RunSpace = $runSpace; Status = $status})
}
while ($runSpaceList.Status.IsCompleted -notcontains $true) {sleep -Milliseconds 10}
$pool.Close()
$pool.Dispose()
$timer.Stop()
$fileList
write-host $timer.Elapsed.TotalSeconds "sec."
The overall runtime with 4,8586134 sec. is a bit slower than the PS7-version, but still much faster than any DIR or Get-ChildItem variation. ;-)
Final Solution: Finally I was able to answer my own question. Here is the final code:
"WinAPI parallel.foreach"
add-type -TypeDefinition @"
using System;
using System.IO;
using System.Collections;
using System.Collections.Generic;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;
using System.Text.RegularExpressions;
public class FileSearch {
public struct WIN32_FIND_DATA {
public uint dwFileAttributes;
public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
public uint nFileSizeHigh;
public uint nFileSizeLow;
public uint dwReserved0;
public uint dwReserved1;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
public string cFileName;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
public string cAlternateFileName;
}
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Ansi)]
public static extern IntPtr FindFirstFile
(string lpFileName, out WIN32_FIND_DATA lpFindFileData);
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Ansi)]
public static extern bool FindNextFile
(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Ansi)]
public static extern bool FindClose(IntPtr hFindFile);
static IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);
public static class Globals {
public static BlockingCollection<string> resultFileList {get;set;}
}
public static BlockingCollection<string> GetTreeFiles(string path, string searchFile) {
Globals.resultFileList = new BlockingCollection<string>();
List<string> dirList = new List<string>();
searchFile = @"^" + searchFile.Replace(@".",@"\.").Replace(@"*",@".*").Replace(@"?",@".") + @"$";
GetFiles(path, searchFile);
return Globals.resultFileList;
}
static void GetFiles(string path, string searchFile) {
path = path.EndsWith(@"\") ? path : path + @"\";
List<string> dirList = new List<string>();
WIN32_FIND_DATA fileData;
IntPtr handle = INVALID_HANDLE_VALUE;
handle = FindFirstFile(path + @"*", out fileData);
if (handle != INVALID_HANDLE_VALUE) {
FindNextFile(handle, out fileData);
while (FindNextFile(handle, out fileData)) {
if ((fileData.dwFileAttributes & 0x10) > 0) {
string fullPath = path + fileData.cFileName;
dirList.Add(fullPath);
} else {
if (Regex.IsMatch(fileData.cFileName, searchFile, RegexOptions.IgnoreCase)) {
string fullPath = path + fileData.cFileName;
Globals.resultFileList.TryAdd(fullPath);
}
}
}
FindClose(handle);
Parallel.ForEach(dirList, (dir) => {
GetFiles(dir, searchFile);
});
}
}
}
"@
[fileSearch]::GetTreeFiles($searchDir, 'hosts')
And the final runtime is now faster than robocopy with 3,2536388 sec. I also added an optimized version of that code in the solution.