
I'm trying to find duplicate files on my computer. I'm grouping by file length first and only hashing the candidates, to speed up the process.

Someone told me I can improve the speed of my code by changing the hashing algorithm to MD5, but I don't know where I have to write that. I've copied my code below to show what I'm trying to do.

$srcDir = "C:\Users\Dell\Documents"
Measure-Command {
  Get-ChildItem -Path $srcDir -File -Recurse | Group -Property Length | 
  where { $_.Count -gt 1 } | select -ExpandProperty Group | 
  Get-FileHash -Algorithm MD5 | 
  Group -Property Hash | where { $_.count -gt 1 } | 
  foreach { $_.Group | select Path, Hash }
}
  • You're already using the MD5 algorithm.... – Santiago Squarzon May 13 '23 at 13:04
  • I don't think your current code can be improved further; it is pretty good as it is. You might have better luck calling the `System.Security.Cryptography.MD5` API directly, but it's hard to tell if that's going to be an improvement (a rough sketch of that approach follows after these comments). – Santiago Squarzon May 13 '23 at 14:00
  • Does this answer your question: [Powershell Speed: How to speed up ForEach-Object MD5/hash check](https://stackoverflow.com/a/59916692/1701026)? – iRon May 13 '23 at 16:51
  • If you want to go faster than that, you will probably need to look into parallel processing, see: [Foreach-Object -Parallel](https://learn.microsoft.com/powershell/module/microsoft.powershell.core/foreach-object#-parallel). Something like (didn't test): `Measure-Command { Get-ChildItem -Path $srcDir -File -Recurse | Group -Property Length | where { $_.Count -gt 1 } | select -ExpandProperty Group | ForEach-Object -ThrottleLimit 4 -Parallel { $_ | Get-FileHash -Algorithm MD5 } | Group -Property Hash | where { $_.Count -gt 1 } | foreach { $_.Group | select Path, Hash } }` – iRon May 13 '23 at 17:30
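For reference, a minimal, untested sketch of the direct `System.Security.Cryptography.MD5` approach mentioned in the comments above might look like the following. It assumes the same `$srcDir` as in the question and formats the hash the same way `Get-FileHash` does (uppercase hex, no dashes):

$srcDir = 'C:\Users\Dell\Documents'
$md5 = [System.Security.Cryptography.MD5]::Create()
try {
    Get-ChildItem -Path $srcDir -File -Recurse |
        Group-Object Length | Where-Object Count -GT 1 |
        Select-Object -ExpandProperty Group |
        ForEach-Object {
            # Stream each candidate file through the shared MD5 instance
            $stream = [System.IO.File]::OpenRead($_.FullName)
            try {
                $bytes = $md5.ComputeHash($stream)
                [pscustomobject]@{
                    Path = $_.FullName
                    Hash = [System.BitConverter]::ToString($bytes).Replace('-', '')
                }
            }
            finally {
                $stream.Dispose()
            }
        } |
        Group-Object Hash | Where-Object Count -GT 1 |
        ForEach-Object { $_.Group }
}
finally {
    $md5.Dispose()
}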

2 Answers


As iRon pointed out in the comments, doing the hashing in parallel may improve your current code, and after some testing it does indeed improve efficiency. Here is an implementation that does the hashing in parallel while remaining compatible with Windows PowerShell 5.1, with no modules needed.

$srcDir = 'C:\Users\Dell\Documents'
$maxThreads = 6 # Tweak this value for more or fewer threads
$rs = [runspacefactory]::CreateRunspacePool(1, $maxThreads)
$rs.Open()

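# For each group of files sharing a length, queue a background task that hashes the group and keeps only hashes that occur more than once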
$tasks = Get-ChildItem -Path $srcDir -File -Recurse | Group-Object Length |
    Where-Object Count -GT 1 | ForEach-Object {
        $ps = [powershell]::Create().AddScript({
            $args[0] | Get-FileHash -Algorithm MD5 |
                Group-Object Hash |
                Where-Object Count -GT 1
        }).AddArgument($_.Group)

        $ps.RunspacePool = $rs
        
        @{ ps = $ps; iasync = $ps.BeginInvoke() }
    }

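# Wait for each task to finish and emit its results, then dispose of its PowerShell instance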
$tasks | ForEach-Object {
    try {
        $_.ps.EndInvoke($_.iasync)
    }
    finally {
        if($_.ps) {
            $_.ps.Dispose()
        }
    }
}

if($rs) {
    $rs.Dispose()
}
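The output is the same kind of `Group-Object` result your original pipeline produces, so if you capture it in a variable (for example `$duplicates = $tasks | ForEach-Object { ... }`, a name used here only for illustration), you can flatten it the same way you already do:

$duplicates | ForEach-Object { $_.Group | Select-Object Path, Hash }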
Santiago Squarzon

Getting a file hash will always take some time, so you will have to test whether the code below is a bit faster:

$srcDir = "C:\Users\Dell\Documents"
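# Group by length first, then compute an MD5 hash (via a calculated property) only for files whose length occurs more than once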
$files  = Get-ChildItem -Path $srcDir -File -Recurse | Group-Object -Property Length | Where-Object { $_.Count -gt 1 } |
          ForEach-Object { $_.Group | Select-Object FullName, Length, @{Name = 'Hash'; Expression = {($_ | Get-FileHash -Algorithm MD5).Hash}}}
$files | Group-Object Hash | Where-Object { $_.Count -gt 1 } | ForEach-Object {$_.Group}
Theo