
I've written a script to help me identify duplicate files. For some reason, if I split these commands and export/import to CSV, it runs much faster than if I leave everything in memory. Here is my original code, which is god-awful slow:

Get-ChildItem M:\ -recurse | where-object {$_.length -gt 524288000} | select-object Directory, Name | Group-Object directory | ?{$_.count -gt 1} | %{$_.Group} | export-csv -notypeinformation M:\Misc\Scripts\Duplicates.csv

If I split this into two commands and export to CSV in the middle, it runs about 100x faster. I'm hoping someone can shed some light on what I'm doing wrong.

Get-ChildItem M:\ -recurse | where-object {$_.length -gt 524288000} | select-object Directory, Name | Export-Csv -notypeinformation M:\Misc\Scripts\Duplicates\4.csv

import-csv M:\Misc\Scripts\Duplicates\4.csv | Group-Object directory | ?{$_.count -gt 1} | %{$_.Group} | export-csv -notypeinformation M:\Misc\Scripts\Duplicates\Duplicates.csv

remove-item M:\Misc\Scripts\Duplicates\4.csv

appreciate any suggestions,

~TJ

TJ O
  • I find it hard to believe that exporting to a file, importing it back, and then filtering is faster than just doing everything in memory. Also, the `select-object Directory, Name` is mispositioned; it should be the last step before exporting. – Santiago Squarzon Oct 18 '22 at 22:11
  • I did the select early on because it was my hope that dropping some of the other properties early in the script would speed things up - I don't care about LastWriteTime for example. I put select-object near the end just now but don't notice any difference. The parent folder has about 10K child folders, it's specifically the group-object portion of the script that takes forever - I gave up after waiting 20 minutes, vs the 10 seconds group-object takes when I use CSV files. – TJ O Oct 18 '22 at 22:21
  • Is this PowerShell 5.1 or PowerShell Core 7+ ? Also, you should note, `.Directory` is not just a string, it's a `DirectoryInfo` object in itself which is massive if you try this with too many files (you are also missing `-File` in your `Get-ChildItem` call). Also your condition for determining if a file is a duplicate seems quite odd, how does grouping the objects by their parent folder help you determine that they're a duplicate or not? – Santiago Squarzon Oct 18 '22 at 22:42
  • The best way to determine if a file is a duplicate is to MD5sum them, or a bit more cumbersome but faster, to sequentially read the bytes and compare them – Santiago Squarzon Oct 18 '22 at 22:45
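The hashing suggestion in the last comment could be sketched roughly as follows. This is only an illustration, not the asker's or the commenter's actual script: `Find-DuplicateFile`, its parameters, and the temporary demo folder are hypothetical names made up for the example. It groups by `Length` first, on the assumption that files of different sizes cannot be identical, so only size collisions are hashed with `Get-FileHash`.

```powershell
# Hypothetical helper: the function name and parameters are assumptions for this sketch.
function Find-DuplicateFile {
    param(
        [Parameter(Mandatory)]
        [string] $Path,

        [long] $MinimumLength = 0
    )

    # Stage 1: keep only files that share their size with at least one other file.
    # Stage 2: hash those candidates and keep hashes that occur more than once.
    Get-ChildItem -LiteralPath $Path -Recurse -File |
        Where-Object Length -ge $MinimumLength |
        Group-Object Length |
        Where-Object Count -gt 1 |
        ForEach-Object {
            foreach ($file in $_.Group) {
                Get-FileHash -LiteralPath $file.FullName -Algorithm MD5
            }
        } |
        Group-Object Hash |
        Where-Object Count -gt 1 |
        ForEach-Object {
            foreach ($item in $_.Group) {
                [pscustomobject]@{
                    Hash = $item.Hash
                    Path = $item.Path
                }
            }
        }
}

# Demo on a throwaway folder: two identical files and one same-sized but different file.
$demo = Join-Path ([System.IO.Path]::GetTempPath()) ("dup-demo-" + [guid]::NewGuid())
$null = New-Item -ItemType Directory -Path $demo
Set-Content -Path (Join-Path $demo 'one.txt')   -Value 'same content'
Set-Content -Path (Join-Path $demo 'two.txt')   -Value 'same content'
Set-Content -Path (Join-Path $demo 'three.txt') -Value 'different!!!'

$duplicates = @(Find-DuplicateFile -Path $demo)
$duplicates | Format-Table -AutoSize

Remove-Item -LiteralPath $demo -Recurse -Force
```

For the scenario in the question, the call would presumably look like `Find-DuplicateFile -Path M:\ -MinimumLength 500mb`. Hashing files of that size takes a while, which is why the size pre-filter is worth having.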

1 Answer


It's not Group-Object itself that is slow, it's your grouping condition. You're asking it to group FileInfo objects by their .Directory property, which holds the parent folder's DirectoryInfo instance, so the cmdlet has to use a very complex object as the grouping key. Instead, you could group by the .DirectoryName property, which is the parent directory's FullName (a simple string), or by .Directory.Name, which is the parent folder's Name (also a simple string).

To summarize, the main reason exporting to CSV is faster in this case is that when Export-Csv receives your objects from the pipeline, it calls the ToString() method on each property value. The Directory instance is therefore converted to its string representation (calling ToString() on this instance yields the folder's FullName), and after Import-Csv the Directory column is a plain string, so Group-Object only has to compare strings.
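To see the difference for yourself, here is a small sketch that runs against a throwaway folder under the temp directory rather than M:\. The folder layout and variable names are made up for the demonstration.

```powershell
# Build a tiny tree: two folders with three files each (names are arbitrary).
$root = Join-Path ([System.IO.Path]::GetTempPath()) ("grp-demo-" + [guid]::NewGuid())
foreach ($dir in 'a', 'b') {
    $folder = Join-Path $root $dir
    $null = New-Item -ItemType Directory -Path $folder -Force
    foreach ($n in 1..3) {
        Set-Content -Path (Join-Path $folder "file$n.txt") -Value 'x'
    }
}

$files = @(Get-ChildItem -LiteralPath $root -Recurse -File)

# .Directory is a full DirectoryInfo object; .DirectoryName is just a string.
$directoryType     = $files[0].Directory.GetType().FullName
$directoryNameType = $files[0].DirectoryName.GetType().FullName
"Directory     : $directoryType"
"DirectoryName : $directoryNameType"

# This is the text that ends up in the CSV column for Directory.
"As text       : $($files[0].Directory.ToString())"

# Grouping on the string property avoids using DirectoryInfo as the key.
$groups = @($files | Group-Object DirectoryName)
$groups | Select-Object Count, Name

Remove-Item -LiteralPath $root -Recurse -Force
```

On a tree this small both grouping keys finish instantly. The gap only shows up with many files, as in the question.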

As for your code, if you want to keep it as efficient as possible without overcomplicating it:

Get-ChildItem M:\ -Recurse -File | & {
    process {
        if($_.Length -gt 500mb) { $_ }
    }
} | Group-Object DirectoryName | & {
    process {
        if($_.Count -gt 1) {
            foreach($object in $_.Group) {
                [pscustomobject]@{
                    Directory = $_.Name # => This is the Parent Directory FullName
                    Name      = $object.Name
                }
            }
        }
    }
} | Export-Csv M:\Misc\Scripts\Duplicates\4.csv -NoTypeInformation

If you want to group them by the Parent Name instead of FullName, you could use:

Group-Object { $_.Directory.Name }
Santiago Squarzon