4

I'm running the following MD5 check on 500 million files to check for duplicates. The script is taking forever to run and I was wondering how to speed it up. Could I use a try/catch block, instead of Contains, to throw an error when the hash already exists? What would you all recommend?

$folder = Read-Host -Prompt 'Enter a folder path'

$hash = @{}
$lineCheck = 0

Get-ChildItem $folder -Recurse | where {! $_.PSIsContainer} | ForEach-Object {
    $lineCheck++
    Write-Host $lineCheck
    $tempMD5 = (Get-FileHash -LiteralPath $_.FullName -Algorithm MD5).Hash;

    if(! $hash.Contains($tempMD5)){
        $hash.Add($tempMD5,$_.FullName)
    }
    else{
        Remove-Item -literalPath $_.fullname;
    }
} 
Hughes
  • "500 million files" <-- I think that's why it's slow. What kind of storage are you running this on? – Dai Jan 26 '20 at 00:59
  • Still would appreciate any advice on how to speed it up. – Hughes Jan 26 '20 at 00:59
  • If you're saturating your storage's throughput then you cannot speed it up. – Dai Jan 26 '20 at 01:00
  • Most files are duplicates, so the size of the hash remains relatively small... Memory's currently hovering at about 1GB after searching through 150 million files. – Hughes Jan 26 '20 at 01:01
  • 3
    The size of the hash is irrelevant - your computer still needs to read every byte of every file that you're processing - that's the slow part. Reading from lots of small files is also much slower than reading from a few large files - even on SSDs because of how the OS manages file IO and caching. – Dai Jan 26 '20 at 01:03
  • True, but couldn't I parallelize it to speed it up? – Hughes Jan 26 '20 at 01:04
  • @Hughes - [1] is the code running on the system the files are on? pulling the _entire content of the file across the net is slow_. ///// [2] have you checked to see if the code is the choke point? as `Dai` mentioned, the connection may be the problem. – Lee_Dailey Jan 26 '20 at 01:04
  • 1
    PowerShell is not a good language to write parallel code in - and parallelism will not help you at all if your program is IO-bound (instead of CPU-bound) - and I strongly suspect your program is very, very IO-bound. – Dai Jan 26 '20 at 01:05
  • Good question. The files are on the same system the code is running on. – Hughes Jan 26 '20 at 01:05
  • @Hughes - this sounds like a response to my item [1] ... so, what about item [2]? [*grin*] – Lee_Dailey Jan 26 '20 at 01:23
  • 2
    Do what you currently are doing, but first on the length of the file. For each match that you find, create a new table `$LengthMatch[$File1] = $File2` (there should only be a few). Then do your (slow) in-depth MD5 check just on the files in the `$LengthMatch` list. – iRon Jan 26 '20 at 09:01
  • I don't know Windows or PowerShell (so take what I say with a pinch of salt) but it seems there is an ability to do background jobs, so maybe you could run 4 or 8 MD5 checksums in parallel (a rough sketch follows these comments)... https://mcpmag.com/articles/2018/04/18/background-jobs-in-powershell.aspx – Mark Setchell Jan 26 '20 at 09:50
  • @iRon Great suggestion - files can't be identical if they have different lengths and checking length should be faster than reading entire file - even on Windows. – Mark Setchell Jan 26 '20 at 09:51
  • Have you even measured how long it takes Windows to recurse 500 million files without checksumming them? If they are mostly duplicates, you will be running 500 million `Remove-Item` commands, that in itself could take some time. Maybe you can build a list of items to delete and only delete them when there are say 50/100 items in the list? – Mark Setchell Jan 26 '20 at 09:55
  • Can you run **Windows Subsystem for Linux** or Cygwin? If so, you could use powerful Linux tools like **GNU Parallel**. – Mark Setchell Jan 26 '20 at 09:56
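
Following up on the parallelism suggestion in the comments, here is a rough sketch of how the hashing could be parallelized. This is my own illustration, not code from the thread: it assumes PowerShell 7+ (for ForEach-Object -Parallel), the throttle limit of 8 is an arbitrary choice, and it keeps every result in memory, so it only helps if the work is CPU-bound rather than IO-bound.

# Sketch only: hash files on several threads, then group by hash afterwards (requires PowerShell 7+).
$folder = Read-Host -Prompt 'Enter a folder path'

$hashes = Get-ChildItem $folder -Recurse -File |
    ForEach-Object -Parallel {
        [pscustomobject]@{
            Path = $_.FullName
            Hash = (Get-FileHash -LiteralPath $_.FullName -Algorithm MD5).Hash
        }
    } -ThrottleLimit 8   # arbitrary thread count

# Keep the first file in each hash group; treat the rest as duplicates.
$hashes | Group-Object Hash | Where-Object Count -gt 1 | ForEach-Object {
    $_.Group | Select-Object -Skip 1 | ForEach-Object {
        Remove-Item -LiteralPath $_.Path -WhatIf   # drop -WhatIf once the output looks right
    }
}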

4 Answers

6

As suggested in the comments, you might consider hashing a file only when another file with the same length has already been seen. That way you never invoke the expensive hash method for a file whose length is unique.

Note that the Write-Host command is quite expensive by itself, therefore I would not display every iteration (Write-Host $lineCheck) but e.g. only when a match is found.
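
For instance (my own sketch, not part of the original answer; the interval of 10000 is arbitrary), progress could be reported only every so many files:

# Sketch: report progress only every 10000 files instead of on every iteration.
$Folder = Read-Host -Prompt 'Enter a folder path'
$lineCheck = 0
Get-ChildItem $Folder -Recurse | Where-Object {!$_.PSIsContainer} | ForEach-Object {
    $lineCheck++
    if ($lineCheck % 10000 -eq 0) { Write-Host "Processed $lineCheck files" }
    # ... hashing / duplicate handling would go here ...
}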

$Folder = Read-Host -Prompt 'Enter a folder path'

$FilesBySize = @{}
$FilesByHash = @{}

# Returns $True if a file with this hash was seen before, and records the file name either way.
Function MatchHash([String]$FullName) {
    $Hash = (Get-FileHash -LiteralPath $FullName -Algorithm MD5).Hash
    $Found = $FilesByHash.Contains($Hash)
    If ($Found) {$Null = $FilesByHash[$Hash].Add($FullName)}
    Else {$FilesByHash[$Hash] = [System.Collections.ArrayList]@($FullName)}
    $Found
}

Get-ChildItem $Folder -Recurse | Where-Object {!$_.PSIsContainer} | ForEach-Object {
    $Files = $FilesBySize[$_.Length]
    If ($Files) {
        # Another file with this length exists: hash the first one lazily, then hash the current one.
        If ($Files.Count -eq 1) {$Null = MatchHash $Files[0]}
        If ($Files.Count -ge 1) {If (MatchHash $_.FullName) {Write-Host 'Found match:' $_.FullName}}
        $Null = $FilesBySize[$_.Length].Add($_.FullName)
    } Else {
        # First file with this length: just remember it, no hashing yet.
        $FilesBySize[$_.Length] = [System.Collections.ArrayList]@($_.FullName)
    }
}

Display the found duplicates:

ForEach($Hash in $FilesByHash.GetEnumerator()) {
    If ($Hash.Value.Count -gt 1) {
        Write-Host 'Hash:' $Hash.Name
        ForEach ($File in $Hash.Value) {
            Write-Host 'File:' $File
        }
    }
}
iRon
2

I'd guess that the slowest part of your code is the Get-FileHash invocation, since everything else is either not computationally intensive or limited by your hardware (disk IOPS).

You could try replacing it with an invocation of a native tool that has a more optimized MD5 implementation and see if it helps.
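
For example (this is my illustration, not part of the original answer, and the path is a placeholder), Windows ships certutil.exe, whose -hashfile verb supports MD5 and can be called straight from PowerShell:

# Sketch: hash one file with the built-in certutil.exe instead of Get-FileHash.
# certutil prints a header line, the hash, and a footer line, so the hash is element [1];
# older Windows versions separate the hex bytes with spaces, hence the -replace.
$file = 'C:\temp\example.bin'   # placeholder path
$md5  = (certutil.exe -hashfile $file MD5)[1] -replace '\s', ''
$md5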

Could I use a try catch loop instead of contains to throw an error when the hash already exists instead?

Exceptions are slow and using them for flow control is not recommended:

While the use of exception handlers to catch errors and other events that disrupt program execution is a good practice, the use of exception handlers as part of the regular program execution logic can be expensive and should be avoided.

There is a definitive answer to this from the guy who implemented them - Chris Brumme. He wrote an excellent blog article about the subject (warning - it's very long) (warning 2 - it's very well written; if you're a techie you'll read it to the end and then have to make up your hours after work :) )

The executive summary: they are slow. They are implemented as Win32 SEH exceptions, so some will even pass the ring 0 CPU boundary!
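
To illustrate the difference (my own sketch, not part of the answer; the iteration count and keys are arbitrary), you can compare both approaches with Measure-Command - the try/catch version pays the exception cost on every duplicate key:

# Check-before-add: no exceptions are thrown.
$hash = @{}
Measure-Command {
    foreach ($i in 1..100000) {
        if (-not $hash.ContainsKey($i % 10)) { $hash[$i % 10] = $i }
    }
}

# Add-and-catch: Hashtable.Add throws once the key already exists,
# and swallowing that exception is the flow-control anti-pattern described above.
$hash = @{}
Measure-Command {
    foreach ($i in 1..100000) {
        try   { $hash.Add($i % 10, $i) }
        catch { }
    }
}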

beatcracker
  • 1
    It looks like you're right. Getting the hash for each file is by far the slowest part. According to this article, PowerShell only runs on a single core (https://www.experts-exchange.com/questions/29148077/get-filehash-command-speed.html). This makes parallelization difficult. Looks like I'll need to switch to Python or C#. – Hughes Jan 26 '20 at 03:45
1

I know this is a PowerShell question, but you can make good use of parallelization in C#. You also mentioned in one of the comments using C# as an alternative, so I thought it wouldn't hurt to post a possible implementation of how it could be done.

You could first create a method to calculate the MD5 Checksum for a file:

private static string CalculateMD5(string filename)
{
    using var md5 = MD5.Create();
    using var stream = File.OpenRead(filename);
    var hash = md5.ComputeHash(stream);
    return BitConverter.ToString(hash).Replace("-", string.Empty).ToLowerInvariant();
}

Then you could make a method which queries all file hashes in parallel using ParallelEnumerable.AsParallel():

private static IEnumerable<FileHash> FindFileHashes(string directoryPath)
{
    var allFiles = Directory
        .EnumerateFiles(directoryPath, "*", SearchOption.AllDirectories);

    var hashedFiles = allFiles
        .AsParallel()
        .Select(filename => new FileHash { 
            FileName = filename, 
            Hash = CalculateMD5(filename) 
        });

    return hashedFiles;
}

Then you can simply use the above method to delete duplicate files:

private static void DeleteDuplicateFiles(string directoryPath)
{
    var fileHashes = new HashSet<string>();

    foreach (var fileHash in FindFileHashes(directoryPath))
    {
        if (!fileHashes.Contains(fileHash.Hash))
        {
            Console.WriteLine($"Found - File : {fileHash.FileName} Hash : {fileHash.Hash}");
            fileHashes.Add(fileHash.Hash);
            continue;
        }

        Console.WriteLine($"Deleting - File : {fileHash.FileName} Hash : {fileHash.Hash}");
        File.Delete(fileHash.FileName);
    }
}

Full Program:

using System;
using System.Collections.Generic;
using System.Linq;
using System.IO;
using System.Security.Cryptography;

namespace Test
{
    internal class FileHash
    {
        public string FileName { get; set; }
        public string Hash { get; set; }
    }

    public class Program
    {
        public static void Main()
        { 
            var path = @"C:\Path\To\Files";
            if (Directory.Exists(path))
            {
                Console.WriteLine($"Deleting duplicate files at {path}");
                DeleteDuplicateFiles(path);
            }
        }

        private static void DeleteDuplicateFiles(string directoryPath)
        {
            var fileHashes = new HashSet<string>();

            foreach (var fileHash in FindFileHashes(directoryPath))
            {
                if (!fileHashes.Contains(fileHash.Hash))
                {
                    Console.WriteLine($"Found - File : {fileHash.FileName} Hash : {fileHash.Hash}");
                    fileHashes.Add(fileHash.Hash);
                    continue;
                }

                Console.WriteLine($"Deleting - File : {fileHash.FileName} Hash : {fileHash.Hash}");
                File.Delete(fileHash.FileName);
            }
        }

        private static IEnumerable<FileHash> FindFileHashes(string directoryPath)
        {
            var allFiles = Directory
                .EnumerateFiles(directoryPath, "*", SearchOption.AllDirectories);

            var hashedFiles = allFiles
                .AsParallel()
                .Select(filename => new FileHash { 
                    FileName = filename, 
                    Hash = CalculateMD5(filename) 
                });

            return hashedFiles;
        }

        private static string CalculateMD5(string filename)
        {
            using var md5 = MD5.Create();
            using var stream = File.OpenRead(filename);
            var hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", string.Empty).ToLowerInvariant();
        }
    }
}
RoadRunner
0

If you're trying to find duplicates, the fastest way to do this is to use something like jdupes or fdupes. These are incredibly performant and written in C.
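
For example, a possible invocation (my sketch, with a placeholder path; check the flags against your installed version's --help): -r recurses into subdirectories, and -d together with -N deletes every duplicate except the first file in each set without prompting.

# Hypothetical jdupes invocation; verify the flags against jdupes --help first.
# -r recurse, -d delete duplicates, -N no prompt (keep the first file found in each set)
jdupes -r -d -N C:\path\to\folder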

muloka