Get duplicate file list by computing their MD5

Question

I have an array that contains file paths. I want to build a list of the files that are duplicates, based on their MD5 hashes. I calculate the MD5 like this:

private void calcMD5(Array files)  //Array contains a path of all files
{
    int i=0;
    string[] md5_val = new string[files.Length];
    foreach (string file_name in files)
    {
        using (var md5 = MD5.Create())
        {
            using (var stream = File.OpenRead(file_name))
            {
                md5_val[i] = BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
                i += 1;
            }
        }
    }                
}

With the code above I am able to calculate the MD5 hashes, but how do I get a list of only those files that are duplicates? If there is another way to do the same thing, please let me know. I am also new to LINQ.

score 11 · Accepted Answer · answered Feb 28 '13 at 11:16

1. Rewrite your calcMD5 function to take in a single file path and return the MD5.
2. Store your file names in a string[] or List<string>, not an untyped array, if possible.
3. Use the following LINQ to get groups of files with the same hash:

var groupsOfFilesWithSameHash = files
  // or files.Cast<string>() if you're stuck with an Array
   .GroupBy(f => calcMD5(f))
   .Where(g => g.Count() > 1);

4. You can get to the groups with nested foreach loops, for example:

foreach(var group in groupsOfFilesWithSameHash)
{
    Console.WriteLine("Shared MD5: " + group.Key);
    foreach (var file in group)
        Console.WriteLine("    " + file);
}
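
Putting the steps above together, a minimal sketch might look like the following. The class name `DuplicateFinder` and the method name `FindDuplicateGroups` are made up for illustration and do not come from the answer. The trailing `ToList()` is an addition, made on the assumption that you want to hash each file only once: `GroupBy` is evaluated lazily, so without it every enumeration of the result would read and hash all the files again.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class DuplicateFinder
{
    // Step 1: hash a single file and return its MD5 as a lowercase hex string.
    public static string CalcMD5(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(md5.ComputeHash(stream))
                               .Replace("-", "")
                               .ToLowerInvariant();
        }
    }

    // Steps 2 and 3: take typed file paths, group them by hash,
    // and keep only the groups that contain more than one file.
    // ToList() (an addition for illustration) runs the query once.
    public static List<IGrouping<string, string>> FindDuplicateGroups(IEnumerable<string> files)
    {
        return files
            .GroupBy(f => CalcMD5(f))
            .Where(g => g.Count() > 1)
            .ToList();
    }
}
```

Each returned group exposes the shared hash as `Key` and can be enumerated to get the file paths, exactly as in the nested `foreach` loops above.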

Michael Schnerring · Answer 2 · 2013-02-28T11:27:25.070

    static void Main(string[] args)
    {
        // returns a list of file names, which have duplicate MD5 hashes
        var duplicates = CalcDuplicates(new[] {"Hello.txt", "World.txt"});
    }

    private static IEnumerable<string> CalcDuplicates(IEnumerable<string> fileNames)
    {
        return fileNames.GroupBy(CalcMd5OfFile)
                        .Where(g => g.Count() > 1)
                        // skip SelectMany() if you'd like the duplicates grouped by their hashes as group key
                        .SelectMany(g => g);
    }

    private static string CalcMd5OfFile(string path)
    {
        // I took your implementation - I don't know if there are better ones
        using (var md5 = MD5.Create())
        {
            using (var stream = File.OpenRead(path))
            {
                return BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
            }
        }
    }

score 0 · Answer 3 · edited Feb 28 '13 at 11:17


var duplicates = md5_val.GroupBy(x => x).Where(x => x.Count() > 1).Select(x => x.Key);

That will give you a list of hashes that are duplicated within the array.

To get the file names instead of the hashes:

// assumes files is a string[] in the same order as md5_val
var duplicates = md5_val.Select((x,i) => new Tuple<string, int>(x, i))
                        .GroupBy(x => x.Item1)
                        .Where(x => x.Count() > 1)
                        .SelectMany(g => g.Select(x => files[x.Item2]));

edited Feb 28 '13 at 11:17 by Bort · answered Feb 28 '13 at 11:12 by MarcinJuraszek

I didn't downvote, but I think he wants a list of the duplicated filenames, not the duplicated hashes. – Matthew Watson Feb 28 '13 at 11:14

score 0 · Answer 4 · answered Feb 28 '13 at 11:17

Instead of returning an array of all the files' MD5 hashes, do it this way:

Have a single 'calculateFileHash()' method.
Create an array of filenames to test for.
Do this:

var dupes = Filenames.GroupBy(fn => calculateFileHash(fn)).Where(gr => gr.Count() > 1);

This will return a sequence of groups, each group being an enumerable containing the filenames whose contents hash to the same value.
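
For readers who are new to LINQ, the same grouping can be sketched with plain loops and a dictionary of lists. This is an illustrative equivalent under stated assumptions, not the answer's own code: the class name `LoopDuplicateFinder` and the method `FindDuplicates` are hypothetical, and `CalculateFileHash` reuses the MD5 approach from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

public static class LoopDuplicateFinder
{
    // Hashes one file and returns the MD5 as a lowercase hex string.
    public static string CalculateFileHash(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(md5.ComputeHash(stream))
                               .Replace("-", "")
                               .ToLowerInvariant();
        }
    }

    // Returns hash -> file names, containing only hashes shared by two or more files.
    public static Dictionary<string, List<string>> FindDuplicates(IEnumerable<string> fileNames)
    {
        // Collect every file name under its hash.
        var byHash = new Dictionary<string, List<string>>();
        foreach (var fileName in fileNames)
        {
            var hash = CalculateFileHash(fileName);
            List<string> sameHash;
            if (!byHash.TryGetValue(hash, out sameHash))
            {
                sameHash = new List<string>();
                byHash[hash] = sameHash;
            }
            sameHash.Add(fileName);
        }

        // Keep only the hashes that occur more than once.
        var duplicates = new Dictionary<string, List<string>>();
        foreach (var pair in byHash)
        {
            if (pair.Value.Count > 1)
                duplicates.Add(pair.Key, pair.Value);
        }
        return duplicates;
    }
}
```

Each entry in the result plays the role of one group from the LINQ query: the key is the shared hash and the value is the list of file names with that hash.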

Maris · Answer 5 · 2013-02-28T11:39:58.560


    private void calcMD5(String[] filePathes)  // array contains the paths of all files
    {
        // A list of (hash, path) pairs: a Dictionary keyed by hash would throw on the second file with the same hash
        var hashToFilePathes = new List<KeyValuePair<String, String>>();
        foreach (string file_name in filePathes)
        {
            using (var md5 = MD5.Create())
            {
                using (var stream = File.OpenRead(file_name))
                {
                    // Key is the MD5 hash, value is the file path
                    hashToFilePathes.Add(new KeyValuePair<String, String>(BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower(), file_name));
                }
            }
        }
        // Here will be all duplicates
        List<String> listOfDuplicates = hashToFilePathes.GroupBy(e => e.Key).Where(e => e.Count() > 1).SelectMany(e=>e).Select(e => e.Value).ToList();
    }

edited Feb 28 '13 at 11:39 · answered Feb 28 '13 at 11:28 by Maris

This way works fast and looks much nicer. Afterwards you can use `listOfDuplicates` as you want. – Maris Feb 28 '13 at 11:31
`.Select(e => e.First().Value)` returns only one file from each group of duplicates. I assume the filenames aren't duplicates, just their hashes, so if there are three duplicates in one group, the information it returns is quite useless. I'd recommend `.SelectMany(e => e).Select(e => e.Value)`, or leaving them grouped entirely. – Michael Schnerring Feb 28 '13 at 11:32
That is one way, but I don't think it will look better or work faster. I'd recommend using my first way. – Maris Feb 28 '13 at 11:35
Accessing the first element of an array (which is in memory) is a quick operation, and there is no need to optimize it. – Maris Feb 28 '13 at 11:36
What do you mean by "look better"? If you'd like to get rid of all duplicates but one, your implementation definitely lacks information when there is more than one duplicate in a group. – Michael Schnerring Feb 28 '13 at 11:37
Sorry, yes, now I understand you, you are right. I missed your first idea. Edited my answer. – Maris Feb 28 '13 at 11:39
