
I'm writing a duplicate file detector. To determine if two files are duplicates I calculate a CRC32 checksum. Since this can be an expensive operation, I only want to calculate checksums for files that have another file with matching size. I have sorted my list of files by size, and am looping through to compare each element to the ones above and below it. Unfortunately, there is an issue at the beginning and end since there will be no previous or next file, respectively. I can fix this using if statements, but it feels clunky. Here is my code:

    public void GetCRCs(List<DupInfo> dupInfos)
    {
        var crc = new Crc32();
        for (int i = 0; i < dupInfos.Count(); i++)
        {
            if (dupInfos[i].Size == dupInfos[i - 1].Size || dupInfos[i].Size == dupInfos[i + 1].Size)
            {
                dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
            }
        }
    }

My questions are:

  1. How can I compare each entry to its neighbors without the out of bounds error?

  2. Should I be using a loop for this, or is there a better LINQ or other function?

Note: I did not include the rest of my code to avoid clutter. If you want to see it, I can include it.

Kalev Maricq
  • Instead of starting at 0, start at 1 and end at `dupInfos.Count() - 1` – 3dd Jun 23 '15 at 18:46
  • Another issue you may not have considered: what if there are 4 files with the same size, and the first and 4th files are identical? Your code here will miss those because other non-identical files with the same size are in between them. – DiscipleMichael Jun 23 '15 at 18:51
  • If you want to be sure you find all of the matches, you need to do a compare for each pair in each file-size group, not just the one above and below. – DiscipleMichael Jun 23 '15 at 18:56
  • 3dd - that's my current tactic, but it feels inelegant. I'm just wondering if there is a better way. DiscipleMichael - My plan is to then sort by size and CRC. I can then only check adjacent files again instead of looping through each file in a size-group. – Kalev Maricq Jun 23 '15 at 19:14
  • You don't have to check every possible pair in the entire set... just every possible pair in a group of same file size – DiscipleMichael Jun 23 '15 at 19:15
  • Good point. And if I subsort each group by CRC, I can get away with only checking adjacent entries. Since comparing each item to every other item in a group is O(N^2) and smart sorting is O(N ln N), it should be faster on large sets. – Kalev Maricq Jun 23 '15 at 19:21
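
A minimal sketch of the plan from the last comment, assuming the DupInfo fields used in the question (Size, a nullable CheckSum, FullName):

    // Sort so that files with equal size and checksum end up adjacent.
    dupInfos = dupInfos.OrderBy(d => d.Size)
                       .ThenBy(d => d.CheckSum)
                       .ToList();

    for (int i = 1; i < dupInfos.Count; i++)
    {
        if (dupInfos[i].Size == dupInfos[i - 1].Size &&
            dupInfos[i].CheckSum != null &&
            dupInfos[i].CheckSum == dupInfos[i - 1].CheckSum)
        {
            // dupInfos[i] and dupInfos[i - 1] are likely duplicates.
        }
    }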

4 Answers


Compute the Crcs first:

// It is assumed that DupInfo.CheckSum is nullable
public void GetCRCs(List<DupInfo> dupInfos)
{
    var crc = new Crc32(); // the Crc32 instance from the question
    dupInfos[0].CheckSum = null;
    for (int i = 1; i < dupInfos.Count(); i++)
    {
        dupInfos[i].CheckSum = null;
        if (dupInfos[i].Size == dupInfos[i - 1].Size)
        {
            if (dupInfos[i - 1].CheckSum == null)
                dupInfos[i - 1].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i - 1].FullName));
            dupInfos[i].CheckSum = crc.ComputeChecksum(File.ReadAllBytes(dupInfos[i].FullName));
        }
    }
}

After having sorted your files by size and CRC, identify duplicates:

public void GetDuplicates(List<DupInfo> dupInfos)
{
    // The loop is inverted to allow deleting list items.
    for (int i = dupInfos.Count() - 1; i > 0; i--)
    {
        if (dupInfos[i].Size == dupInfos[i - 1].Size &&
            dupInfos[i].CheckSum != null &&
            dupInfos[i].CheckSum == dupInfos[i - 1].CheckSum)
        {
            // i is a duplicate of i - 1
            ... // your code here
            ... // eventually, dupInfos.RemoveAt(i);
        }
    }
}
Graffito
  • Yep... additionally, the only way you can skip a checksum is if it's the only file of its size in the list. If you really want to, you can exclude those files in an if block. – DiscipleMichael Jun 23 '15 at 19:29
  • In the GetCRCs() procedure, checksum computations are only done when 2 or more files have the same size. When there is only one file of a certain size, its CheckSum is null, which is what the GetDuplicates() loop tests. – Graffito Jun 23 '15 at 19:35
  • Did you mean for the last line to say dupInfos[i]...? Good point about Checksum being nullable. I've changed it from uint to uint?. Is it necessary to set it to null or will it be null by default? For the 2nd part, did you mean i--? I like the idea of inverting it to allow for removal. In this situation, each DupInfo just holds info about a file on the disk, so I need to actually perform a delete on it instead of just removing it from the list. Still, I could also remove it from the list. – Kalev Maricq Jun 24 '15 at 12:37
  • Yes, it's dupInfos[i] on both sides. I just edited the proposed code – sorry for that cut-and-paste error! CheckSum is initialized to null in the first instruction of the first loop. – Graffito Jun 24 '15 at 17:20

I think the for loop should be: `for (int i = 1; i < dupInfos.Count() - 1; i++)`

var grps = dupInfos.GroupBy(d => d.Size);
grps.Where(g => g.Count() > 1).ToList().ForEach(g =>
{
    ...
});
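
For example, the ForEach body might compute the checksum for every file in each size group. A minimal sketch, assuming the question's Crc32 instance and DupInfo fields (FullName, CheckSum):

var crc = new Crc32(); // the Crc32 class from the question
dupInfos.GroupBy(d => d.Size)
        .Where(g => g.Count() > 1)
        .ToList()
        .ForEach(g =>
        {
            foreach (var d in g)
            {
                // Only files that share a size with another file get a checksum.
                d.CheckSum = crc.ComputeChecksum(File.ReadAllBytes(d.FullName));
            }
        });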
SKG
  • I like this idea. I'm having trouble filling in the ForEach part. I'd like to set the CRC property to the ComputeCheckSum of the file for each DupInfo in size groups larger than one. Do you know how I'd do this? – Kalev Maricq Jun 24 '15 at 12:19

I have sorted my list of files by size, and am looping through to compare each element to the ones above and below it.

The next logical step is to actually group your files by size. Comparing consecutive files will not always be sufficient if you have more than two files of the same size. Instead, you will need to compare every file to every other same-sized file.

I suggest taking this approach:

  1. Use LINQ's .GroupBy to create a collection of file-size groups, then .Where to keep only the groups with more than one file.

  2. Within those groups, calculate the CRC32 checksum for each file and compare it against a collection of already-computed checksums. If you need to know which files specifically are duplicates, you could use a dictionary keyed by this checksum (you can achieve this with another GroupBy); otherwise a simple list will suffice to detect any duplicates.

The code might look something like this:

var crc = new Crc32(); // the Crc32 instance from the question
var filesSetsWithPossibleDupes = files.GroupBy(f => f.Length)
                                      .Where(group => group.Count() > 1);

foreach (var grp in filesSetsWithPossibleDupes)
{
    var checksums = new List<CRC32CheckSum>(); //or whatever type
    foreach (var file in grp)
    {
        var currentCheckSum = crc.ComputeChecksum(File.ReadAllBytes(file.FullName));
        if (checksums.Contains(currentCheckSum))
        {
            //Found a duplicate
        }
        else
        {
            checksums.Add(currentCheckSum);
        }
    }
}

Or if you need the specific objects that could be duplicates, the code might look like this:

var filesSetsWithPossibleDupes = files.GroupBy(f => f.FileSize)
                                      .Where(grp => grp.Count() > 1);

var masterDuplicateDict = new Dictionary<DupStats, IEnumerable<DupInfo>>();
//A dictionary keyed by the basic duplicate stats
//, and whose value is a collection of the possible duplicates

foreach (var grp in filesSetsWithPossibleDupes)
{
    var likelyDuplicates = grp.GroupBy(dup => dup.Checksum)
                              .Where(g => g.Count() > 1);
    //Same GroupBy logic, but applied to the checksum (instead of file size)

    foreach(var dupGrp in likelyDuplicates)
    {
        //Create the key for the dictionary (your code is likely different)
        var sample = dupGrp.First();
        var key = new DupStats() {FileSize = sample.FileSize, Checksum = sample.Checksum};
        masterDuplicateDict.Add(key, dupGrp);
    }
}

A demo of this idea.

ryanyuyu
  • Thanks ryanyuyu, I like the idea of using GroupBy. I'm trying to figure out how to apply step 2. Each DupInfo contains a file path, size, checksum field, and a comparison folder field (in case the user wants to only compare files separated by n levels). I'd like to display only files which could be duplicates (same size, checksum, and comparison folder), grouped into sets, to the user and have them choose which to delete. I like the idea of using a dictionary for previous checksums and am trying to figure out how to get from that to my displayed list of duplicate groups. – Kalev Maricq Jun 24 '15 at 12:52
  • Thank you for that idea. I know this is sort of a separate question, so if I need to I'll post it, but I'm using a ListView (WPF) currently for display. I'm not sure how to load the dictionary/data into the listview for display, or even if listview is the right tool. – Kalev Maricq Jun 24 '15 at 16:01
  • @KalevMaricq yeah that's a separate question. You can always include a link to this question for context if it helps. – ryanyuyu Jun 24 '15 at 16:20

Can you do a union between your two lists? If you have two lists of filenames and take the overlap between them, it should result in only a list of the files that appear in both. I can write out an example if you want, but this link should give you the general idea.

https://stackoverflow.com/a/13505715/1856992
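
For reference, a minimal sketch of that overlap idea (note that in LINQ the overlap comes from Intersect; Union would return all distinct items from both lists):

var list1 = new List<string> { "a.txt", "b.txt", "c.txt" };
var list2 = new List<string> { "b.txt", "c.txt", "d.txt" };
var overlap = list1.Intersect(list2).ToList(); // "b.txt", "c.txt"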

Edit: Sorry, for some reason I thought you were comparing file names, not sizes.

So here is an actual answer for you.

using System;
using System.Collections.Generic;
using System.Linq;


public class ObjectWithSize
{
    public int Size {get; set;}
    public ObjectWithSize(int size)
    {
        Size = size;
    }
}

public class Program
{
    public static void Main()
    {
        Console.WriteLine("start");
        var list = new List<ObjectWithSize>();
        list.Add(new ObjectWithSize(12));
        list.Add(new ObjectWithSize(13));
        list.Add(new ObjectWithSize(14));
        list.Add(new ObjectWithSize(14));
        list.Add(new ObjectWithSize(18));
        list.Add(new ObjectWithSize(15));
        list.Add(new ObjectWithSize(15));
        var duplicates = list.GroupBy(x=>x.Size)
              .Where(g=>g.Count()>1);
        foreach (var dup in duplicates)
            foreach (var objWithSize in dup)
                Console.WriteLine(objWithSize.Size);
    }
}

This will print out:

14
14
15
15

Here is a .NET Fiddle for that: https://dotnetfiddle.net/0ub6Bs

Final note: I actually think your approach looks better and will run faster. This was just an implementation in LINQ.

thinklarge
  • Thanks thinklarge, I have just one list of files, and need to compare them for any of the same size. I don't think a union is the right tool for this. I've considered joining my list onto itself with the where criteria being that the index is +/- 1, but I'm not sure if that's the best solution. – Kalev Maricq Jun 23 '15 at 19:18
  • Thanks Kalev, sorry I didn't read your description well enough. I reworked it so that it applies. It's a LINQ-based solution, but it'll have some overhead associated with the group and where. – thinklarge Jun 23 '15 at 19:27
  • Thanks. I like the idea of using GroupBy. If a simple loop works faster though, I might as well use it. When I was doing something similar in Excel, I just added a record at the beginning and the end and the loop worked fine. In C# though, it seems like I need a different approach. – Kalev Maricq Jun 24 '15 at 13:02