8

I am using the following query:

var queryList1Only = (from file in list1
                      select file).Except(list2, myFileCompare);

where myFileCompare compares two files based on their name and length.

The query returned results when list1 and list2 were small (say 100 files while I tested). Then I increased list1 to 30,000 files and list2 to 20,000 files, and the query now says "Function Evaluation Timed Out".

I searched online and found that debugging could cause it, so I removed all the breakpoints and ran the code. Now the program just freezes, with no output for queryList1Only, which I am trying to print out to check.

EDIT: This is the code for myFileCompare

class FileCompare : System.Collections.Generic.IEqualityComparer<System.IO.FileInfo>
{
    public FileCompare() { }

    public bool Equals(System.IO.FileInfo f1, System.IO.FileInfo f2)
    {
        return (f1.Name == f2.Name && f1.Directory.Name == f2.Directory.Name &&
                f1.Length == f2.Length);
    }

    // Return a hash that reflects the comparison criteria. According to the
    // rules for IEqualityComparer<T>, if Equals is true, then the hash codes must
    // also be equal. Because equality as defined here is a simple value equality, not
    // reference identity, it is possible that two or more objects will produce the same
    // hash code.
    public int GetHashCode(System.IO.FileInfo fi)
    {
        string s = String.Format("{0}{1}", fi.Name, fi.Length);
        return s.GetHashCode();
    }
}
sll
superstar
    post the code for myFileCompare – mcabral Aug 17 '11 at 19:14
  • Anyway such heavy operations should be executed simultaneously in a separate thread to avoid the situations you've just faced. – sll Aug 17 '11 at 19:16
  • It works fine for 3000 files as well with the above code (tested just now) – superstar Aug 17 '11 at 20:37
  • @All: I am trying to work on the memory issues (if that is the case), by removing fileinfo objects from lists and so on. Thank you all so much for your input and feedback – superstar Aug 17 '11 at 20:53
  • @All: I modified the hash code (partially taken from sll) and the Equals method in my comparer. I did not use HashSet (it did not prove to solve my problem). I am able to get the desired results in less than 5 minutes now on my local machine. Now when I point it to the folders on the server, it freezes (it works on the server with 1,000 files; I have tested). At least some progress :) Thank you all – superstar Aug 18 '11 at 15:42

3 Answers

3

What do you need to do with the items returned by the query? In any case, heavy operations like this are best executed in a separate thread to avoid the situation you've just faced.
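As a minimal sketch of moving the work off the calling thread (hypothetical code; plain strings stand in for the FileInfo lists, and the counts mirror the question):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        // Stand-ins for the two file lists from the question
        var list1 = Enumerable.Range(0, 30000).Select(i => "file" + i).ToList();
        var list2 = Enumerable.Range(0, 20000).Select(i => "file" + i).ToList();

        // Run the heavy Except() on a worker thread so the calling
        // thread (UI or debugger evaluation) is not blocked meanwhile
        var onlyInList1 = await Task.Run(() => list1.Except(list2).ToList());

        Console.WriteLine(onlyInList1.Count); // 10000
    }
}
```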

EDIT: An idea

You could try the following algorithm:

  • Sort the items in both lists (List<T>.Sort() uses QuickSort by default); this will be pretty fast with a good comparer.
  • Then traverse both sorted lists with two indices in an ordinary for() loop, advancing whichever index points at the element that sorts first (a merge-style walk), and noting elements that appear in only one list.
  • When one index passes the end of its list, select all remaining items from the other list as different (they do not exist in the first list at all).

I believe you'll get much better performance with sorted lists. Except() can degrade toward O(m*n) when the hash function produces many collisions.
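The merge-style walk above can be sketched like this (hypothetical code; pre-built key strings such as "name|length" stand in for FileInfo so the example is self-contained):

```csharp
using System;
using System.Collections.Generic;

class SortedDiff
{
    // Returns the items of `first` that do not appear in `second`,
    // using a merge-style walk over the two sorted lists.
    public static List<string> OnlyInFirst(List<string> first, List<string> second)
    {
        first.Sort(StringComparer.Ordinal);
        second.Sort(StringComparer.Ordinal);

        var result = new List<string>();
        int i = 0, j = 0;
        while (i < first.Count && j < second.Count)
        {
            int cmp = string.CompareOrdinal(first[i], second[j]);
            if (cmp == 0) { i++; j++; }                  // present in both lists
            else if (cmp < 0) result.Add(first[i++]);    // missing from second
            else j++;                                    // only in second; skip
        }
        while (i < first.Count) result.Add(first[i++]);  // tail is all misses
        return result;
    }

    static void Main()
    {
        var a = new List<string> { "a.txt|10", "b.txt|20", "c.txt|30" };
        var b = new List<string> { "b.txt|20" };
        Console.WriteLine(string.Join(", ", OnlyInFirst(a, b))); // a.txt|10, c.txt|30
    }
}
```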

EDIT: Another idea, which should be really fast

  • Store the files from one server in a HashSet<T>.
  • Then loop through the second list and look each file up in the set; this will be VERY fast. It is roughly O(m) to build the set plus O(n) for the lookups, because you traverse only a single list and each lookup with a good hash function (use the GetHashCode() I've provided with the updated logic) takes near-constant time. Try it out!
// C# sketch; pass your FileCompare so the set uses its Equals/GetHashCode
public IEnumerable<FileInfo> GetDifference(
    List<FileInfo> firstServerFiles,
    List<FileInfo> secondServerFiles,
    IEqualityComparer<FileInfo> comparer)
{
    // Build the set over the first server's files in one step,
    // using the custom comparer
    ISet<FileInfo> firstServerFilesMap =
        new HashSet<FileInfo>(firstServerFiles, comparer);

    // Any file from the second server that is not in the set is a difference
    foreach (var secondServerFile in secondServerFiles)
    {
        if (!firstServerFilesMap.Contains(secondServerFile))
        {
            yield return secondServerFile;
        }
    }
}

EDIT: More details regarding equality logic were provided in comments

Try out this implementation:

public bool Equals(System.IO.FileInfo f1, System.IO.FileInfo f2)
{
    if (ReferenceEquals(f1, f2))
    {
        return true;
    }

    if (f1 == null || f2 == null)
    {
        return false;
    }

    return (f1.Name == f2.Name && f1.Directory.Name == f2.Directory.Name &&
            f1.Length == f2.Length);
}

public int GetHashCode(System.IO.FileInfo fi)
{
    unchecked
    {
        int hash = 17;    
        hash = hash * 23 + fi.Name.GetHashCode();
        hash = hash * 23 + fi.Directory.Name.GetHashCode();
        hash = hash * 23 + fi.Length.GetHashCode();

        return hash;
    }
}
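To make the difference concrete, here is a hypothetical demo contrasting the question's original name+length hash with the combined hash above; the temp directory names and file contents are made up for the example:

```csharp
using System;
using System.IO;

class HashDemo
{
    // The question's original scheme: name + length only
    static int OldHash(FileInfo fi) =>
        string.Format("{0}{1}", fi.Name, fi.Length).GetHashCode();

    // The combined scheme: name, directory name, and length
    static int NewHash(FileInfo fi)
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + fi.Name.GetHashCode();
            hash = hash * 23 + fi.Directory.Name.GetHashCode();
            hash = hash * 23 + fi.Length.GetHashCode();
            return hash;
        }
    }

    static void Main()
    {
        // Two files with the same name and size in different directories
        string dirA = Path.Combine(Path.GetTempPath(), "hashdemoA");
        string dirB = Path.Combine(Path.GetTempPath(), "hashdemoB");
        Directory.CreateDirectory(dirA);
        Directory.CreateDirectory(dirB);
        File.WriteAllText(Path.Combine(dirA, "f.txt"), "1234");
        File.WriteAllText(Path.Combine(dirB, "f.txt"), "abcd");

        var f1 = new FileInfo(Path.Combine(dirA, "f.txt"));
        var f2 = new FileInfo(Path.Combine(dirB, "f.txt"));

        Console.WriteLine(OldHash(f1) == OldHash(f2)); // True: guaranteed collision
        Console.WriteLine(NewHash(f1) == NewHash(f2)); // almost certainly False
    }
}
```

Every such pair collides under the old hash, so Except() must fall back to calling Equals on each colliding pair, which is where the slowdown comes from.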


sll
  • I am trying to compare files from 2 servers taken into 2 lists. All the program is about this. No other code is existing in this program other than to compare the 2 lists – superstar Aug 17 '11 at 19:16
  • Should a list of the different items be displayed to a user? And which version of .NET are you using? – sll Aug 17 '11 at 19:18
  • the files that are different from 2 lists should be logged onto a text file – superstar Aug 17 '11 at 19:19
  • But FullName uses the full path of the file, and for files on two different servers you cannot compare based on full name – superstar Aug 17 '11 at 19:25
  • @superstar : yep, in this case you're right, so the equality criteria are basically name and directory? Size, last modification time, etc. do not matter? – sll Aug 17 '11 at 19:27
  • The Name & directory were sufficient to distinguish the files. I didn't find a case where they didn't work well, so I am fine with it – superstar Aug 17 '11 at 19:29
  • BTW, how are you creating the FileInfo objects? Is it possible that both f1 and f2 reference the same object? In this case you can extend Equals() with ReferenceEquals(f1, f2) – sll Aug 17 '11 at 19:46
  • I am having a list of FileInfo Objects, so I am trying to change it to normal list with few values, as it may cause performance issue to load them – superstar Aug 17 '11 at 19:53
  • I am trying to get rid of FileInfo objects, to check if the memory issue is a prob. – superstar Aug 17 '11 at 21:07
  • Except() already works in the way that you are suggesting. The problem is the OP's hash code, which doesn't take the directory name into account. If he uses Except() with your equality methods in his FileComparer it should improve performance – saus Aug 18 '11 at 04:27
  • I am not sure about the performance issue with hashcode, but the code works pretty well to distinguish the same file name in different folders as well (any given case) – superstar Aug 18 '11 at 12:56
  • 1
    @superstar : basically if you apply one of the proposed algorithms your function will execute much quicker so hopefully you will get rid of mentioned issue – sll Aug 18 '11 at 13:09
  • @sllev: Your suggestions helped me in the right direction. Thank you very much for your time and your effort to edit the answer, with my changing requirements. Good Luck :) – superstar Aug 18 '11 at 16:58
1

I haven't tried this myself, but here is an idea: implement list1 as a HashSet, this way:

HashSet<FileInfo> List1 = new HashSet<FileInfo>(myFileCompare);

Add all files:

foreach(var file in files)
{
    List1.Add(file);
}

Then remove elements:

List1.ExceptWith(list2);

Then enumerate:

foreach(var file in List1)
{
    //do something
}

I think it's faster, but as I said, I haven't tried it. Here is a link with general information on HashSet.

Edit: Or better yet, you can initialize and add data in one step:

HashSet<FileInfo> List1 = new HashSet<FileInfo>(files, myFileCompare);
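Put together (hypothetical code, with strings standing in for FileInfo so it runs standalone), the whole flow looks like:

```csharp
using System;
using System.Collections.Generic;

class ExceptWithDemo
{
    static void Main()
    {
        var files = new List<string> { "a.txt", "b.txt", "c.txt" };
        var list2 = new List<string> { "b.txt" };

        // Initialize and add the data in one step (a custom comparer could
        // be passed as a second constructor argument, as above)
        var list1 = new HashSet<string>(files);

        // Remove everything that also appears in list2, in place
        list1.ExceptWith(list2);

        foreach (var file in list1)
            Console.WriteLine(file); // a.txt and c.txt (order not guaranteed)
    }
}
```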
Francisco
-1

I'd recommend removing the length from the hash code and just hashing fi.FullName. That still satisfies the uniqueness guideline, though there may be hash collisions (in cases where you think length is needed to distinguish files). But that is probably preferable to a longer Except() execution. Similarly, change your equality comparison from name and directory to FullName; that would probably be more performant as well.
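A sketch of such a FullName-based comparer (hypothetical code; note the caveat raised in the comments that full paths differ across servers, so this only works when both lists share a root):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical comparer along the lines suggested above:
// one string comparison and one string hash instead of three fields
class FullNameCompare : IEqualityComparer<FileInfo>
{
    public bool Equals(FileInfo f1, FileInfo f2)
    {
        if (ReferenceEquals(f1, f2)) return true;
        if (f1 == null || f2 == null) return false;
        return f1.FullName == f2.FullName;
    }

    public int GetHashCode(FileInfo fi)
    {
        return fi.FullName.GetHashCode();
    }
}

class Demo
{
    static void Main()
    {
        var comparer = new FullNameCompare();
        string path = Path.Combine(Path.GetTempPath(), "a.txt");
        // FileInfo does not require the file to exist to expose FullName
        Console.WriteLine(comparer.Equals(new FileInfo(path), new FileInfo(path))); // True
    }
}
```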

McKay
  • As the OP already said in comments (see the comments on my answer), the files are from different servers, so the full paths are different – sll Aug 17 '11 at 19:48
  • I believe hash collisions only matter when you store items in some kind of hashtable; in this case we have two straightforward lists – sll Aug 17 '11 at 20:12
  • 1
    @sllev Yes, I am aware that hash collisions could happen, and even stated as much, but isn't that better if hash code generation is causing performance problems? Sometimes things have the same hash code; that's fine. The important part is that items that are equal have the same hash code. – McKay Aug 17 '11 at 21:43
  • 1
    I mentioned collisions because the OP does not use any hashtable to store the items; he uses lists, and collisions only matter when you store items in a hashtable/set/map. I believe this is why someone downvoted your answer, but I'm really not sure; it's just an assumption. Anyway, given the recent changes and the ideas with sets, your answer makes sense, so the downvoter should revert his vote – sll Aug 17 '11 at 21:46