0

I have a couple of generic list one with more than 70,000 items and another one with 10 items. I have to find out the items that are not there in second list when comparing with first list. Logically, my code is working fine and I don't face any serious performance issues with smaller set of list. As my first list is more than 70K items, i am facing serious of performance issue. It takes lot of time to execute and get the result.

My question is? Is there any better way of doing this? I cannot live with this performance issue. any suggestions for improvement? I am using C#, .NET 3.5

List<Employee> existingEmployeeList = List of 70K employees;
List<Employee> validEmployeeList = List of 10 employees;

var emloyeeDeletedFilterList = existingEmployeeList.Where(m => !validEmployeeList.Any(p => p.EmployeeId == m.EmployeeId
                            && p.FirstName == m.FirstName
                            && p.Age == m.Age
                            && p.LastName == m.LastName));

I have another operations to find which are newly added to list.

var emloyeeAddedFilterList = validEmployeeList.Where(m => !existingEmployeeList.Any(p => p.EmployeeId == m.EmployeeId
                                && p.FirstName == m.FirstName
                                && p.Age == m.Age
                                && p.LastName == m.LastName));

I have 4 conditions in the where clause to filter employee list.

Edited my question: added one more code snippet

nimi
  • 5,359
  • 16
  • 58
  • 90
  • You should be able to speed things up quite a bit if you re-arrange the conditions to check from most restrictive to least restrictive conditions. Since most of 70K employees are going to pass the `!validEmployeeList.Any` filter, you might as well put it at the end of the conditions. – Sergey Kalinichenko Dec 12 '15 at 02:30
  • What is the primary key of `Employee` entity? Do not forget database basics just because you are using EF. :) . – Kosala W Dec 12 '15 at 04:36

3 Answers3

1

Write a custom EqualityComparer<Employee> to compare on your 4 fields then use .Intersect( to do

var emloyeeFilterList = validEmployeeList.Intersect(existingEmployeeList, new EmployeeComparer()).ToList();

I think it will be faster to call .Intersect( on validEmployeeList instead of existingEmployeeList however I would test it both ways to be sure.

UPDATE:

Oops, misinterpeted what you wanted. The query you would want is to use is Except not Intersect.

If you want all existing employees except valid ones

var emloyeeFilterList = existingEmployeeList.Execpt(validEmployeeList, new EmployeeComparer()).ToList();

or if you want all valid employees except the existing ones.

var emloyeeFilterList = validEmployeeList.Execpt(existingEmployeeList, new EmployeeComparer()).ToList();

Also here is a example of a how to write EmployeeComparer

public class EmployeeComparer : EqualityComparer<Employee>
{
    public override bool Equals(Employee x, Employee y)
    {
         return x.EmployeeId == y.EmployeeId
             && x.FirstName == y.FirstName
             && x.Age == y.Age
             && x.LastName == y.LastName
    }

    //Implmentation taken from http://stackoverflow.com/questions/263400/what-is-the-best-algorithm-for-an-overridden-system-object-gethashcode
    public override int GetHashCode(Employee obj)
    {
        unchecked // Overflow is fine, just wrap
        {
            int hash = (int) 2166136261;
            // Suitable nullity checks etc, of course :)
            hash = hash * 16777619 ^ obj.EmployeeId.GetHashCode();
            hash = hash * 16777619 ^ obj.FirstName.GetHashCode();
            hash = hash * 16777619 ^ obj.Age.GetHashCode();
            hash = hash * 16777619 ^ obj.LastName.GetHashCode();
            return hash;
        }
    }
}
Scott Chamberlain
  • 124,994
  • 33
  • 282
  • 431
  • I need to find the items that are not there in existing employee list. Will intersect help me doing it? I am wondering, this will give only the common collections between validEmployeesList and existingEmployeeList right? – nimi Dec 12 '15 at 03:25
  • @nimi oh, sorry updated my answer, you would use `Except`. You would use `Intersect` if you had `validEmployeeList.Any(` instead of `!validEmployeeList.Any(`. I also included a example implementation of `EmployeeComparer` in case you needed help writing it. – Scott Chamberlain Dec 12 '15 at 05:40
  • Wouldn't HashSet be much faster? – Yytsi Dec 12 '15 at 06:40
  • 1
    @TuukkaX Except and Intercect are using [hash sets internally](http://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs,e289e6c98881b2b8) (The class `Set` you see there is a special implementation of a HashSet that is optimized for use inside LINQ). So no, they would not be faster because they use special HashSets that are faster than the one in `System.Collections.Generic` – Scott Chamberlain Dec 12 '15 at 06:49
1

Consider using a HashSet to store the elements. Hash Set requires you to overload the hash function of your class.

Also, a CPU can compare integers much faster than any other type. Consider implementing more integer comparisons when iterating through elements.

If you are interested in measuring performance, read about Big O methodology of evaluating the relative performance of a program.

Happy Holidays

Jake Psimos
  • 640
  • 3
  • 10
0

For what I can interpret from your code sample, it seems that you just only need to check for 'EmployeeId'. If that isn't the case, you can try the following to avoid using LinQ as it isn't as fast as explicit iterations.

List<Employee> employeeFilterList = new List<Employee>();

for(int a = validEmployeeList.Count; --a >= 0; )
{
    for(int b = existingFieldMappingList.Count; --b >= 0; )
    {
        Employee aEmployee = validEmployeeList[a];
        Employee bEmployee = validEmployeeList[b];
        if (aEmployee.EmployeeId != bEmployee.EmployeeId)
            continue;
        if (aEmployee.FirstName != bEmployee.FirstName)
            continue;
        if (aEmployee.Age != bEmployee.Age)
            continue;
        if (aEmployee.LastName != bEmployee.LastName)
            continue;

        employeeFilterList.Add(aEmployee);
    }
}

EDIT: Keep in mind that you can re-order the IF-conditions in such way the first ones are the most likely to skip to the next iteration.

  • I need to check all the conditions. – nimi Dec 12 '15 at 03:16
  • Ok, can you see a performance gain using the example I provided? If not, I could provide a better approach if you provide how those lists are filled. – Nicolas Rikus Dec 12 '15 at 03:18
  • I have edited my question to add more information. I am doing the changes suggested by you. Will keep you posted. – nimi Dec 12 '15 at 03:20
  • nic, do you have any other solution? I am getting existingemployee list from database and validemployeeList is generated from the UI. – nimi Dec 12 '15 at 03:42
  • LINQ is very fast as long as you are using the correct tool for the job, Instead of using a `foo.Where( x=> !bar.Any(foo.Equals(bar))` use a `foo.Execpt(bar)` and you will get huge performance gains. – Scott Chamberlain Dec 12 '15 at 05:38