1

today I use this to get a list of persons that is not in list A but in list B. It works but seem to take a very long time to get the result. Is there a faster way that do the same?

    var missingPeople = listofPersons.Where(p => allUsedPersons.All(p2 => p2.Id != p.Id)).ToList(); 
MTplus
  • 2,077
  • 4
  • 34
  • 51

4 Answers4

7
  • Your current implementation has O( n * m ) time complexity.

    • Where n is the cardinality of listofPersons.
    • Where m is the cardinality of allUsedPersons.
    • So if you have 500 listofPersons and 200 allUsedPersons your code will take 100,000 checks. That is bad.
      • This is because Linq's Where will run for every item in listofPersons, and inside your Where you have allUsedPersons.All, which will run the p2.Id != p.Id check for every item in allUsedPersons.
  • Instead, use a HashSet<T> to build a set of known values in O(n) time - which then lets you perform exists checks in O(1) time.

    • So if you have 500 listofPersons and 200 allUsedPersons my code below will take only 500 checks.
    • 100,000 vs 500: spot the difference.
HashSet<Int32> allPeopleIds = listofPersons.Select( p => p.Id ).ToHashSet();

List<Person> missingPeople = allUsedPersons
    .Where( p => !allPeopleIds.Contains( p.Id ) )
    .ToList();

In relational-algebra (or is it relational-calculus?) what you're doing is known as an anti-join and Linq supports it via the Except method, however you would need to define a custom-comparator as Linq doesn't yet have an ExceptBy method (but MoreLinq does, though).

Dai
  • 141,631
  • 28
  • 261
  • 374
  • Why `Int32` instead of `int`? – Dominik Jul 13 '21 at 09:32
  • @Dominik That's my personal preference. `Int32` and `int` are always identical: https://stackoverflow.com/questions/62503/should-i-use-int-or-int32 – Dai Jul 13 '21 at 09:32
  • @TimSchmelter Good catch, thanks! – Dai Jul 13 '21 at 09:33
  • Overriding `GetHashCode` in the `Person` class (supposed the lists item are instances of such a custom class) would possibly boost performance. Also overriding Equals to a value comparison and evaluating only the fields/props considered relevant, would be worth wasting one or two thoughts. The uniqueness of the hashcode is the key to performance for Except, Union, Intersect and most of the HashSet operations. – lidqy Jul 13 '21 at 10:30
  • @lidqy Overriding GetHashCode is **completely irrelevant** in this situation and **will not** “boost performance” at all. – Dai Jul 13 '21 at 13:21
  • @lidqy Your comment is lit - but I didn’t post a code answer concerning `Except` ;) - I used `ToHashSet` passing only `Int32` values - so overriding those methods is indeed irrelevant. And my imagination is indeed very precious - you shouldn’t go around imagination-shaming contributors. – Dai Jul 13 '21 at 14:22
  • @Dai In the last section of your answer you mentioned Except and also your preferred solution using Where and HashSet.Contains invokes GetHashCode frequently (each time it invokes Contains). So it's simply untrue that the GetHashCode wouldn't affect performance here. A benchmark will proof. EDIT is true for an HashSet of Ints you dont need an override (it's also impossible) But the OPs example was about some 'Person' entity. – lidqy Jul 13 '21 at 14:31
  • 1
    @Dai Your approach is way faster compared to what I did. Thanks for the solution and the explanation – MTplus Jul 13 '21 at 16:18
2

Another option is to provide a custom, reusable IEqualityComparer<Person>:

public class PersonIdComparer : IEqualityComparer<Person>
{
    public bool Equals(Person x, Person y)
    {
        return x?.Id == y?.Id;
    }

    public int GetHashCode(Person obj)
    {
        return obj?.Id ?? int.MinValue;
    }
}

You can use it for many LINQ methods. In this case you should use it for Except:

var missingPeople = listofPersons.Except(allUsedPersons, new PersonIdComparer()).ToList(); 

Except is quite efficient since it uses a collection similar to HashSet.

Tim Schmelter
  • 450,073
  • 74
  • 686
  • 939
-1

I think below method can help you for improve performance:

List<int> listOfIds = allUsedPersons.select(c => c.Id).ToList();
var missingPeople = listofPersons.Where(r=>!listOfIds.Contains(r.Id));
Parth M. Dave
  • 853
  • 1
  • 5
  • 16
  • 1
    This answer is still `O(n*m)` because `List.Contains` is still `O(n)`. You need to use a hashtable-based collection (like `HashSet` or `Dictionary` for `O(1)` performance). Alternatively, using a tree-based collection will give you `O( log n )` performance, which is still an improvement, though not as good for this particular application. – Dai Jul 13 '21 at 09:30
  • 1
    thanks @Dai for your explanation i really not aware about it thanks alot. – Parth M. Dave Jul 14 '21 at 08:45
-1

Putting something like allUsedPersons.All() within predicate will unnecessarily multiply the number of iterations. Prepare the list of required field(here p.id) beforehand and use it inside predicate.

Assuming allUsedPersons is list A, below would be faster

var usedPersonsId = allUsedPersons.Select(p => p.id).ToList();

var missingPeople = listofPersons.Where(p => !usedPersonsId.Contains(p.Id)).ToList(); 
as-if-i-code
  • 2,103
  • 1
  • 10
  • 19
  • This answer _still_ suffers from the same `O(n*m)` time-complexity as the other answer: don't use `ToList()`. Charitably, I'll add that if you were to sort `usedPersonsId` then you could use a binary-search instead of the `Contains` check and that would give you `O( log n )` runtime - which is better, but still not the best `O(n)` runtime. – Dai Jul 13 '21 at 09:36