Where LINQ on HashSet vs. List

Question

I need to count the elements of a list/set having property with given value. The list is huge and I need the performance as good as possible. Should I use a list or a set (when having unique elements)? Is there any faster way?

int counter = myList.Where(x => x.A == myValue || x.B == myValue).Count()

This is already inside of AsParallel().ForAll() for another huge list. And no, I can't change that.

Edit

I already saw this question and it does definitely not solve my problem, I am interested on the differences in (P)LINQ queries.

_Should I use a list or a set ?_ Have you actually tried timing them ? — Mat, May 16 '19 at 14:59
Possible duplicate of [HashSet vs. List performance](https://stackoverflow.com/questions/150750/hashset-vs-list-performance) — Leo, May 16 '19 at 15:02
Using `AsParallel().ForAll()` probably eradicates any chance of improving performancy by *preventing* proper parallelization. As for *this* particular query, neither class offers any search speedups since you have to read each item to extract the A and B values. A `HashSet` is slower to traverse though. — Panagiotis Kanavos, May 16 '19 at 15:17
What does the *full* query do? Why not let PLINQ parallelize the entire query instead fo building lists locally? A well-written query could allow PLINQ to partition the source data and use all cores to process the partitions in a single pass. The snippet you posted suggests that complex and slow scans are made for each item in the source data instead. — Panagiotis Kanavos, May 16 '19 at 15:20
...how about `Array` => `Where` => `Length`? would that be faster? — Leo, May 16 '19 at 15:36
Traversing a `List` or `Array` by `for (int idx` will be faster than using `Where`, but it really depends on surrounding code, and how `myList` is created/changed as to whether it would make your whole process faster. There is no substitute for actually profiling and making sure you are speeding up something that matters, or timing to make sure you are actually speeding up. — NetMage, May 16 '19 at 20:27

Sergey Kalinichenko · Accepted Answer · 2019-05-16T20:34:05.160

3

If you are walking a collection in its entirety, walking the entire list is likely to yield better performance than walking the entire set because of the way list elements are allocated in memory (assuming that you are using List<T>, not a linked list).

If you are performing thousands of such queries on the same data in myList, you could get performance improvements by building three look-up tables - on x.A, x.B, and on the common value when x.A == x.B:

var countByA = myList
    .GroupBy(x => x.A)
    .ToDictionary(g => g.Key, g => g.Count());
var countByB = myList
    .GroupBy(x => x.B)
    .ToDictionary(g => g.Key, g => g.Count());
var countByAandB = myList
    .Where(x => x.A == x.B)
    .GroupBy(x => x.A)
    .ToDictionary(g => g.Key, g => g.Count());

Now your query can be converted to three look-ups using the inclusion-exclusion principle:

countByA.TryGetValue(myValue, out var counterA);
countByB.TryGetValue(myValue, out var counterB);
countByAandB.TryGetValue(myValue, out var counterAandB);
int counter = counterA + counterB - counterAandB;

edited May 16 '19 at 20:34

answered May 16 '19 at 15:06

Sergey Kalinichenko

714,442
84
1,110
1,523

I'm sorry, can you please explain `int counter = countByAttributesAandB[(x.A, x.B)];`? – StackLloyd May 16 '19 at 15:19
1

`(x.A, x.B)` constructs a tuple from the A, B values. The dictionary uses tuples and counts as keys. BTW if `myList` is big (thousands of items) you can use `AsParallel()` to speed up the grouping and dictionary construction, eg `myList.AsParallel().GroupBy(...).ToDictionary(...)` – Panagiotis Kanavos May 16 '19 at 15:22
1

@Antiohia what dasblinkenlight did is essentially an *index* on `myList` based on both A and B. You can create separate indexes instead for each attribute. This way the total count would be the sum of the values returned by each "index". You *have* to explain the overall context though. Do you really need the counts? Or just to check for existence? – Panagiotis Kanavos May 16 '19 at 15:27
I think the confusion is in your initialization for `counter` - you need to use `myvalue`, not `x.A` and `x.B`. – NetMage May 16 '19 at 20:25
@NetMage You are right! With this came a realization that I don't need a tuple at all, because the code cares only about pairs where `x.A == x.B`. Thank you! – Sergey Kalinichenko May 16 '19 at 20:37
3

@NetMage Applying the inclusion-exclusion principle requires subtracting the number of objects where both `x.A` and `x.B` are equal to `myValue`, which means that my lookup table do not need to store counts of objects where `x.A != x.B`. – Sergey Kalinichenko May 16 '19 at 20:58
2

This is a brilliant answer! – Theodor Zoulias May 16 '19 at 23:00

Where LINQ on HashSet vs. List

1 Answers1