3

I have a loop of about 7000 objects, and within the loop I need to get a distinct count of a list of structs. Currently I am using:

foreach (var product in productsToSearch)
{
    Console.WriteLine("Time elapsed: {0} start", stopwatch.Elapsed);
    var cumulativeCount = 0;
    productStore.Add(product);
    var orderLinesList = totalOrderLines
        .Where(myRows => productStore.Contains(myRows.Sku))
        .Select(myRows => new OrderLineStruct
        {
            OrderId = myRows.OrderId,
            Sku = myRows.Sku
        });
    var differences = totalOrderLines.Except(orderLinesList);
    cumulativeCount = totalOrderLinesCount - differences.Select(x => x.OrderId).Distinct().Count();
    cumulativeStoreTable.Rows.Add(product, cumulativeCount);      
    Console.WriteLine("Time elapsed: {0} end", stopwatch.Elapsed);
}

public struct OrderLineStruct
{
    public string OrderId { get; set; }
    public string Sku { get; set; }
}

This is extremely slow when getting the distinct count. Does anyone know of a more efficient way of doing this? I have tried MoreLinq, which has a DistinctBy method for LINQ, but it is no more efficient - I have timed it. I have also played around with PLINQ, but I am a little unsure of where to parallelize this query.

Each iteration of the loop is timed at:
Time elapsed: 00:00:37.1142047 start
Time elapsed: 00:00:37.8310148 end

= 0.7168101 secs × 7000 = 5017.6707 secs (83.627845 minutes)

It's the Distinct().Count() line which is taking the most time to process (around 0.5 secs). The variable differences holds a few hundred thousand OrderLineStructs, so doing any LINQ queries on it is slow.

UPDATE

I have modified the loop a bit, and now it runs in around 10 minutes rather than over an hour:

foreach (var product in productsToSearch)
{
    var cumulativeCount = 0;
    productStore.Add(product);
    var orderLinesList = totalOrderLines
        .Join(productStore, myRows => myRows.Sku, p => p, (myRows, p) => myRows)
        .Select(myRows => new OrderLineStruct
        {
            OrderId = myRows.OrderId,
            Sku = myRows.Sku
        });
    totalOrderLines = totalOrderLines.Except(orderLinesList).ToList();
    cumulativeCount = totalOrderLinesCount - totalOrderLines.Select(x => x.OrderId).Distinct().Count();
    cumulativeStoreTable.Rows.Add(product, cumulativeCount);
}

Having a .ToList() on the Except seems to make a difference, and removing the already processed orders after each iteration increases the performance of each subsequent iteration.

svick
James Dev
  • Well what does the struct look like? A [mcve] would really help - we really don't have much clue for what's going on at the moment. – Jon Skeet Feb 23 '16 at 15:28
  • Make it multi-threaded. – Blue Eyed Behemoth Feb 23 '16 at 15:29
  • Knowing what the struct looks like or what "really slow" means (one second? Ten seconds? Ten minutes?) would be nice. – Arve Systad Feb 23 '16 at 15:29
  • `This is extremely slow..`, can you quantify this? What is it you are expecting and what is it now? Outside of just the code there are also lots of other possible variables like hardware specs and other processes utilizing the CPU/memory. Also everything @Jon said. – Igor Feb 23 '16 at 15:30
  • You are not using `product`, so why are you doing this in the loop at all? So my advice is: move it out of the loop and it's already 7000 times faster. – Tim Schmelter Feb 23 '16 at 15:30
  • @TimSchmelter I'm going to guess the "few hundred thousand" differences come from the `product`. If not, you're very very right. – Arve Systad Feb 23 '16 at 15:34
  • Updated the question with more information – James Dev Feb 23 '16 at 15:38
  • @TimSchmelter I am using product. I've updated the question. – James Dev Feb 23 '16 at 15:41
  • The evaluation happens at Count() – Rohit Feb 23 '16 at 15:42
  • What is `totalOrderLines`? What are you actually trying to achieve? You are using three collections but we don't know what purpose they have. – Tim Schmelter Feb 23 '16 at 15:42
  • The code provided does not really explain what you are trying to achieve. Can you provide the rest of the method, or if not, some pseudo-code of what you are trying to achieve? The main issue is that the for-loop you have provided tells us nothing. – Michael Coxon Feb 23 '16 at 15:43
  • @JamesDev No, you are not using `product`, you are using `productStore`. Why not add all the products to the productStore, and move the LINQ outside of the foreach? – Maarten Feb 23 '16 at 15:46
  • @JamesDev Why are you projecting to the struct, when immediately after you are projecting to the OrderId? Why not project immediately to the id. Probably won't make that much of a difference though. – Maarten Feb 23 '16 at 15:47
  • Full code sample added. Any way to make differences.Select(x => x.OrderId).Distinct().Count() faster as this is the bottleneck, differences has a few hundred thousand OrderLineStruct – James Dev Feb 23 '16 at 15:49
  • None of your LINQ, including your first query containing a nested `Contains` in a `Where`, is actually evaluated until you call `Count` - are you *sure* `Distinct` is the bottleneck? – Preston Guillot Feb 23 '16 at 15:54
  • @PrestonGuillot I've timed each part and the Distinct().Count() line is taking around 0.5 secs – James Dev Feb 23 '16 at 15:58
  • Is `productStore` a `List`? – Kote Feb 23 '16 at 16:22
  • @Kote var productStore = new HashSet(); – James Dev Feb 23 '16 at 16:27
  • How did you time each part? Unless you piped them into something that will actually execute the query (e.g. a plain `Count()` without `Distinct()`) they're bound to be the fastest part as they only set up the query to be ready to run. – Jon Hanna Feb 23 '16 at 16:33
  • @JonHanna I printed out the stopwatch elapsed before and after the line. – James Dev Feb 23 '16 at 16:36
  • So before and after the line that actually does something? What time do you get if you just call `foreach(var di in differences){}`? – Jon Hanna Feb 23 '16 at 16:53

4 Answers

2

You are seeking the problem at the wrong place.

The orderLinesList, differences and differences.Select(x => x.OrderId).Distinct() are just chained LINQ to Objects query methods with deferred execution; it is the Count() call that actually executes them all.
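To see what deferred execution means in practice, here is a minimal standalone sketch (illustrative only, not the code from the question): the Where predicate does not run at all until something enumerates the chain.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DeferredExecutionDemo
{
    static void Main()
    {
        int evaluations = 0;
        var numbers = new List<int> { 1, 2, 2, 3 };

        // Building the query chain touches no elements yet.
        var distinctEven = numbers
            .Where(n => { evaluations++; return n % 2 == 0; })
            .Distinct();

        Console.WriteLine(evaluations);          // 0 - nothing has executed

        // Count() enumerates the whole chain; all the work happens here.
        Console.WriteLine(distinctEven.Count()); // 1 (just the value 2)
        Console.WriteLine(evaluations);          // 4 - Where ran once per element
    }
}
```

This is why timing the line that builds the query makes it look cheap: the cost has simply been deferred to the line that calls Count().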

Your processing algorithm is highly inefficient. The bottleneck is the orderLinesList query, which iterates the whole totalOrderLines list for each product; and this query is chained into (included in) the Except, Distinct, etc. - again inside the loop, i.e. 7000+ times.

Here is a sample efficient algorithm that IMO does the same:

Console.WriteLine("Time elapsed: {0} start", stopwatch.Elapsed);
var productInfo =
(
    from product in productsToSearch
    join line in totalOrderLines on product equals line.Sku into orderLines
    select new { Product = product, OrderLines = orderLines }
).ToList();
var lastIndexByOrderId = new Dictionary<string, int>();
for (int i = 0; i < productInfo.Count; i++)
{
    foreach (var line in productInfo[i].OrderLines)
        lastIndexByOrderId[line.OrderId] = i; // Last wins
}
int cumulativeCount = 0;
for (int i = 0; i < productInfo.Count; i++)
{
    var product = productInfo[i].Product;
    foreach (var line in productInfo[i].OrderLines)
    {
        int lastIndex;
        if (lastIndexByOrderId.TryGetValue(line.OrderId, out lastIndex) && lastIndex == i)
        {
            cumulativeCount++;
            lastIndexByOrderId.Remove(line.OrderId);
        }
    }
    cumulativeStoreTable.Rows.Add(product, cumulativeCount);
    // Remove the next line if it was just there to support your processing
    productStore.Add(product);
}
Console.WriteLine("Time elapsed: {0} end", stopwatch.Elapsed);
Ivan Stoev
  • Thanks Ivan, but this is not quite what I'm looking for, though it's along the right lines. I have a product list of 7000 products and I need to iterate through it one by one, adding each to some storage (a hashset). After each iteration I use the hashset to find the total order lines - so, for example, I search for product 1 after the first iteration, then product 1 and product 2 after the second iteration. I then need to find the order lines for those products, then find the differences between those order lines and totalOrderLines (distinct by order id). Hope this makes sense? – James Dev Feb 24 '16 at 09:37
  • That's exactly what my algorithm does, just in a different way. In your example, instead of DistinctOrderCount(all items) - DistinctOrderCount(all items except (product 1 + product 2)), it does DistinctCount(product 1 + product 2). Isn't that producing the same result? – Ivan Stoev Feb 24 '16 at 09:48
  • Unfortunately it isn't the correct output. So the reason is that for example on the first iteration if order 1 has product 1 and product 2. Your algorithm matches product 1 and counts it as valid but mine will not as product 2 is missing from the store. Hope this makes sense. – James Dev Feb 24 '16 at 09:54
  • I see. Have to think a bit more. But I guess you got the idea: the initial group join does the necessary separation, then the point is to somehow eliminate (replace with something) the excessive operations inside the loop. I think `Except` will not be needed if we make `productInfo` a list (by calling `ToList()`) and then iterate it by index, so the *differences* will be all the items after the current index. So the tricky part is how to eliminate `Distinct` and `Count` (if possible at all). – Ivan Stoev Feb 24 '16 at 10:18
  • I see you accepted the answer. Meanwhile I've updated it with logic that is still fast and hopefully this time correct. Please check it out and let me know if it works for you. – Ivan Stoev Feb 24 '16 at 11:38
  • Your code seemed to produce the same results as before. I have updated my question as I found a way to greatly reduce the processing time of the loops. It may not be the best solution but I am happy with the processing time now. Thanks – James Dev Feb 24 '16 at 12:06
  • Interesting. Is there a way you can represent all this as pure LINQ (or Lambda expressions - this I prefer)? – Eniola Feb 25 '16 at 14:58
1

I recommend changing this portion of your LINQ query:

totalOrderLines.Where(myRows => productStore.Contains(myRows.Sku))

to a Join, so that it reads:

totalOrderLines.Join(productStore, myRows => myRows.Sku, p => p, (myRows, p) => myRows)

This way you pay the cost once, rather than having Contains traverse your product store 7,000 times, which is very inefficient. Also, if it is possible to make your ids integral data types (int, long) rather than strings, you should get faster searches and comparisons too. But I guess the structure of your model is pretty much set.

svick
Eniola
1

In your case, as Jon Hanna mentioned, the bottleneck is the Except method.
Distinct and Count are of secondary importance.
You can verify this by forcing enumeration of each part of the method and putting a stopwatch around each part.

foreach (var product in productsToSearch)
{
    var cumulativeCount = 0;
    productStore.Add(product);

    olSw.Start();
    var orderLinesList = totalOrderLines
        .Where(myRows => productStore.Contains(myRows.Sku))
        .Select(myRows => new OrderLineStruct
        {
            OrderId = myRows.OrderId,
            Sku = myRows.Sku
        }).ToList();
    olSw.Stop();

    exSw.Start();
    var differences = totalOrderLines.Except(orderLinesList).ToList();
    exSw.Stop();

    dcSw.Start();
    cumulativeCount = totalOrderLinesCount - differences.Select(x => x.OrderId).Distinct().Count();
    dcSw.Stop();
}

Measurements:
productsToSearch count 100
totalOrderLines count 300 000

Total olSw time: 00:00:01.3583340
Total exSw time: 00:00:14.3304959
Total dcSw time: 00:00:04.1986018

The exSw time can be reduced by an explicit implementation of GetHashCode (and Equals) on OrderLineStruct.
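One possible implementation - a sketch only, not necessarily the exact code behind the timings below, and the hash combination constants are arbitrary. Implementing IEquatable&lt;T&gt; as well lets Except and Distinct use it without boxing, instead of the reflection-based ValueType defaults:

```csharp
using System;

public struct OrderLineStruct : IEquatable<OrderLineStruct>
{
    public string OrderId { get; set; }
    public string Sku { get; set; }

    // Value equality over both fields.
    public bool Equals(OrderLineStruct other) =>
        OrderId == other.OrderId && Sku == other.Sku;

    public override bool Equals(object obj) =>
        obj is OrderLineStruct && Equals((OrderLineStruct)obj);

    // Combine both fields' hashes; equal values must hash equally.
    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + (OrderId?.GetHashCode() ?? 0);
            hash = hash * 31 + (Sku?.GetHashCode() ?? 0);
            return hash;
        }
    }
}
```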

With explicit GetHashCode:

Total olSw time: 00:00:01.4045676
Total exSw time: 00:00:08.4691066
Total dcSw time: 00:00:03.9439711

Total time change without redundant enumeration:
Default GetHashCode Total time: 00:00:18.9649790
Explicit GetHashCode Total time: 00:00:12.7736320

Update:
You can also optimize this by changing the method's logic.

For example, you can create a HashSet from totalOrderLines and then just remove items from it.

var orderLinesList = totalOrderLines
    ... 
    .ToList();

orderLinesList.ForEach(item => totalOrderLines.Remove(item));

cumulativeCount = totalOrderLinesCount - totalOrderLines.Select(x => x.OrderId).Distinct().Count();

In my case it reduces total time to 7 seconds.
Total time: 00:00:07.0851111

In this case, enumerating totalOrderLines with Distinct is the bottleneck, but it takes O(N) time, which is fine.

Kote
  • Do you have a faster way to compare two lists and select the differences, rather than using Except? – James Dev Feb 23 '16 at 17:38
  • I'm afraid not. According to this answer http://stackoverflow.com/a/2799543/3872935 `Except` method is nice enough. – Kote Feb 23 '16 at 17:45
  • You can create `HashSet` from `totalOrderLines` and then just remove items from set `orderLinesList.ForEach(item => totalOrderLines.Remove(item));` instead of using `Except`. It will work faster, but it's a logical change. – Kote Feb 23 '16 at 18:08
0

Where does totalOrderLines originate from? An MSSQL database, maybe? If so, you will need an index on the OrderId column. Doing a Distinct() without an index on that column forces the DB engine to scan all rows to identify the distinct values.
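For example (the table and index names here are hypothetical - adjust them to your schema):

```sql
CREATE NONCLUSTERED INDEX IX_OrderLines_OrderId
    ON dbo.OrderLines (OrderId);
```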

Wicher Visser
  • totalOrderLines is a List which is created from a db query so there is no link between this and the database. – James Dev Feb 23 '16 at 16:53