4

I have the following lists:

RakeSnapshots, ProductMovements

Aim is to process the both and get the count of elements that match a condition, as follows:

  • Consider RakeSnapshots with StatusCode == "Dumping"

  • Consider ProductMovement with Status == "InProgress"

  • Fetch the count of all elements both lists, which meet the condition RakeSnapshots.RakeCode equal to ProductMovements.ProductCode

Following are my current options:

// Code 1:

 var resultCount =  ProductMovements.Where(x => RakeSnapshots
                                                .Where(r => r.StatusCode == "Dumping")
                                                .Any(y => y.RakeCode == x.ProductCode  && 
                                                          x.Status == "InProgress"))
                                                .Count();

// Code 2:

var productMovementsInprogress = ProductMovements.Where(x => x.Status == "InProgress");

var rakeSnapShotsDumping = RakeSnapshots.Where(r => r.StatusCode == "Dumping");

var resultCount = productMovementsInprogress.Zip(rakeSnapShotsDumping,(x,y) => (y.RakeCode == x.ProductCode) ?  true : false)
                                            .Where(x => x).Count();

Challenge is both the codes are O(n^2) complexity, is there a way to improve it, this will hurt if the data is very large

Mrinal Kamboj
  • 11,300
  • 5
  • 40
  • 74
  • What does `RakeSnapshots - RakeCode equal to ProductMovements - ProductCode` mean? – Matthew Watson Aug 24 '16 at 07:59
  • We can compare `RakeCode` to `ProductCode` as shown in the examples in the question. Edited, there was copy and paste mistake – Mrinal Kamboj Aug 24 '16 at 08:00
  • The second variant is not equivalent of the first. Does the first return the intended results? – Ivan Stoev Aug 24 '16 at 08:01
  • So do you mean `RakeSnapshots.RakeCode equal to ProductMovements.ProductCode` ? Because `RakeSnapshots - RakeCode equal to ProductMovements - ProductCode` implies some kind of subtraction. – Matthew Watson Aug 24 '16 at 08:02
  • @Matthew Watson apology for that confusion, `RakeCode` belongs to `RakeSnapshots` entity and `ProductCode` belongs to `ProductMovements`, you have understood correctly – Mrinal Kamboj Aug 24 '16 at 08:04
  • @Ivan Stoev Yes the first variant produce the correct result, in fact both the variants apply similar data filters, therefore shall lead to similar results, though second one I haven't tested – Mrinal Kamboj Aug 24 '16 at 08:06
  • @Ivan Stoev first variant is already implemented code, I have realized we can modify that to take the filter `x.Status == "InProgress"` to reduce the overall data set for processing – Mrinal Kamboj Aug 24 '16 at 08:13
  • Can you confirm whether this is linq-to-objects, linq-to-entities, or something else? – Steve Cooper Aug 24 '16 at 09:05
  • @Steve Cooper Linq to Objects – Mrinal Kamboj Aug 24 '16 at 09:47

3 Answers3

3

Sounds like Group Join which (as well as Join) is the most efficient LINQ way of correlating two sets:

var resultCount = ProductMovements.Where(p => p.Status == "InProgress")
    .GroupJoin(RakeSnapshots.Where(r => r.StatusCode == "Dumping"), 
        p => p.ProductCode, r => r.RakeCode, (p, match) => match)
    .Count(match => match.Any());

The time complexity of the above is O(N+M).

Ivan Stoev
  • 195,425
  • 15
  • 312
  • 343
3

You can use an inner join to do this:

var dumpingRakeSnapshots       = rakeSnapshots.Where(r => r.StatusCode == "Dumping");
var inProgressProductMovements = productMovements.Where(p => p.Status == "InProgress");

var matches =
    from r in dumpingRakeSnapshots
    join p in inProgressProductMovements on r.RakeCode equals p.ProductCode
    select r;

int count = matches.Count(); // Here's the answer.

Note that (as Ivan Stoev points out) this only works if RakeCode is the primary key of RakeSnapshots.

If it is not, you will have to use a grouped join.

Here's the Linq query syntax version that you should use in that case, but note that this is exactly the same as Ivan's answer (only in Linq query form):

var matches =
    from r in dumpingRakeSnapshots
    join p in inProgressProductMovements on r.RakeCode equals p.ProductCode into gj
    select gj;

For completeness, here's a compilable console app that demonstrates the different results you'll get if RakeCode and ProductCode are not primary keys:

using System;
using System.Collections.Generic;
using System.Linq;

namespace ConsoleApp1
{
    class RakeSnapshot
    {
        public string StatusCode;
        public string RakeCode;
    }

    class ProductMovement
    {
        public string Status;
        public string ProductCode;
    }

    sealed class Program
    {
        void run()
        {
            var rakeSnapshots = new List<RakeSnapshot>
            {
                new RakeSnapshot {StatusCode = "Dumping", RakeCode = "1"},
                new RakeSnapshot {StatusCode = "Dumping", RakeCode = "1"},
                new RakeSnapshot {StatusCode = "Dumping", RakeCode = "2"}
            };

            var productMovements = new List<ProductMovement>
            {
                new ProductMovement {Status = "InProgress", ProductCode = "1"},
                new ProductMovement {Status = "InProgress", ProductCode = "2"},
                new ProductMovement {Status = "InProgress", ProductCode = "2"}
            };

            var dumpingRakeSnapshots       = rakeSnapshots.Where(r => r.StatusCode == "Dumping");
            var inProgressProductMovements = productMovements.Where(p => p.Status == "InProgress");

            // Inner join.

            var matches1 =
                from r in dumpingRakeSnapshots
                join p in inProgressProductMovements on r.RakeCode equals p.ProductCode
                select r;

            Console.WriteLine(matches1.Count());

            // Grouped join.

            var matches2 =
                from r in dumpingRakeSnapshots
                join p in inProgressProductMovements on r.RakeCode equals p.ProductCode into gj
                select gj;

            Console.WriteLine(matches2.Count());

            // OP's code.

            var resultCount = 
                productMovements
                .Count(x => rakeSnapshots
                .Where(r => r.StatusCode == "Dumping")
                .Any(y => y.RakeCode == x.ProductCode && x.Status == "InProgress"));

            Console.WriteLine(resultCount);
        }

        static void Main(string[] args)
        {
            new Program().run();
        }
    }
}
Matthew Watson
  • 104,400
  • 10
  • 158
  • 276
  • 1
    It really depends if OP code #1 is producing the desired result. Since it's using `Any`, then you should use `join into` with `where`, turning it basically to query syntax for `GroupJoin`. – Ivan Stoev Aug 24 '16 at 08:27
  • Although if the joining fields are unique in their respective sets, then simple `inner join` as in your answer will do it :) – Ivan Stoev Aug 24 '16 at 08:30
  • @IvanStoev That's a good point. I'll update my answer to include the query syntax version of a group join (but that part will be equivalent to your answer at that point). – Matthew Watson Aug 24 '16 at 08:47
  • Actually what you said in your (deleted) previous comment makes perfect sense. And I was not quite correct - if `RakeCode` is the PK of `RakeSnapshots` (e.g. is unique in that set), the simple `inner join` would do. +1 – Ivan Stoev Aug 24 '16 at 08:52
  • Why is this faster? The join operation on linq-to-objects still needs to iterate through all the left table, then for each object on the left, iterate through all the items on the right, for an O(n.m) time complexity. – Steve Cooper Aug 24 '16 at 10:58
  • @SteveCooper I believe that `join` uses some kind of hashing to reduce the complexity to O(N+M) (see [this answer](http://stackoverflow.com/a/2799543/106159)). Of course, the OP should perform tests to see if it really is going to be much more efficient to use a join. – Matthew Watson Aug 24 '16 at 12:57
  • Yes, you can find the implementation of Enumerable.Join() on referencesource, and discover [the `GetGrouping()` method](http://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs,57b0aa9f1c572776,references) used to implement it, which does a hash lookup. – Matthew Watson Aug 24 '16 at 13:04
  • That's neat. Thanks, @MatthewWatson – Steve Cooper Aug 24 '16 at 14:16
2

Normally, with an O(N^2), you'd look to create an intermediate 'search' data structure which speeds up the lookup. Something like a hash table for O(1) access, or a sorted list for O(log N) access.

Technically, you have two different lists, so the actual order would be O(P.R), where P is the number of product movements, and R is the number of rake snapshots.

In your case, this is your original code;

var resultCount =  ProductMovements
    .Where(x => RakeSnapshots
        .Where(r => r.StatusCode == "Dumping")
        .Any(y => y.RakeCode == x.ProductCode  && 
                  x.Status == "InProgress"))
        .Count();

Is O(P.R) because for each P, the inner where clause is looping through every R. I'd look to creating a Dictionary<T> or HashSet<T>, then transforming your code to something like

var rakeSnapshotSummary = ... magic happens here ...;
var resultCount =  ProductMovements
    .Where(x => rakeSnapshotSummary[x.ProductCode] == true)
    .Count();

In this way, creating the snapshot is O(R), lookup into the data structure is O(1), and creating the result is O(P), for a much healthier O(P+R). I thing that's is as good as it can be.

So my suggestion for your indexing routine would be something like;

var rakeSnapshotSummary = new HashSet<string>(RakeSnapshots
    .Where(r => r.StatusCode == "Dumping")
    .Select(r => r.RakeCode));

This creates a HashSet<string> which will have O(1) time complexity for testing existance of a rake code. Then your final line looks like

var resultCount =  ProductMovements
    .Where(x => x.Status == "InProgress" && rakeSnapshotSummary.Contains(x.ProductCode))
    .Count();

So overall, O(P+R) or, roughly, O(2N) => O(N).

Steve Cooper
  • 20,542
  • 15
  • 71
  • 88
  • 1
    Thanks, I did consider a similar design, but I required a pure linq solution, which is required by a framework for run time execution and don't have a scope for intermediary data structure – Mrinal Kamboj Aug 24 '16 at 11:43