2

I have two large lists of items whos class look like this (both lists are of same type):

public class Items
{
 public string ItemID { get; set; }
 public int QuantitySold { get; set; }
}


var oldList = new List<Items>(); // oldList

var newList = new List<Items>(); // new list

The old list contains items from database and the new list represents items fetched from API;

Both lists can be very large with 10000+ items in each (20000 total)

I need to compare items from newList against the items from "oldList" and see which items that have same itemID value, are of different "QuantitySold" value, and those that are of different "QuantitySold" value should be stored in third list called "differentQuantityItems".

I could just simply do double foreach list and compare values but since both of the lists are large the performance with double foreach loop is terrible and I can't do it...

Can someone help me out with this?

@YamamotoTetsua I'm already using a IEqualityComparer to get the desired result, however it doesn't gives the results that I'm expecting. Here is why...I have a first IEqualityComparer which looks like this:

 public class MissingItemComparer : IEqualityComparer<SearchedUserItems>
    {
        public static readonly IEqualityComparer<SearchedUserItems> Instance = new MissingItemComparer();

        public bool Equals(SearchedUserItems x, SearchedUserItems y)
        {
            return x.ItemID == y.ItemID;
        }

        public int GetHashCode(SearchedUserItems x)
        {
            return x.ItemID.GetHashCode();
        }
    }

The usage of this IEqualityComparer basically gives me items from newList that are not present in my database like following:

var missingItems= newItems.Except(competitor.SearchedUserItems.ToList(), MissingItemComparer.Instance).ToList();

Now in this list I will have the list of items which are new from API and are not present in my DB...

Second IEqualityComparer is based on the different QuantitySold from old and new list:

public class ItemsComparer : IEqualityComparer<SearchedUserItems>
    {
        public static readonly IEqualityComparer<SearchedUserItems> Instance = new ItemsComparer();
        public bool Equals(SearchedUserItems x, SearchedUserItems y)
        {
            return (x.QuantitySold == y.QuantitySold);
        }
        public int GetHashCode(SearchedUserItems x)
        {
            return x.ItemID.GetHashCode();
        }
    }

Usage example:

var differentQuantityItems = newItems.Except(competitor.SearchedUserItems.ToList(), ItemsComparer.Instance).ToList();

The issue with these two equality comparers is that first one will for example return these itemID's that are missing:

123124124

123124421

512095902

And they indeed are missing from my oldList... However the second IEQualityComparer will also return these items as differentQuantity items, they indeed are, but the aren't present in the oldList.. So they shouldn't be included in the second list.

User987
  • 3,663
  • 15
  • 54
  • 115
  • 1
    "performance with double foreach loop is terrible and I can't do it." Did you check this by **measuring** or is this just an **assumption**? – MakePeaceGreatAgain Mar 13 '19 at 09:17
  • @HimBromBeere yes tested multiple times, when both of the lists are large execution time is >30 seconds – User987 Mar 13 '19 at 09:18
  • I would say. Join then in a third list and filter based if quanitty from the first equal the quantity from the second or doesn't equal the quanitty from the second this way you have a third list with the values you want – npo Mar 13 '19 at 09:22
  • 3
    Use some fast hash based lookup - e.g. LINQ `Join`, `Dictionary` etc. – Ivan Stoev Mar 13 '19 at 09:22
  • Probably you should look at these issues for comparing two lists with same type based on specific property: https://stackoverflow.com/questions/19790211/comparing-two-lists-according-to-specific-properties & https://stackoverflow.com/questions/23581379/compare-two-lists-of-object-for-new-changed-updated-on-a-specific-property. Both issues provide LINQ over foreach loops. – Tetsuya Yamamoto Mar 13 '19 at 09:22
  • It would be awesome if you could provide a [mcve] with sample inputs and **most importantly** expected outputs. – mjwills Mar 13 '19 at 09:22
  • 1
    Have you tried GetHashCode ? Hashcode Comparison is faster in general. – abhishek Mar 13 '19 at 09:23
  • Guys, I already use IEQualityComparer, first IEqualityComparer returns items that are not present in the oldList, which is fine! Only the issue is with the second one where the second IEqualityComparer returns items with different quantity + the ones that are missing :( – User987 Mar 13 '19 at 09:38
  • How fast does `var results = oldList.Join(newList, left => left.ItemID, right => right.ItemID, (left, right) => new { left, right }).Where(z => z.left.QuantitySold != z.right.QuantitySold).Select(z => z.left);` perform? – mjwills Mar 13 '19 at 09:40
  • `public bool Equals(SearchedUserItems x, SearchedUserItems y) { return (x.QuantitySold == y.QuantitySold); }` really should compare the `ItemID` as well. Otherwise you'll have some fun if two different `ItemID`'s hash to the same hash code. – mjwills Mar 13 '19 at 09:41
  • 1
    I think you´re mixing two points here. You can of course check only for **equality**. Equality will only match **common** properties. You can´t use it to find **differences**. – MakePeaceGreatAgain Mar 13 '19 at 09:47
  • 1
    Are ItemIDs unique in oldList and newList? – Steve Mar 13 '19 at 10:01

4 Answers4

6

This is a perfect candidate for LINQ Join:

var differentQuantityItems =
    (from newItem in newList
     join oldItem in oldList on newItem.ItemID equals oldItem.ItemID
     where newItem.QuantitySold != oldItem.QuantitySold
     select newItem).ToList();

This will return all new items which have corresponding old item with different QuantitySold. If you want to also include the new items without corresponding old item, then use left outer join:

var differentQuantityItems =
    (from newItem in newList
     join oldItem in oldList on newItem.ItemID equals oldItem.ItemID into oldItems
     from oldItem in oldItems.DefaultIfEmpty()
     where oldItem == null || newItem.QuantitySold != oldItem.QuantitySold
     select newItem).ToList();

In both cases, join operator is used to quickly correlate the items with the same ItemID. Then you can compare QuantitySold or any other properties.

Ivan Stoev
  • 195,425
  • 15
  • 312
  • 343
  • Hey thanks for the reply Ivan! :) Compared to the IEqualityComparer, how fast would this work ? :) – User987 Mar 13 '19 at 09:45
  • @User987 How should we know? Yours is the measure. – MakePeaceGreatAgain Mar 13 '19 at 09:46
  • But does it return both values from the two list or just the first matching? – Steve Mar 13 '19 at 09:48
  • 1
    @User987 All I can say that this has O(N+M) time complexity, which is times faster than nested loops O(N*M). I don't see a reason of using custom comparers while standard operator with the standard comparer can do the same. – Ivan Stoev Mar 13 '19 at 09:49
  • @Steve If that's needed, the `select` part of the queries can easily be modified to do that, since we have both old and new item at that point. – Ivan Stoev Mar 13 '19 at 09:51
  • No it doesn't contain both values also I suppose that the two lists contain UNIQUES ids and not duplicate ones (for example oldList doesn't contain two equal ids and the same id is present in newList) – Steve Mar 13 '19 at 10:00
  • 1
    @Ivan Stoev I think this is O(n log(n) + m log(m)) instead of O(n + m) – Gabriel Lima Mar 13 '19 at 10:05
  • 1
    @GabrielLima Hmm, and where `log` is coming from? Looks like you are assuming ordered lookups. However hash lookups (used by LINQ) are O(1), so I still believe it's O(N+M) – Ivan Stoev Mar 13 '19 at 10:09
  • @Ivan Stoev I'm assuming the join is sorting the collections "prior to joining them". But that is an implementation detail, its just a guess. – Gabriel Lima Mar 13 '19 at 10:10
  • @GabrielLima Nope. Ordered implementations require `IComparer`. Hash based implementations require `IEqualityComparer`. You can see from `Join` and `GroupJoin` overloads that they use the later. Also the documentation says *The default equality comparer, Default, is used to hash and compare keys.* :) – Ivan Stoev Mar 13 '19 at 10:16
  • @IvanStoev For some reason if I set up 1 item to be of different quantity, returns 2 items in the list after exectution ? xD – User987 Mar 13 '19 at 10:28
  • @IvanStoev Both of the ItemID's are same, somehow it duplicated the item ? It's the first version you posted... – User987 Mar 13 '19 at 10:36
  • 1
    @User987 The only way join can duplicate new item is if the old items contains duplicate ItemIDs. I'm assuming that the ItemID should be unique in both lists. Can you verify that `list.Count == list.Select(e => e.ItemID).Distinct().Count()` holds for both `newItems` and `oldItems`? – Ivan Stoev Mar 13 '19 at 11:09
1

From a big-O complexity point of view, just comparing the lists in a nested for loop would be in the class of O(n*m), being n the size of the list in the DB, and m the size of the list fetched from the API.

What you can do to improve your performance is to sort the two lists, that would cost O(n log(n) + m log(m)), and then you could find the new items in O(n + m). Therefore, the overall complexity of your algorithm would then be in the class of O(n log(n) + m log(m)).

Here's an idea of the time it would take, comparing the quadratic solution to the superlinear one.

Gabriel Lima
  • 348
  • 3
  • 11
1

This code will run in less than a second, even if there are no matches at all (also less than a second if everything is a match).

It will return all items that exists in both lists (i.e. same ItemID) but with a different QuantitySold.

using System;
using System.Collections.Generic;
using System.Linq;

namespace ConsoleApp5
{
    class Program
    {
        public class Items
        {
            public string ItemID { get; set; }
            public int QuantitySold { get; set; }
        }

        static void Main(string[] args)
        {
            // Sample data
            var oldList = new List<Items>();
            oldList.AddRange(Enumerable.Range(0, 20000).Select(z => new Items() { ItemID = z.ToString(), QuantitySold = 4 }));

            var newList = new List<Items>();
            newList.AddRange(Enumerable.Range(0, 20000).Select(z => new Items() { ItemID = z.ToString(), QuantitySold = 5 }));

            var results = oldList.Join(newList,
                                            left => left.ItemID,
                                            right => right.ItemID,
                                            (left, right) => new { left, right })
                                .Where(z => z.left.QuantitySold != z.right.QuantitySold).Select(z => z.left);

            Console.WriteLine(results.Count());
            Console.ReadLine();
        }
    }
}

The use of z.left means only one of the items will be returned - if you want both the old and the new, instead use:

var results = oldList.Join(newList,
                                left => left.ItemID,
                                right => right.ItemID,
                                (left, right) => new { left, right })
                    .Where(z => z.left.QuantitySold != z.right.QuantitySold)
                    .Select(z => new[] { z.left, z.right })
                    .SelectMany(z => z);
mjwills
  • 23,389
  • 6
  • 40
  • 63
  • You essentially are replicating my answer with LINQ method syntax. Which btw shows why LINQ query syntax is much better readable when query contains joins. – Ivan Stoev Mar 13 '19 at 09:57
  • I can see how you may have come to that conclusion @IvanStoev. Apologies if you think I plagiarised (I didn't - it just took me some time to write). There are some subtle differences. I covered the scenario of what if **both** of the values were expected to be returned. I also provided some sample data in the question itself, to make it easier for the OP to benchmark. Nonetheless, as you say, the two solutions are very similar. Your answer is definitely great - which is why I upvoted it. – mjwills Mar 13 '19 at 10:00
0

You can think of using Except clause with custom written IEqualityComparer something like below

var oldList = new List<Item>(); // oldList
var newList = new List<Item>(); // new list
var distinctList = newList.Except(oldList,new ItemEqualityComparer()).ToList();



class ItemEqualityComparer : IEqualityComparer<Item>
        {
            public bool Equals(Item i1, Item i2)
            {
                if (i1.ItemID == i2.ItemID && i1.QuantitySold != i2.QuantitySold)
                    return false;
                return true;
            }

            public int GetHashCode(Item item)
            {
                return item.ItemID.GetHashCode();
            }
        }

public class Item
        {
            public string ItemID { get; set; }
            public int QuantitySold { get; set; }
        }
Asif
  • 329
  • 1
  • 7