0

I have two List<T> objects (where T is the same type for both objects), and I need to be able to determine whether they contain the same set of values, even if the values aren't in the same order.

Do the objects have any built-in mechanisms to accomplish this, or do I need to write my own algorithm?

Or perhaps, should I be using a different type of collection, rather than List<T>?

If I were to write my own algorithm, it would probably consist of the following steps - I'll try to optimize this in the final version, if I go this route:

  • Do the two collections contain the same number of values? If not return false.
  • Count the number of times each value appears in each collection, return false if the counts aren't equal.
  • If I reach the end of both collections without any inequality in value counts, return true.

I know there are some caveats to this, such as the fact that T has to be comparable - I'm using the default comparison for now (e.g. .Equals() ) with appropriate constraints defined for the generic type.

Giffyguy
  • 20,378
  • 34
  • 97
  • 168
  • Do you definitely need to be able to handle duplicates? – Jon Skeet Oct 20 '15 at 19:53
  • Do the objects have some sort of compare operator? If so, you can sort them then perform a simple comparison. Or add elements to a hashset or dictionary (which only allows unique elements). What if the two lists have the same items, just one list has a duplicate? How do you want to handle that? – iheanyi Oct 20 '15 at 19:53
  • @JonSkeet Yeah, duplicates have to be supported. – Giffyguy Oct 20 '15 at 19:54
  • @iheanyi: Hashset allows you to add the same element multiple times, but because it is a set, only distinct elements will be found in the Hashset. The net effect is that using this method, two lists with the same values but where some values occur a different number of times between the lists would compare as equal. – Eric J. Oct 20 '15 at 19:55
  • And ordering needs to be preserved? What size are these collections? How efficient does this need to be, in memory or speed? – Jon Skeet Oct 20 '15 at 19:55
  • @Giffyguy Then you're not comparing the objects to see if they have the same *set* of values. Sets don't have duplicates. At this point you're comparing two bags to see if they're equal. It's an important distinction. – Servy Oct 20 '15 at 19:55
  • @iheanyi The objects have equality operators and hashcode generators, but they don't have < or > type comparisons. If the two lists have the same items, but one has an extra duplicate, than I would consider that "not equal." – Giffyguy Oct 20 '15 at 19:55
  • @Servy Ah, yeah, thanks for the catch. I should use the correct terminology. – Giffyguy Oct 20 '15 at 19:56
  • @JonSkeet Ordering does not need to be preserved. I always want to plan for the collections getting pretty large, let's say hundreds of thousands of elements at the top end. My web application would like to be able to do this as quickly as possible, naturally - but I'll use whatever CPU time and memory space is required to get the operation correct and consistent. – Giffyguy Oct 20 '15 at 19:59
  • @EricJ. that's why my comment asked about duplicates, I was a bit unsure as to what exactly the OP wanted. – iheanyi Oct 20 '15 at 20:00
  • 1
    Please see [this question on CodeReview](http://codereview.stackexchange.com/questions/68288/faster-way-of-comparing-generic-sets). I'd recommend to pay attention on [this answer](http://codereview.stackexchange.com/a/68425/39186). – Dmitry Oct 20 '15 at 20:05
  • @Dmitry I don't think I can use that solution, since my values aren't unique. I probably should avoid using the term "set" in such instances. :) – Giffyguy Oct 20 '15 at 20:10
  • @Giffyguy The proposed solution doesn't require uniqueness of values. Please look carefully. – Dmitry Oct 20 '15 at 20:15
  • @Dmitry OK, I'll read through it again with this in mind. – Giffyguy Oct 20 '15 at 20:19
  • If you are writing unit tests [`CollectionAssert.AreEquivalent(`](https://msdn.microsoft.com/en-us/library/ms243779.aspx) does this operation. – Scott Chamberlain Oct 20 '15 at 20:24
  • @Giffyguy I should clarify. I'm talking about [this (my) answer](http://codereview.stackexchange.com/a/68425/39186), but not about the accepted one. The accepted answer doesn't accept duplicates. :) – Dmitry Oct 20 '15 at 20:26
  • @ScottChamberlain Oh interesting ... I wonder if I could repurpose that outside of unit tests? I'd have to use a try/catch block I suppose, and I'd want to determine whether they decided to make the operation memory heavy vs CPU heavy. – Giffyguy Oct 20 '15 at 20:26
  • @Dmitry Yeah I was looking at your answer, I just glossed over it a little too quickly. :) Thanks for sharing, I may end up using that solution - also thanks for providing the explicit performance metrics in the answer, very good information. – Giffyguy Oct 20 '15 at 20:28
  • @Dmitry Do you think CollectionAssert.AreEquivalent() would perform similarly to your solution? I'd like to hear your opinion if you have time. – Giffyguy Oct 20 '15 at 20:31
  • I believe they are very similar. You could test them for performance to be sure. Btw, there is another solution based on the `CollectionAssert.AreEquivalent` approach: [Comparing two collections for equality irrespective of the order of items in them](http://stackoverflow.com/questions/50098/comparing-two-collections-for-equality-irrespective-of-the-order-of-items-in-the). – Dmitry Oct 20 '15 at 20:39
  • @Dmitry I ended up using your solution from Code Review. Thanks! – Giffyguy Oct 21 '15 at 19:01

3 Answers3

1

Based on available information, I suspect the most efficient solution that supports duplicates is to

  • Compare the size of the two lists. If unequal, return false. If equal,
  • Sort both lists. If you must retain ordering of the original lists, sort a copy of the lists instead.
  • Compare elements at each position of the sorted lists, returning false if the value at a given position is unequal. Return true if you have compared all elements without finding a difference.

Note that I have assumed that sufficient memory is available for the duration of this operation to create a sorted duplicate of the lists (should order preservation be a requirement).

Eric J.
  • 147,927
  • 63
  • 340
  • 553
1

So we'll start out with just a simple SetEquals, and go from there. HashSet already has an implementation of such a method that can compare two sets for equality, so we can just create a wrapper around that so that we can use it with any type of sequence:

public static bool SetEquals<T>(this IEnumerable<T> first, IEnumerable<T> second,
    IEqualityComparer<T> comparer = null)
{
    return new HashSet<T>(second, comparer ?? EqualityComparer<T>.Default)
        .SetEquals(first);
}

Next, to account for the fact that you have a bag, not a set, we can just take the two sequences that you have, group them, and project that out into a pair that has the item along with the count of matching items. If we do that for both sets then we can compare these sequences of objects as sets, and see if they're set equal. If the key-count pair sequences are both set-equal, then the original sequences are bag-equal:

public static bool BagEquals<T>(
    this IEnumerable<T> first,
    IEnumerable<T> second)
{
    Func<IEnumerable<T>, IEnumerable<KeyValuePair<T, int>>> groupItems =
        sequence => sequence.GroupBy(item => item,
            (key, items) => new KeyValuePair<T, int>(key, items.Count()));
    return groupItems(first)
        .SetEquals(groupItems(second));
}
Servy
  • 202,030
  • 26
  • 332
  • 449
  • I like this approach, especially the Linq GroupBy idea. I'd like to compare this to Dmitry's suggestion. – Giffyguy Oct 20 '15 at 20:20
  • @Giffyguy They're more or less the same. The main difference is that solution can potentially short circuit a bit earlier than mine can. So both will do very close to the same amount of work whenever the sets are the same, and when they're different they use pretty much the same approach, but that solution occasionally realizes the sets are different a bit sooner. Both always need to process the entire first sequence; it's only a question of how much of the second sequence gets processed. – Servy Oct 20 '15 at 20:22
1

Here is a reimplementation of CollectionAssert.AreEquivalent (reference code was decompiled with DotPeek) however instead of throwing a exception it returns a bool.

public class CollectionMethods
{
    public static bool AreEquivalent(ICollection expected, ICollection actual)
    {
        //We can do a few quick tests we can do to get a easy true or easy false.
        //Is one collection null and one not?
        if (Object.ReferenceEquals(expected, null) != Object.ReferenceEquals(actual, null))
            return false;

        //Do they both point at the same object?
        if (Object.ReferenceEquals(expected, actual))
            return true;

        //Do they have diffrent counts?
        if (expected.Count != actual.Count)
            return false;

        //Do we have two empty collections?
        if (expected.Count == 0)
            return true;


        //Ran out of easy tests, now have to do the slow work.
        int nullCount1;
        Dictionary<object, int> elementCounts1 = CollectionMethods.GetElementCounts(expected, out nullCount1);
        int nullCount2;
        Dictionary<object, int> elementCounts2 = CollectionMethods.GetElementCounts(actual, out nullCount2);

        //One last quick check, do the two collections have the same number of null elements?
        if (nullCount2 != nullCount1)
        {
            return false;
        }

        //Check for each element and see if we see them the same number of times in both collections.
        foreach (KeyValuePair<object,int> kvp in elementCounts1)
        {
            int expectedCount = kvp.Value;
            int actualCount;
            elementCounts2.TryGetValue(key, out actualCount);
            if (expectedCount != actualCount)
            {
                return false;
            }
        }
        return true;
    }

    private static Dictionary<object, int> GetElementCounts(ICollection collection, out int nullCount)
    {
        Dictionary<object, int> dictionary = new Dictionary<object, int>();
        nullCount = 0;
        foreach (object key in (IEnumerable)collection)
        {
            if (key == null)
            {
                ++nullCount;
            }
            else
            {
                int num;
                dictionary.TryGetValue(key, out num);
                ++num;
                dictionary[key] = num;
            }
        }
        return dictionary;
    }
}
Scott Chamberlain
  • 124,994
  • 33
  • 282
  • 431