Subgroup of list

Question

I have a 2 list of string. Is there a simple way to find if one list contains all the strings of the 2nd list?

(By saying simple, I mean that I don't have explicitly compare for each string in one list to all the strings

score 5 · Answer 1 · answered Aug 20 '13 at 13:10

Use Enumerable.Except to find differences between lists. If there is no items in result, then all items from list2 are in list1:

bool containsAll = !list2.Except(list1).Any();

Internally Except uses Set<T> to get unique items from list1 and returns only that items from list2 which are not in set. If there is nothing to return, then all items in set.

Kamil Budziewski · Accepted Answer · 2013-08-20T13:15:00.007

3

Try this:

firstList.All(x=>secondList.Contains(x));

Shorter version (method group):

firstList.All(secondList.Contains)

You need to write using for Linq:

using System.Linq;

It ckeckes wheter All items from first list are in second list. Contains checks if given item is in list. All gives true if all items of collections are matching predicate. Given predicate is: if item is in second list, so whole expresion checks if all items are in second list <- proved working :)

edited Aug 20 '13 at 13:15

answered Aug 20 '13 at 13:09

Kamil Budziewski

22,699
14
85
105

1

LOL, linq questions go so fast these days – David Colwell Aug 20 '13 at 13:09
@DavidColwell especially those simple :) – Kamil Budziewski Aug 20 '13 at 13:10
1

Here also, more readable (and maybe more efficient) with a delegate instead of a lambda: `firstList.All(secondList.Contains)` – Tim Schmelter Aug 20 '13 at 13:16
@TimSchmelter, Gotta love going to a simple LINQ question and learning something. I didn't know that you could pass the method groups in, but it makes so much sense! – David Colwell Aug 20 '13 at 13:22

BrokenGlass · Answer 3 · 2013-08-20T13:16:08.407

3

For larger lists use a HashSet<T> (which results in linear Big O, as opposed to O(n^2) when just using two lists):

var hash = new HashSet<string>(list2);
bool containsAll = list1.All(hash.Contains);

edited Aug 20 '13 at 13:16

answered Aug 20 '13 at 13:11

BrokenGlass

158,293
28
286
335

2

Even shorter `list1.All(hash.Contains)` :) – Tim Schmelter Aug 20 '13 at 13:13
@TimSchmelter ah true, I always let Resharper remind me of method group conversions - much more readable too – BrokenGlass Aug 20 '13 at 13:15
For short lists, i think LINQ O(n^2) would be faster because of the setup time on the hashset, but i havent run the stats... would be interested to see how many results are required before it becomes viable – David Colwell Aug 20 '13 at 13:16
@DavidColwell For some perspective, creating the HashSet is about as expensive as two `Contains` calls on a `List`, so a `List` will likely only be faster if there are one or two items in the list; maybe up to as many as three or four. For collections *that* small, performance is irrelevant. As soon as the collection size is large enough to matter in any way, the `List` solution will be *much* slower. So the end result is that there's really no reason to ever use a `List` to do this. – Servy Aug 20 '13 at 13:30
@TimSchmelter, Was just commenting because the question commented that the search is O instead of O(N^2), realistically this will rarely be the source of performance problems – David Colwell Aug 20 '13 at 13:31
@Servy, See my answer. Your comment is incorrect when testing with a set of size 5,000,000 – David Colwell Aug 20 '13 at 14:23

David Colwell · Answer 4 · 2013-08-20T14:22:14.040

use LINQ

bool isSublistOf = list1.All(list2.Contains);

The All method returns true if the condition in the lambda is met for every element in the IEnumerable. The All is passed the Contains method of List2 as the Func<bool,string> which returns true if the element is found List2. The net effect is that the statement returns true if ALL of the elements in List1 are found in List2.

Performance Note
Due to the nature of the All operator, it is worst case O(n^2), but will exit at the first chance (any mismatch). Using completely random 8 byte strings, I tried out each of the cases using a rudimentary performance harness.

    static void Main(string[] args)
    {
        long count = 5000000;
        //Get 5,000,000 random strings (the superset)
        var strings = CreateRandomStrings(count);

        //Get 1000 random strings (the subset)
        var substrings = CreateRandomStrings(1000);

        //Perform the hashing technique
        var start = DateTime.Now;
        var hash = new HashSet<string>(strings);
        var mid = DateTime.Now;
        var any = substrings.All(hash.Contains);
        var end = DateTime.Now;
        Console.WriteLine("Hashing took " + end.Subtract(start).TotalMilliseconds + " " + mid.Subtract(start).Milliseconds + " of which was setting up the hash");

        //Do the scanning all technique
        start = DateTime.Now;
        any = substrings.All(strings.Contains);
        end = DateTime.Now;
        Console.WriteLine("Scanning took " + end.Subtract(start).TotalMilliseconds);

        //Do the Excepting technique
        start = DateTime.Now;
        any = substrings.Except(strings).Any();
        end = DateTime.Now;
        Console.WriteLine("Excepting took " + end.Subtract(start).TotalMilliseconds);
        Console.ReadKey();

    }

    private static string[] CreateRandomStrings(long count)
    {
        var rng = new Random(DateTime.Now.Millisecond);
        string[] strings = new string[count];
        byte[] bytes = new byte[8];

        for (long i = 0; i < count; i++) {
            rng.NextBytes(bytes);
            strings[i] = Convert.ToBase64String(bytes);
        }
        return strings;
    }

The result ranked them in the following order fairly consistently:

Scanning - ~38ms (list1.All(list2.Contains))
Hashing - ~750ms (749 of which was spent setting up the hashset)
Excepting - 1200ms

The excepting method takes much longer because it requires all the work up front. Unlike the other methods, it will not exit on a mismatch, but continue to process all elements. The Hashing is much faster, but also does significant work up front in setting up the hash. This would be the fastest method if the strings were less random and intersections were more certain.

Disclaimer
All performance tuning at this level is next to irrelevant. This is just a mental exercise only

So, lots of problems here. 1) you're running teach test *only once*. 2) There is no run to warm up the JIT 3) `DateTime` is *horrible* at measuring performance; it just doesn't have enough precision — Servy, Aug 20 '13 at 14:31
Sticking it in a loop results in comparable results. Feel free to post some negative results if you have any — David Colwell, Aug 20 '13 at 14:36
Try running it when `substrings` actually *is* a subset of `strings`. This can be done by, as it's creation process, taking 1000 random strings from `strings`. See how that affects your results. You can also test some "close but not quite" sets in which say 70-95% of the strings are taken from the `strings` collection and the remainder are not. Due to the nature of `All`, which is what we're doing, the first non-match means you're done. Your solution involves *most* not matching. The probability of the first item not matching is actually *very* high. — Servy, Aug 20 '13 at 14:46
Tracking the Intersect of the two, as the subset approaches a complete subset, the hashset rapidly becomes faster. — David Colwell, Aug 20 '13 at 14:49
@DavidColwell Exactly. You were measuring the best case of one algorithm against the worst case of the other, among the other problems. — Servy, Aug 20 '13 at 14:51
"Except" does not need to process all the items due to LINQ's deferred execution. "Any" will also stop immediately. See: http://stackoverflow.com/questions/10110013/order-of-linq-extension-methods-does-not-affect-performance — Tim Schmelter, Aug 20 '13 at 15:52

Subgroup of list

4 Answers4