Efficient way to find the difference between 2 IEnumerables

Question

I have

IEnumerable<Tuple<string, string>> allInfo

and IEnumerable<string> info1dim. What is a way to find effectively the diff between info1dim and first dim of allInfo. For example :

allInfo = {<"data1", "addinfo1">, <"data2", "addinfo2">, <"data3", "addinfo3">"

and

info1dim = {"data3", "data1", "data4"}

The result I expect is

{"diff4"}

What is the most efficient way to do that? I don't want to run two loops. The IEnumerables are huge (~100000 elements)

Assuming .NET has something similar to a Python `set`: Iterate the first enumerable and store all the keys in a set. Then iterate the second one and store them in another set. Now you can simply get the difference between the two sets (`a - b` and `b - a`) to find out which keys are only in the first/second enumerable. — ThiefMaster, Jun 07 '15 at 23:25
"I don't want to run two loops" is an artificial requirement. Your problem *needs* iterating over *both* enumerables. — Ondrej Tucny, Jun 07 '15 at 23:30
@OndrejTucny - you are right. let`s say - "i am looking for effective builtin approach" — YAKOVM, Jun 07 '15 at 23:38
I would seriously consider storing the data in a hashbased collection such as Dictionary (as suggested in my answer). With that you have almost instant look up, and you only need to loop over the IEnumerable once, you never actually loopthrough the dictionary. — BjarkeCK, Jun 07 '15 at 23:40
If the two lists are sorted, you can use two index counter i & j (one for each list). Then compare list1[i] and list2[j]. You have three cases 1) list1[i] == list2[j]. item is in both lists. Increment I & j 1) list1[i] > list2[j]. item is in 2 and not 1. Increment j not i. 2) list1[i] < list2[j] item is in 1 and not 2. increment i not j. — jdweng, Jun 07 '15 at 23:41

score 4 · Accepted Answer · edited Jun 07 '15 at 23:56

The C# HashSet collection has ExceptWith, UnionWith, and IntersectWith methods. What you want could be done like this.

        var set1 = new HashSet<string>(allinfo.Select(t => t.Item1));
        var set2 = new HashSet<string>(info1dim);

        var set1_but_not_set2 = new HashSet<string>(set1);
        set1_but_not_set2.ExceptWith(set2);

        var set2_but_not_set1 = new HashSet<string>(set2);
        set2_but_not_set1.ExceptWith(set1);

Be careful, though, HashSet is a mutable collection and these functions change the collection. You have O(n) operations here. Constructing the HashSet objects requires iterating; so do the ExceptWith operations.

score 2 · Answer 2 · edited May 23 '17 at 12:14

2

You could use a LINQ Except() like so:

info1dim.Except(allInfo.Select(i => i.Item1));

Note that Except() uses a HashSet<T> internally (as explained here) so this is still O(n).

edited May 23 '17 at 12:14

Community

1
1

answered Jun 07 '15 at 23:31

Matt Cole

2,491
17
21

I need info1dim \ allInfo – YAKOVM Jun 07 '15 at 23:34
Corrected (just reverse the statement). – Matt Cole Jun 07 '15 at 23:47

BjarkeCK · Answer 3 · 2015-06-07T23:44:48.950

1

Maybe something like this?

var diff = info1dim.Where(x => allInfo.Any(c => c.Item1 == x) == false);

If you store the IEnumerable<Tuple<string, string>> in a Dictionary<string,string> instead it would become ALOT faster! then you could write:

Dictionary<string,string> allInfo;
IEnumerable<string> info1dim;
var diff = info1dim.Where(x => allInfo.ContainsKey(x) == false);

edited Jun 07 '15 at 23:44

answered Jun 07 '15 at 23:27

BjarkeCK

5,694
5
41
59

Sehnsucht · Answer 4 · 2015-06-07T23:48:08.587

load your info1dim in a HashSet and use Remove foreach item in allInfo :

// n: size of info1dim ; m: size of allInfo
var diff = new HashSet<string> (info1dim); // O(n)
foreach (var tuple in allInfo)  // O(m)
    diff.Remove (tuple.Item1);  // O(1)

I didn't recall of ExceptWith existence before Ollie's answer ; after verifying at the source reference ExceptWith basically do the same (foreach -> Remove) and so should be better ; I keep my code as is as informative support tough

Efficient way to find the difference between 2 IEnumerables

4 Answers4