6

How can I find the set of items that occur in 2 or more sequences in a sequence of sequences?

In other words, I want the distinct values that occur in at least 2 of the passed in sequences.

Note: This is not the intersect of all sequences but rather, the union of the intersect of all pairs of sequences.

Note 2: The does not include the pair, or 2 combination, of a sequence with itself. That would be silly.

I have made an attempt myself,

public static IEnumerable<T> UnionOfIntersects<T>(
                                  this IEnumerable<IEnumerable<T>> source)
{
    var pairs =
            from s1 in source
            from s2 in source
            select new { s1 , s2 };

    var intersects = pairs
        .Where(p => p.s1 != p.s2)
        .Select(p => p.s1.Intersect(p.s2));

    return intersects.SelectMany(i => i).Distinct();
}

but I'm concerned that this might be sub-optimal, I think it includes intersects of pair A, B and pair B, A which seems inefficient. I also think there might be a more efficient way to compound the sets as they are iterated.


I include some example input and output below:

{ { 1, 1, 2, 3, 4, 5, 7 }, { 5, 6, 7 }, { 2, 6, 7, 9 } , { 4 } }

returns

{ 2, 4, 5, 6, 7 }

and

{ { 1, 2, 3} } or { {} } or { }

returns

{ }

I'm looking for the best combination of readability and potential performance.


EDIT

I've performed some initial testing of the current answers, my code is here. Output below.

Original valid:True
DoomerOneLine valid:True
DoomerSqlLike valid:True
Svinja valid:True
Adricadar valid:True
Schmelter valid:True
Original 100000 iterations in 82ms
DoomerOneLine 100000 iterations in 58ms
DoomerSqlLike 100000 iterations in 82ms
Svinja 100000 iterations in 1039ms
Adricadar 100000 iterations in 879ms
Schmelter 100000 iterations in 9ms

At the moment, it looks as if Tim Schmelter's answer performs better by at least an order of magnitude.

Community
  • 1
  • 1
Jodrell
  • 34,946
  • 5
  • 87
  • 124
  • 1
    Did this question come from [the one from yesterday](http://stackoverflow.com/questions/30102482/c-sharp-find-a-number-common-to-two-or-more-lists)? – Tim Schmelter May 08 '15 at 08:17
  • @TimSchmelter, yes it does, it turns out the the OP of that question wasn't asking this but it got me thinking. – Jodrell May 08 '15 at 08:18
  • However, i still don't get it. Can you describe in other words what "the union of the intersect of all pairs of sequences" mans? It's f.e. not clear why 6 is included in the desired result in the first example. – Tim Schmelter May 08 '15 at 08:22
  • @TimSchmelter, I edited the question, "I want the distinct values that occur at least 2 of the passed in sets." – Jodrell May 08 '15 at 08:27
  • @Jodrell, I think you are misunderstanding how IEnumerable and Linq work. The "fast" solutions in your test code are fast because they do not run at all (except in the first step where you calculate "valid") because you never evaluate the IEnumerable (try `results[i % 1000] = pair.Value(testData[0]).ToList();` to force evaluation). Adricadar's runs partially (`result.Distinct()` never runs, the rest does). Once you fix this, all the solutions are similar in speed (mine being the fastest). – svinja Dec 21 '15 at 14:32
  • @Jodrell ...and if you test on something like 200 lists of 200 random numbers instead of testing a small list 10000 times, you will notice my and Doomer's solutions are 2 orders of magnitude faster than others. And the Schmelter solution is the slowest of all by far... So your "fastest" solution is in fact the slowest, and your "slowest" solution is the fastest. ;) – svinja Dec 21 '15 at 14:43

5 Answers5

4
// init sequences
var sequences = new int[][]
    { 
        new int[] { 1, 2, 3, 4, 5, 7 },
        new int[] { 5, 6, 7 },
        new int[] { 2, 6, 7, 9 },
        new int[] { 4 }
    };

One-line way:

var result = sequences
    .SelectMany(e => e.Distinct())
    .GroupBy(e => e)
    .Where(e => e.Count() > 1)
    .Select(e => e.Key);

// result is { 2 4 5 7 6 }

Sql-like way (with ordering):

var result = (
          from e in sequences.SelectMany(e => e.Distinct())
          group e by e into g
          where g.Count() > 1
          orderby g.Key
          select g.Key);

// result is { 2 4 5 6 7 }

May be fastest code (but not readable), complexity O(N):

var dic = new Dictionary<int, int>();
var subHash = new HashSet<int>();
int length = array.Length;
for (int i = 0; i < length; i++)
{
    subHash.Clear();
    int subLength = array[i].Length;
    for (int j = 0; j < subLength; j++)
    {
        int n = array[i][j];
        if (!subHash.Contains(n))
        {
            int counter;
            if (dic.TryGetValue(n, out counter))
            {
                // duplicate
                dic[n] = counter + 1;
            }
            else
            {
                // first occurance
                dic[n] = 1;
            }
        }
        else
        {
            // exclude duplucate in sub array
            subHash.Add(n);
        }
    }
}
General-Doomer
  • 2,681
  • 13
  • 13
1

You can skip already Intesected sequences, this way will be a little faster.

public static IEnumerable<T> UnionOfIntersects<T>(this IEnumerable<IEnumerable<T>> source)
{
    var result = new List<T>();
    var sequences = source.ToList();
    for (int sequenceIdx = 0; sequenceIdx < sequences.Count(); sequenceIdx++)
    {
        var sequence = sequences[sequenceIdx];

        for (int targetSequenceIdx = sequenceIdx + 1; targetSequenceIdx < sequences.Count; targetSequenceIdx++)
        {
            var targetSequence = sequences[targetSequenceIdx];
            var intersections = sequence.Intersect(targetSequence);
            result.AddRange(intersections);
        }
    }

    return result.Distinct();
}

How it works?

Input: {/*0*/ { 1, 2, 3, 4, 5, 7 } ,/*1*/ { 5, 6, 7 },/*2*/ { 2, 6, 7, 9 } , /*3*/{ 4 } }

Step 0: Intersect 0 with 1..3

Step 1: Intersect 1 with 2..3 (0 with 1 already has been intersected)

Step 2: Intersect 2 with 3 (0 with 2 and 1 with 2 already has been intersected)

Return: Distinct elements.

Result: { 2, 4, 5, 6, 7 }

You can test it with the below code

var lists = new List<List<int>>
{
    new List<int> {1, 2, 3, 4, 5, 7},
    new List<int> {5, 6, 7},
    new List<int> {2, 6, 7, 9},
    new List<int> {4 }
};

var result = lists.UnionOfIntersects();
adricadar
  • 9,971
  • 5
  • 33
  • 46
1

This should be very close to optimal - how "readable" it is depends on your taste. In my opinion it is also the most readable solution.

        var seenElements = new HashSet<T>();
        var repeatedElements = new HashSet<T>();

        foreach (var list in source)
        {
            foreach (var element in list.Distinct())
            {
                if (seenElements.Contains(element))
                {
                    repeatedElements.Add(element);
                }
                else
                {
                    seenElements.Add(element);
                }
            }
        }

        return repeatedElements;
svinja
  • 5,495
  • 5
  • 25
  • 43
  • 1
    This is less optimal, since you need to hash 3 times, instead of twice with the counting approach in a single Dict. – azt May 08 '15 at 08:46
0

That should nail it:

int[][] test = { new int[] { 1, 2, 3, 4, 5, 7 }, new int[] { 5, 6, 7 }, new int[] { 2, 6, 7, 9 }, new int[] { 4 } };
var result = test.SelectMany(a => a.Distinct()).GroupBy(x => x).Where(g => g.Count() > 1).Select(y => y.Key).ToList();

First you make sure, there are no duplicates in each sequence. Then you join all sequences to a single sequence and look for duplicates as e.g. here.

Community
  • 1
  • 1
azt
  • 2,100
  • 16
  • 25
  • this is near identical to General-Doomer's existing answer http://stackoverflow.com/a/30118957/659190 – Jodrell May 08 '15 at 08:28
0

You can try this approach, it might be more efficient and also allows to specify the minimum intersection-count and the comparer used:

public static IEnumerable<T> UnionOfIntersects<T>(this IEnumerable<IEnumerable<T>> source 
    , int minIntersectionCount
    , IEqualityComparer<T> comparer = null)
{
    if (comparer == null) comparer = EqualityComparer<T>.Default;
    foreach (T item in source.SelectMany(s => s).Distinct(comparer))
    {
        int containedInHowManySequences = 0;
        foreach (IEnumerable<T> seq in source)
        {
            bool contained = seq.Contains(item, comparer);
            if (contained) containedInHowManySequences++;
            if (containedInHowManySequences == minIntersectionCount)
            {
                yield return item;
                break;
            }
        }
    }
}

Some explaining words:

  • It enumerates all unique items in all sequences. Since Distinct is using a set this should be pretty efficient. That can help to speed up in case of many duplicates in all sequences.
  • The inner loop just looks into every sequence if the unique item is contained. Thefore it uses Enumerable.Contains which stops execution as soon as one item was found(so duplicates are no issue).
  • If the intersection-count reaches the minum intersection count this item is yielded and the next (unique) item is checked.
Tim Schmelter
  • 450,073
  • 74
  • 686
  • 939
  • I think you might have a problem if a sequence contains duplicates. – Jodrell May 08 '15 at 09:11
  • @Jodrell: why? The outer foreach loops all unique items. Then all sequences are checked with `seq.Contains(item, comparer)`. `Contains` stops as soon as an item was found and increases `containedInHowManySequences` in that case. Enumeration stops if `minCount` is reached. – Tim Schmelter May 08 '15 at 09:13
  • agreed, I overlooked the break; – Jodrell May 08 '15 at 09:35
  • does it work with `{ { 1, 1, 4}, { 2, 3, 4 } }` which should return `{ 4 }`? Not `{ 1, 4 }`. – Jodrell May 08 '15 at 09:45
  • @Jodrell: Yes, this returns `{4}`: `new List>{new List{ 1, 1, 4}, new List{ 2, 3, 4 }}.UnionOfIntersects(2);` – Tim Schmelter May 08 '15 at 09:48
  • 1
    It looks like your algorithm is significantly faster. – Jodrell May 11 '15 at 10:56