6

What is the most efficient way to find a sequence within a IEnumerable<T> using LINQ

I want to be able to create an extension method which allows the following call:

int startIndex = largeSequence.FindSequence(subSequence)

The match must be adjacent and in order.

Wai Ha Lee
  • 8,598
  • 83
  • 57
  • 92
trampster
  • 8,598
  • 4
  • 37
  • 52
  • How large is largeSequence? And is this for practical use or conceptual? Because I can think of a couple ways that would be just fine on a relatively small (a few thousand) records, but wouldn't necessarily be pretty or work in larger environments. – Joe Enos Aug 24 '10 at 23:26
  • I would prefer something that would scale well with large sequences, The actual application is small (only a few hundred elements) however it will be going into our utility class so could be used for much larger sequences in the future. – trampster Aug 24 '10 at 23:29
  • What do you expect FindSequence to return? An index? True/False? The sub-sequence? Do the elements to be matched all have to be in order and adjacent? – Richard Anthony Hein Aug 24 '10 at 23:33
  • @richard I have updated the question to clarify – trampster Aug 24 '10 at 23:37
  • Just for reference, getting a true/false answer is easy enough http://stackoverflow.com/questions/407729/determine-if-a-sequence-contains-all-elements-of-another-sequence-using-linq answers that. Getting the starting index is a bit more complicated. – Richard Anthony Hein Aug 24 '10 at 23:51
  • @richard that answer actually does not care about order or being adjacent. So does not help here. – trampster Aug 25 '10 at 00:01
  • 2
    Yep, the matter of order and adjacency is what makes it a generalisation of string searching. If you stop considering strings as inherently based on a particular char type (which in some languages they aren't) one of the main practical differences left between a string and any other collection is that sorting a collection tends to make it at least as useful if not more, but sorting a string renders it useless ("hello world" != " dehllloorw"). Relatedly, subsequence and substring is essentially the same operation. – Jon Hanna Aug 25 '10 at 00:39

5 Answers5

3

Here's an implementation of an algorithm that finds a subsequence in a sequence. I called the method IndexOfSequence, because it makes the intent more explicit and is similar to the existing IndexOf method:

public static class ExtensionMethods
{
    public static int IndexOfSequence<T>(this IEnumerable<T> source, IEnumerable<T> sequence)
    {
        return source.IndexOfSequence(sequence, EqualityComparer<T>.Default);
    }

    public static int IndexOfSequence<T>(this IEnumerable<T> source, IEnumerable<T> sequence, IEqualityComparer<T> comparer)
    {
        var seq = sequence.ToArray();

        int p = 0; // current position in source sequence
        int i = 0; // current position in searched sequence
        var prospects = new List<int>(); // list of prospective matches
        foreach (var item in source)
        {
            // Remove bad prospective matches
            prospects.RemoveAll(k => !comparer.Equals(item, seq[p - k]));

            // Is it the start of a prospective match ?
            if (comparer.Equals(item, seq[0]))
            {
                prospects.Add(p);
            }

            // Does current character continues partial match ?
            if (comparer.Equals(item, seq[i]))
            {
                i++;
                // Do we have a complete match ?
                if (i == seq.Length)
                {
                    // Bingo !
                    return p - seq.Length + 1;
                }
            }
            else // Mismatch
            {
                // Do we have prospective matches to fall back to ?
                if (prospects.Count > 0)
                {
                    // Yes, use the first one
                    int k = prospects[0];
                    i = p - k + 1;
                }
                else
                {
                    // No, start from beginning of searched sequence
                    i = 0;
                }
            }
            p++;
        }
        // No match
        return -1;
    }
}

I didn't fully test it, so it might still contain bugs. I just did a few tests on well-known corner cases to make sure I wasn't falling into obvious traps. Seems to work fine so far...

I think the complexity is close to O(n), but I'm not an expert of Big O notation so I could be wrong... at least it only enumerates the source sequence once, whithout ever going back, so it should be reasonably efficient.

Thomas Levesque
  • 286,951
  • 70
  • 623
  • 758
2

I understand this is an old question, but I needed this exact method and I wrote it up like so:

public static int ContainsSubsequence<T>(this IEnumerable<T> elements, IEnumerable<T> subSequence) where T: IEquatable<T>
{
    return ContainsSubsequence(elements, 0, subSequence);
}

private static int ContainsSubsequence<T>(IEnumerable<T> elements, int index, IEnumerable<T> subSequence) where T: IEquatable<T>
{
    // Do we have any elements left?
    bool elementsLeft = elements.Any();

    // Do we have any of the sub-sequence left?
    bool sequenceLeft = subSequence.Any();

    // No elements but sub-sequence not fully matched
    if (!elementsLeft && sequenceLeft)
        return -1; // Nope, didn't match

    // No elements of sub-sequence, which means even if there are
    // more elements, we matched the sub-sequence fully
    if (!sequenceLeft)
        return index - subSequence.Count(); // Matched!

    // If we didn't reach a terminal condition,
    // check the first element of the sub-sequence against the first element
    if (subSequence.First().Equals(e.First()))
        // Yes, it matched - move onto the next. Consume (skip) one element in each
        return ContainsSubsequence(elements.Skip(1), index + 1 subSequence.Skip(1));
    else
        // No, it didn't match. Try the next element, without consuming an element
        // from the sub-sequence
        return ContainsSubsequence(elements.Skip(1), index + 1, subSequence);
}

Updated to not just return if the sub-sequence matched, but where it started in the original sequence.

This is an extension method on IEnumerable, fully lazy, terminates early and is far more linq-ified than the currently up-voted answer. Bewarned, however (as @wai-ha-lee points out) it is recursive and creates a lot of enumerators. Use it where applicable (performance/memory). This was fine for my needs, but YMMV.

Ani
  • 10,826
  • 3
  • 27
  • 46
  • It *is* recursive, though - aren't you creating a **lot** of iterators to do this? – Wai Ha Lee Nov 16 '15 at 08:53
  • No problem. I was going to suggest manipulating an `IEnumerator` directly when I saw you using `Any()` and `Skip(...)` on the same sequence but then saw that the question asked for a LINQ answer. – Wai Ha Lee Nov 16 '15 at 09:15
2

The code you say you want to be able to use isn't LINQ, so I don't see why it need be implemented with LINQ.

This is essentially the same problem as substring searching (indeed, an enumeration where order is significant is a generalisation of "string").

Since computer science has considered this problem frequently for a long time, so you get to stand on the shoulders of giants.

Some reasonable starting points are:

http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

http://en.wikipedia.org/wiki/Rabin-karp

Even just the pseudocode in the wikipedia articles is enough to port to C# quite easily. Look at the descriptions of performance in different cases and decide which cases are most likely to be encountered by your code.

Jon Hanna
  • 110,372
  • 10
  • 146
  • 251
1

You can use this library called Sequences to do that (disclaimer: I'm the author).

It has a IndexOfSlice method that does exactly what you need - it's an implementation of the Knuth-Morris-Pratt algorithm.

int startIndex = largeSequence.AsSequence().IndexOfSlice(subSequence);
dcastro
  • 66,540
  • 21
  • 145
  • 155
0

UPDATE: Given the clarification of the question my response below isn't as applicable. Leaving it for historical purposes.

You probably want to use mySequence.Where(). Then the key is to optimize the predicate to work well in your environment. This can vary quite a bit depending on your requirements and typical usage patterns.

It is quite possible that what works well for small collections doesn't scale well for much larger collections depending on what type T is.

Of course, if the 90% use is for small collections then optimizing for the outlier large collection seems a bit YAGNI.

Brad Cunningham
  • 6,402
  • 1
  • 32
  • 39
  • I would like to see how you would use where, which take only a single element to check for a sequence match. Can you provide an example? – trampster Aug 24 '10 at 23:42
  • You're right, after re-reading your question and seeing the update my response isn't applicable. – Brad Cunningham Aug 24 '10 at 23:54