13

I wrote a LINQ extension method SplitBetween analogous to String.Split.

> new List<int>(){3,4,2,21,3,2,17,16,1}
> .SplitBetween(x=>x>=10)

[3,4,2], [3,2], [], [1]

Source:

// partition sequence into sequence of contiguous subsequences
// behaves like String.Split
public static IEnumerable<IEnumerable<T>> SplitBetween<T>(this IEnumerable<T> source, 
                                                          Func<T, bool> separatorSelector, 
                                                          bool includeSeparator = false)
{
    var l = new List<T>();
    foreach (var x in source)
    {
        if (separatorSelector(x))
        {
            if (includeSeparator)
            {
                l.Add(x);
            }
            yield return l;
            l = new List<T>();
        }
        else
        {
            l.Add(x);
        }
    }
    yield return l;
}

In the spirit of LINQ I think this method ought to do lazy evaluation. However, my implementation does lazy evaluation over the outer IEnumerable, but not over the inner IEnumerable. How can I fix this?

A demonstration of how the outer behaviour is lazy. Assume ThrowingEnumerable<int> is an IEnumerable<int> that explodes when anyone tries to iterate over it (see Skeet's Edulinq).

(new List<int>(){1,2,3,10,1})
.Concat(Extensions.ThrowingEnumerable<int>())
.SplitBetween(x=>x>=10)
.First().ToList();

[1,2,3]

but the inner behaviour isn't lazy

(new List<int>(){1,2,3,10,1})
.Concat(Extensions.ThrowingEnumerable<int>())
.SplitBetween(x=>x>=10)
.ElementAt(2).First();

BOOM

We expect 1 here.

nawfal
  • 70,104
  • 56
  • 326
  • 368
Colonel Panic
  • 132,665
  • 89
  • 401
  • 465
  • 3
    To get the C# compiler to do it for you with `yield`, it seems to me that you are definitely going to need there to be **two** methods involved. – AakashM Jul 12 '12 at 13:29
  • See specifically [Sam Saffron's answer](http://stackoverflow.com/a/10425490/50776) or [my answer](http://stackoverflow.com/a/419058/50776) on how to achieve this without using a list in between. Sam's is better for what you want when applying more operations (such as `ElementAt`) on the result. – casperOne Jul 12 '12 at 13:30
  • Ok, so you say it can be done, if one wades deep into the enumerators. – Colonel Panic Jul 12 '12 at 13:40
  • 1
    Is this an exact duplicate? The 'possible duplicate' linked asks about Lists and never mentions lazy evaluation. – Colonel Panic Jul 12 '12 at 13:41
  • 3
    If your list was {2, 3, throw, 5, 10, 2, 3} and you wanted the second list, you'll still get the exception - even with laziness. The throwing item must be enumerated to reach the next group. – Amy B Jul 12 '12 at 18:51
  • 1
    Does lazy evaluation even make sense for the inner sequences? Part of lazy evaluation is that the same enumerator will produce a different sequence if the underlying collection changes. Suppose you enumerate the outer sequence and store that in an array, then change the underlying collection, and then try to enumerate the inner sequences you stored. Lazy evaluation fails here. Should they operate on the new underlying collection? The new sequence might have a different number of subsequences. Should they operate on the old data? If the evaluation is actually lazy, that data is gone. – Henry Merriam Feb 18 '13 at 00:57

4 Answers4

3

Edit: There is nothing wrong with your approach, except that a throwing enumerable will really "boom" when you enumerate it. Thats what's its meant for. It doesn't have a proper GetEnumerator defined on it. So your code exhibits no real problem. In the first case by doing First, you're only enumerating till the first result set (just { 1, 2, 3 } ) is obtained and not enumerating the throwing enumerable (which means Concat is not being executed). But in the second example, you're asking for element at 2 after the split, which means it will enumerate the throwing enumerable too and will go "boom". The key here is to understand ElementAt enumerates the collection till the index asked to and is not inherently lazy (it cant be).

I'm not sure if fully lazy is the way to go here. The problem is that the whole process of splitting lazily into outer and inner sequences runs on one enumerator which can yield different results depending on enumerator state. For instance you enumerate only the outer sequence, the inner sequences no longer will be what you expect. Or if you enumerate only half the outer sequence and one inner sequence, what will be the state of other inner sequences? Your approach is the best.

The below approach is lazy (still will boom since that's warranted) in that it uses no intermediate concrete implementations, but can be slower than your original approach because it traverses the list more than once:

public static IEnumerable<IEnumerable<T>> SplitBy<T>(this IEnumerable<T> source, 
                                                     Func<T, bool> separatorPredicate, 
                                                     bool includeEmptyEntries = false,
                                                     bool includeSeparators = false)
{
    int prevIndex = 0;
    int lastIndex = 0;
    var query = source.Select((t, index) => { lastIndex = index; return new { t, index }; })
                      .Where(a => separatorPredicate(a.t));
    foreach (var item in query)
    {
        if (item.index == prevIndex && !includeEmptyEntries)
        {
            prevIndex++;
            continue;
        }

        yield return source.Skip(prevIndex)
                           .Take(item.index - prevIndex + (!includeSeparators ? 0 : 1));
        prevIndex = item.index + 1;
    }

    if (prevIndex <= lastIndex)
        yield return source.Skip(prevIndex);
}

Over all your original approach is the best. If you need something fully lazy, then my below answer fits. Mind you its only meant for things like:

foreach (var inners in outer)
    foreach (var item in inners)
    { 
    }

and not

var outer = sequence.Split;
var inner1 = outer.First;
var inner2 = outer.ElementAt; //etc

In other words, not fit for multiple iterations on the same inner sequence. If you are fully aware of this dangerous construct:


Original answer:

This uses no intermediate concrete collections, no ToList on source enumerable, and is fully lazy/iterator-ish:

public static IEnumerable<IEnumerable<T>> SplitBy<T>(this IEnumerable<T> source,
                                                     Func<T, bool> separatorPredicate,
                                                     bool includeEmptyEntries = false,
                                                     bool includeSeparator = false)
{
    using (var x = source.GetEnumerator())
        while (x.MoveNext())
            if (!separatorPredicate(x.Current))
                yield return x.YieldTill(separatorPredicate, includeSeparator);
            else if (includeEmptyEntries)
            {
                if (includeSeparator)
                    yield return Enumerable.Repeat(x.Current, 1);
                else
                    yield return Enumerable.Empty<T>();
            }
}

static IEnumerable<T> YieldTill<T>(this IEnumerator<T> x, 
                                   Func<T, bool> separatorPredicate,
                                   bool includeSeparator)
{
    yield return x.Current;

    while (x.MoveNext())
        if (!separatorPredicate(x.Current))
            yield return x.Current;
        else
        {
            if (includeSeparator)
                yield return x.Current;
            yield break;
        }
}

Short, sweet and simple. I have added an additional flag to denote if you want to return empty sets (by default it ignores). Without that flag, the code is even more concise.

Thanks for this question, this will be there in my extension methods library! :)

Community
  • 1
  • 1
nawfal
  • 70,104
  • 56
  • 326
  • 368
  • 3
    I have done this myself this way. It works. It is very dangerous though because the caller cannot ever enumerate the inner sequence twice. If he will consume the next group without the bug being noticed. Very dangerous construct. (+1) – usr Feb 17 '13 at 20:19
  • 1
    @usr indeed. The best approach is to `ToList` on inner sequences right away (in other words, OP's existing approach itself)..i'll edit my answer.. – nawfal Feb 17 '13 at 20:48
  • 2
    With regards to @usr 's comment returning `IEnumerable>` instead of `IEnumerable>` would explicitly indicate that the user cannot enumerate the inner parts multiple times. – porges Feb 18 '13 at 02:44
  • 1
    I have an overload that has a "Lazy" suffix in its name. It returns an IE. The default overload returns an IE>. – usr Feb 18 '13 at 08:26
  • @usr that's exactly what I have in my library too! – nawfal Feb 18 '13 at 08:27
1

Here is a solution that I suppose does what you ask for.

The problem was that you only had one method with yield and you were manually creating the internal collection while the outer IEnumerable was enumerated. The second problem was that your way of "testing" fails even on mine code below. However, as David B pointed out in his comment, you must go through all the elements to define number of elements of outer IEnumerable. But you can defer creation and population of inner IEnumerables.

public static IEnumerable<IEnumerable<T>> SplitBetween<T>(this IEnumerable<T> source,Func<T,bool> separatorSelector, bool includeSeparators=false)
{
    IList<T> sourceList = source.ToList();
    var indexStart = 0;
    var indexOfLastElement = sourceList.Count - 1;
    for(int i = 0; i <= indexOfLastElement; i++)
        if (separatorSelector(sourceList[i]))
        {
            if(includeSeparators)
                yield return SplitBetweenInner(sourceList, indexStart, i);
            else
                yield return SplitBetweenInner(sourceList, indexStart, i - 1);

            indexStart = i + 1;
        }
        else if(i == indexOfLastElement)
            yield return SplitBetweenInner(sourceList, indexStart, i);        
}

private static IEnumerable<T> SplitBetweenInner<T>(IList<T> source, int startIndex, int endIndex)
{
    //throw new Exception("BOOM");
    for(int i = startIndex; i <= endIndex; i++)
        yield return source[i];
}

Note that it behaves slightly different as your code (it doesn't create another empty List when the last element satisfies the separator condition - it's up to definition what is correct here, but I find that better as the behavior is the same as if the element appears at the beginning of source list)

If you test the code, you will see that the inner IEnumerable execution is deferred.

If the throw exception line is uncommented:

(new List<int>(){3,4,2,21,3,2,17,16,1}).SplitBetween(x=>x>=10, true).Count();

returns 4

(new List<int>(){3,4,2,21,3,2,17,16,1}).SplitBetween(x=>x>=10, true).First().Count();

throws BOOM

doblak
  • 3,036
  • 1
  • 27
  • 22
  • 2
    time complexity - O(n^2 + 2n), because ElementAt time complexity on each call is O(n/2) and Count time complexity is O(n) – Serj-Tm Jul 13 '12 at 01:22
  • @DarkGray - That is incorrect. It totally depends on execution time type. The extension method accepts `IEnumerable` source. If you pass `List` as source parameter, it will just use `[]` for `ElementAt(index)` and `Count` property on a list. – doblak Jul 13 '12 at 01:28
  • 2
    worst case when source parameter is not IList and time complexity for worst case is O(n^2 + 2n) – Serj-Tm Jul 13 '12 at 01:34
  • 1
    I understand your conserns now. My intention was primarily to demonstrate yield behavior. I edited the answer and the O notation should not be a problem any more. Only a constant factor penalty can occur now (in cases when ToList() actually creates another instance) – doblak Jul 13 '12 at 01:46
1
  public static IEnumerable<IEnumerable<T>> SplitBetween<T>(this IEnumerable<T> source, Func<T, bool> separatorSelector, bool includeSeparators = false)
  {
    var state = new SharedState<T>(source, separatorSelector, includeSeparators);
    state.LastList = state.NewList  = new InnerList<T>(state, 0);
    for (; ; )
    {
      if (state.NewList != null)
      {
        var newList = state.NewList;
        state.NewList = null;
        yield return newList.Items();            
      }
      else if (state.IsEnd)
        break;
      else
        state.CheckNext();
    }
  }
  class SharedState<T>
  {
    public SharedState(IEnumerable<T> source, Func<T, bool> separatorSelector, bool includeSeparators)
    {
      this.source = source;
      this.separatorSelector = separatorSelector;
      this.includeSeparators = includeSeparators;

      this.iterator = source.GetEnumerator();

      this.data = source as IList<T>;
      if (data == null)
      {
        cache = new List<T>();
        data = cache;
      }
    }
    public readonly IEnumerable<T> source;
    readonly IEnumerator<T> iterator; 
    public readonly IList<T> data;
    readonly List<T> cache;
    public readonly Func<T, bool> separatorSelector;
    public readonly bool includeSeparators;
    public int WaveIndex = -1;
    public bool IsEnd = false;
    public InnerList<T> NewList;
    public InnerList<T> LastList;

    public void CheckNext()
    {
      WaveIndex++;
      if (!iterator.MoveNext())
      {
        if (LastList.LastIndex == null)
          LastList.LastIndex = WaveIndex;
        IsEnd = true;
      }
      else
      {
        var item = iterator.Current;
        if (cache != null)
          cache.Add(item);
        if (separatorSelector(item))
        {
          LastList.LastIndex = includeSeparators ? WaveIndex + 1 : WaveIndex;
          LastList = NewList = new InnerList<T>(this, WaveIndex + 1);
        }
      }
    }
  }

  class InnerList<T>
  {
    public InnerList(SharedState<T> state, int startIndex)
    {
      this.state = state;
      this.StartIndex = startIndex;
    }
    readonly SharedState<T> state;
    public readonly int StartIndex;
    public int? LastIndex;

    public IEnumerable<T> Items()
    {
      for (var i = StartIndex; ; ++i)
      {
        if (LastIndex != null && i >= LastIndex)
          break;
        if (i >= state.WaveIndex)
          state.CheckNext();
        if (LastIndex == null || i < LastIndex)
          yield return state.data[i];
      }
    }
  }
Serj-Tm
  • 16,581
  • 4
  • 54
  • 61
1

This one won't use List<>, and won't go BOOM.

public static IEnumerable<IEnumerable<T>> SplitBetween<T>(this IEnumerable<T> source,
                                                          Func<T,bool> separatorSelector, 
                                                          bool includeSeparators=false) 
{
    if (source == null)
        throw new ArgumentNullException("source");

    return SplitBetweenImpl(source, separatorSelector, includeSeparators);
}

private static IEnumerable<T> SplitBetweenInner<T>(IEnumerator<T> e,
                                                   Func<T,bool> separatorSelector)
{
    var first = true;

    while(first || e.MoveNext())
    {
        if (separatorSelector((T)e.Current))
            yield break;    

        first = false;
        yield return e.Current;
    }
}

private static IEnumerable<IEnumerable<T>> SplitBetweenImpl<T>(this IEnumerable<T> source,
                                                               Func<T,bool> separatorSelector, 
                                                               bool includeSeparators) 
{
    using (var e = source.GetEnumerator())
        while(e.MoveNext())
        {
            if (separatorSelector((T)e.Current) && includeSeparators)
                yield return new T[] {(T)e.Current};
            else
                {
                yield return SplitBetweenInner(e, separatorSelector);
                if (separatorSelector((T)e.Current) && includeSeparators)
                    yield return new T[] {(T)e.Current};
                }
        }
}

Test:

void Main()
{
    var list = new List<int>(){1, 2, 3, 10, 1};
    foreach(var col in list.Concat(Ext.ThrowingEnumerable<int>())
                           .SplitBetween<int>(x=>x>=10).Take(1))
    {
        Console.WriteLine("------");
        foreach(var i in col)
            Console.WriteLine(i);
    }
}

Output:

------
1
2
3

Test2

var list = new List<int>(){1, 2, 3, 10, 1}
foreach(var col in list.Concat(Ext.ThrowingEnumerable<int>())
                       .SplitBetween<int>(x=>x>=10).Take(2))

Output:

------
1
2
3
------
1
*Exception*

Here, the exception is caused because the first element of the ThrowingEnumerable-enumeration would go into the same group as the 1.


Test3:

var list = new List<int>(){1, 2, 3, 10, 1, 17};
foreach(var col in list.Concat(Ext.ThrowingEnumerable<int>())
                       .SplitBetween<int>(x=>x>=10, true).Take(4))

Output:

------
1
2
3
------
10
------
1
------
17

No problem here, because the Exception element would go into it's own group, and thus is not iterated over due to Take(4):

sloth
  • 99,095
  • 21
  • 171
  • 219
  • when `includeSeparator = true`, your behaviour is different from OP's original code's. For `.SplitBetween(x=> x >= 10, true)`, output should be `1, 2, 3, 10` i believe.. – nawfal Feb 17 '13 at 20:12