6

Let's say I have the following generic combination generator static method:

public static IEnumerable<IEnumerable<T>> GetAllPossibleCombos<T>(
    IEnumerable<IEnumerable<T>> items)
{
    IEnumerable<IEnumerable<T>> combos = new[] {new T[0]};

    foreach (var inner in items)
        combos = combos.SelectMany(c => inner, (c, i) => c.Append(i));

     return combos;
}

Perhaps I am not understanding this correctly, but doesn't this build the entire combos list in RAM? If there are a large number of items the method might cause the computer to run out of RAM.

Is there a way to re-write the method to use a yield return on each combo, instead of returning the entire combos set?

Eric Lippert
  • 647,829
  • 179
  • 1,238
  • 2,067
sapbucket
  • 6,795
  • 15
  • 57
  • 94

2 Answers2

10

There are some misconceptions in your question, which is awesome because now you have an opportunity to learn facts rather than myths.


First off, the method you are implementing is usually called CartesianProduct, not GetAllPossibleCombos, so consider renaming it.


Perhaps I am not understanding this correctly

You are not understanding it correctly.

doesn't this build the entire combos list in RAM?

No. A query builder builds a query, not the results of executing the query. When you do a SelectMany, what you get is an object that will do the selection in the future. You don't get the results of that selection.

If there are a large number of items the method might cause the computer to run out of RAM.

Today would be a good day to stop thinking of memory and RAM as the same thing. When a process runs out of memory, it does not run out of RAM. It runs out of address space, which is not RAM. The better way to think about memory is: memory is on-disk page file, and RAM is special hardware that makes your page file faster. When you run out of RAM, your machine might get unacceptably slow, but you don't run out of memory until you run out of address space. Remember, process memory is virtualized.

Now, there may be scenarios in which executing this code is inefficient because enumerating the query runs out of stack. And there may be scenarios in which execution becomes inefficient because you're moving n items up a stack n deep. I suggest that you to do a deeper analysis of your code and see if that is the case, and report back.


Is there a way to re-write the method to use a yield return on each combo, instead of returning the entire combos set?

SelectMany is implemented as a yield return in a foreach loop, so you've already implemented it as a yield return on each combo; you've just hidden the yield return inside a call to SelectMany.

That is, SelectMany<A, B, C>(IE<A> items, Func<A, IE<B>> f, Func<A, B, C> g) is implemented as something like:

foreach(A a in items)
  foreach(B b in f(a))
    yield return g(a, b);

So you've already done it in yield return.

If you want to write a method that directly does a yield return that's a little harder; the easiest way to do that is to form an array of enumerators on each child sequence, then make a vector from each Current of the enumerators, yield return the vector, and then advance the correct iterator one step. Keep on doing that until there is no longer a correct iterator to advance.

As you can probably tell from that description, the bookkeeping gets messy. It is doable, but it's not very pleasant code to write. Give it a try though! The nice thing about that solution is that you are guaranteed to have good performance because you're not consuming any stack.

UPDATE: This related question has an answer posted that does an iterative algorithm, but I have not reviewed it to see if it is correct. https://stackoverflow.com/a/57683769/88656


Finally, I encourage you to compare your implementation to mine:

https://ericlippert.com/2010/06/28/computing-a-cartesian-product-with-linq/

Is my implementation in any way fundamentally different than yours, or are we doing the same thing, just using slightly different syntax? Give that some thought.

Also I encourage you to read Ian Griffiths' excellent six-part series on an analysis of various implementations of this function:

http://www.interact-sw.co.uk/iangblog/2010/07/28/linq-cartesian-1

Eric Lippert
  • 647,829
  • 179
  • 1,238
  • 2,067
  • Thank you for the valuable lesson. I see now that this question is a product of my misunderstanding. – sapbucket Aug 27 '19 at 17:47
  • @sapbucket: You are welcome. You are very right to question the performance characteristics of your implementation, just not for the reason you initially thought. There are possible performance problems here, and it is possible to make a better-performing implementation by doing all the bookkeeping yourself, but it is not pleasant. Also, believing that *making a query computes the results* is the most common misunderstanding about LINQ; I am always glad to disabuse people of that bad idea! – Eric Lippert Aug 27 '19 at 17:48
0

SelectMany and other Linq methods return an IEnumerable, which is lazily evaluated only when enumerating the collection. This can be in the form of a ToList() or ToArray() call or iterating over it in a foreach loop. When you see the message in the debugger warning that expanding a collection will enumerate the enumerable, this is the behavior it is warning you about. The collection hasn't been enumerated yet - a Linq query only builds up a chain of calls that tells it how to enumerate the data.

So, your concern about RAM usage isn't necessarily accurate (depending on the concrete type of the starting IEnumerable). Even if you call ToList() or ToArray() and store a reference to that in a variable, if the collection elements are reference types then it won't be a copy either.

In your example, yield return gives you convenience if you want to lazily build a collection of elements without storing it in a separate collection (e.g. a return list or array, which requires additional copying). I don't think it applies to what you're trying to do, since SelectMany has this behavior already.

If you want to try it out, Linq makes it pretty easy to generate large lists with Enumerable.Repeat

// Define a collection with 10000000 items (items not created yet)
var manyItems = Enumerable.Repeat(123, 10000000);

// Enumerate the enumerable via ToList: creates the int 10000000 times
var manyItemsConcrete = manyItems.ToList();

// same deal with reference types
var manyReferenceTypes = Enumerable.Repeate(new object(), 10000000);
var manyReferenceTypesConcrete = manyReferenceTypes.ToList();

// This list already exists in RAM taking up space
var list = new List<object> { new object(), new object() /* ... x10000000 */ }
// This defines a transform on list, but doesn't take up RAM
var enumerable = list.Select(x => x.ToString());

// Now, there are two lists taking up RAM
var newList = enumerable.ToList();

E. Moffat
  • 3,165
  • 1
  • 21
  • 34