Some quick background: I've written some code which generates combinations of items based on the excellent (and seemingly highly efficient) `Combinations` library from *Permutations, Combinations, and Variations using C# Generics*.
If you were to write out what I'm trying to do in its simplest form, it would be:
```csharp
var items = dataService.GetAllItems();
PreFilters.ForEach(filter => items = filter.ApplyFilter(items));

var combinations = from item1 in items
                   from item2 in items
                   from item3 in items
                   from item4 in items
                   select new SpecialContainer(new List<Item> { item1, item2, item3, item4 });

combinations = combinations.Distinct().ToList();
PostFilters.ForEach(filter => combinations = filter.ApplyFilter(combinations));
```
...where `SpecialContainer` can calculate a unique key for the combination based on an `Id` field in `Item` (it literally sorts by `Id` and concatenates the values into a string of `Id1:Id2:Id3:Id4`).
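For context, here is a minimal sketch of the shape of `SpecialContainer`; the `Equals`/`GetHashCode` overrides are my assumption about how `Distinct()` picks up the key, and the real class may differ:

```csharp
using System.Collections.Generic;
using System.Linq;

// Minimal sketch, assuming Item exposes an int Id. Equality is based on
// the sorted-Id key so that Distinct() treats any ordering of the same
// items as the same combination.
public class SpecialContainer
{
    public List<Item> Items { get; private set; }
    public string Key { get; private set; }

    public SpecialContainer(IEnumerable<Item> items)
    {
        Items = items.ToList();
        // Sort by Id and concatenate into "Id1:Id2:Id3:Id4".
        Key = string.Join(":", Items.Select(i => i.Id).OrderBy(id => id));
    }

    public override bool Equals(object obj)
    {
        var other = obj as SpecialContainer;
        return other != null && Key == other.Key;
    }

    public override int GetHashCode()
    {
        return Key.GetHashCode();
    }
}
```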
While the performance for combinations of 4 is not bad (280k items, 240ms actual), once we hit combinations of 8 (896 million items, 16 minutes projected), things slow down a bit. I'd ideally also like to do combinations of 12, but since the number of combinations generated grows combinatorially with combination size (witness the massive jump from 4 to 8), going from 8 to 12 may mean years of computation when done sequentially.
I am filtering these combinations based on other criteria, which helps reduce the complexity/number of combinations both before the combination process (items disqualified based on known criteria beforehand) and after the entire list has been enumerated, as shown in the code sample.
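For reference, the pre- and post-filters follow roughly this shape (the interface names here are my own shorthand, not my actual class names):

```csharp
using System.Collections.Generic;

// Illustrative filter contracts; the real Filter classes may differ.
public interface IPreFilter
{
    // Removes Items disqualified by known criteria before combining.
    IEnumerable<Item> ApplyFilter(IEnumerable<Item> items);
}

public interface IPostFilter
{
    // Removes irrelevant combinations after they have been generated.
    IEnumerable<SpecialContainer> ApplyFilter(IEnumerable<SpecialContainer> combinations);
}
```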
So my questions:

- The combinations library I'm using exposes itself as an `IEnumerable<T>` and yields the combinations only on enumeration. Is there any way I can take advantage of `AsParallel` processing while enumerating and discard irrelevant, non-unique combinations using the `Filter` classes as shown above? (See the sketch after this list for the kind of thing I have in mind.)
- I have been using `ToList()` internally in the `Filter` classes to convert the `IEnumerable<T>` to `List<T>` at each stage - should I rather leave them as `IEnumerable<T>` until the very end?
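To illustrate the first question, this is roughly the shape I'm imagining, assuming the source can be enumerated in parallel at all (`combinationsSource` stands in for the library's `IEnumerable<T>`):

```csharp
// Hypothetical: parallelise the enumeration, project each combination into
// a SpecialContainer, and drop duplicates as they stream through instead of
// materialising the full list first.
var candidates = combinationsSource
    .AsParallel()
    .Select(x => new SpecialContainer(x))
    .Distinct(); // relies on SpecialContainer's key-based equality
```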
Update:
The actual call that creates the issue is as follows:
```csharp
var combinator = new Combinations<Item>(items, 8, GenerateOption.WithRepetition);
var combinations = combinator.Select(x => new SpecialContainer(x)).ToList();
```
The `ToList()` in the above code is the call that takes the time: the `Select` is deferred, so all of the work happens only when `ToList()` actually enumerates the sequence.
Update 2:
Last night I ensured that none of the `PostFilter` items called `ToList()` in their `ApplyFilter` implementations, and neither did the `combinations.Distinct()` call, so it's "pure" `IEnumerable<T>` everywhere up until the final call just before I bind it to a UI. I also added an extra `AsParallel()` before the set is processed.
So the new code looks like:
```csharp
var combinations = combinator.Select(x => new SpecialContainer(x));
combinations = combinations.Distinct().AsParallel();
PostFilters.ForEach(filter => combinations = filter.ApplyFilter(combinations));
var finalList = combinations.ToList();
```
The reasoning behind this was to allow the resulting `IEnumerable<T>` operation stack to potentially be optimised. I kicked it off at 10pm last night, and while the work was definitely being performed on separate threads, the whole process had not completed by 6am this morning (whereas the previous sequential run took approx 16 minutes).
However, after reading some further articles on LINQ performance and on how PLINQ can affect it, I think the next step in optimising this in a parallel programming pattern is to change my filters to run against single objects and return a `bool`, so my code will end up like:
```csharp
var combinations = combinator.Select(x => new SpecialContainer(x));
combinations = combinations.Distinct().AsParallel();
combinations = combinations.Where(x => PostFilters.All(y => y.ApplyFilter(x)));
var finalList = combinations.ToList();
```
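For that to work, the post-filter contract would change from sequence-to-sequence to a simple per-item predicate, something like:

```csharp
// Sketch of the per-item post-filter shape (names assumed, as above).
// Each filter tests a single combination in isolation, so the Where can
// run inside the parallel pipeline without intermediate materialisations.
public interface IPostFilter
{
    bool ApplyFilter(SpecialContainer combination);
}
```

Keeping each filter evaluation independent of the rest of the sequence should also give PLINQ the freedom to partition the work across threads.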