Some quick background: I've written some code which generates combinations of items based on the excellent (and seemingly highly efficient) `Combinations` library from *Permutations, Combinations, and Variations using C# Generics*.
If you were to write out what I'm trying to do in its simplest form, it would be:
```csharp
var items = dataService.GetAllItems();
PreFilters.ForEach(filter => items = filter.ApplyFilter(items));

var combinations = from item1 in items
                   from item2 in items
                   from item3 in items
                   from item4 in items
                   select new SpecialContainer(new List<Item> { item1, item2, item3, item4 });

combinations = combinations.Distinct().ToList();
PostFilters.ForEach(filter => combinations = filter.ApplyFilter(combinations));
```
...where `SpecialContainer` can calculate a unique key for the combination based on an `Id` field in `Item` (it literally sorts by `Id` and concatenates the values into a string of `Id1:Id2:Id3:Id4`).
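For context, here is a minimal sketch of the shape of `SpecialContainer`; the `Equals`/`GetHashCode` overrides are my assumption about how `Distinct()` picks up the key, and the real class may differ:

```csharp
using System.Collections.Generic;
using System.Linq;

// Minimal sketch, assuming Item exposes an int Id. Equality is based on
// the sorted-Id key so that Distinct() treats any ordering of the same
// items as the same combination.
public class SpecialContainer
{
    public List<Item> Items { get; private set; }
    public string Key { get; private set; }

    public SpecialContainer(IEnumerable<Item> items)
    {
        Items = items.ToList();
        // Sort by Id and concatenate into "Id1:Id2:Id3:Id4".
        Key = string.Join(":", Items.Select(i => i.Id).OrderBy(id => id));
    }

    public override bool Equals(object obj)
    {
        var other = obj as SpecialContainer;
        return other != null && Key == other.Key;
    }

    public override int GetHashCode()
    {
        return Key.GetHashCode();
    }
}
```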
While the performance for combinations of 4 is not bad (280k items, 240ms actual), once we hit combinations of 8 (896 million items, 16 minutes projected), things slow down a bit. I'd ideally also like to do combinations of 12, but since the number of combinations generated grows combinatorially with combination size (witness the massive jump from 4 to 8), going from 8 to 12 may mean years of computation when done sequentially.
I am filtering these combinations based on other criteria, which helps reduce the complexity/number of combinations both before the combination process (items disqualified based on known criteria beforehand) and after the entire list has been enumerated, as shown in the code sample.
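For reference, the pre- and post-filters follow roughly this shape (the interface names here are my own shorthand, not my actual class names):

```csharp
using System.Collections.Generic;

// Illustrative filter contracts; the real Filter classes may differ.
public interface IPreFilter
{
    // Removes Items disqualified by known criteria before combining.
    IEnumerable<Item> ApplyFilter(IEnumerable<Item> items);
}

public interface IPostFilter
{
    // Removes irrelevant combinations after they have been generated.
    IEnumerable<SpecialContainer> ApplyFilter(IEnumerable<SpecialContainer> combinations);
}
```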
So my questions:

- The combinations library I'm using exposes itself as an `IEnumerable<T>` and yields the combinations only on enumeration. Is there any way I can take advantage of `AsParallel` processing while enumerating and discard irrelevant, non-unique combinations using the `Filter` classes as shown above? (See the sketch after this list for the kind of thing I have in mind.)
- I have been using `ToList()` internally in the `Filter` classes to convert the `IEnumerable<T>` to `List<T>` at each stage - should I rather leave them as `IEnumerable<T>` until the very end?
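To illustrate the first question, this is roughly the shape I'm imagining, assuming the source can be enumerated in parallel at all (`combinationsSource` stands in for the library's `IEnumerable<T>`):

```csharp
// Hypothetical: parallelise the enumeration, project each combination into
// a SpecialContainer, and drop duplicates as they stream through instead of
// materialising the full list first.
var candidates = combinationsSource
    .AsParallel()
    .Select(x => new SpecialContainer(x))
    .Distinct(); // relies on SpecialContainer's key-based equality
```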
Update:
The actual call that creates the issue is as follows:
```csharp
var combinator = new Combinations<Item>(items, 8, GenerateOption.WithRepetition);
var combinations = combinator.Select(x => new SpecialContainer(x)).ToList();
```
The `ToList()` in the above code is the call that takes the time: the `Select` is deferred, so all of the work happens only when `ToList()` actually enumerates the sequence.
Update 2:
Last night I ensured that none of the `PostFilter` items called `ToList()` in their `ApplyFilter` implementations, and neither did the `combinations.Distinct()` call, so it's "pure" `IEnumerable<T>` everywhere up until the final call just before I bind it to a UI. I also added an extra `AsParallel()` before the set is processed.
So the new code looks like:
```csharp
var combinations = combinator.Select(x => new SpecialContainer(x));
combinations = combinations.Distinct().AsParallel();
PostFilters.ForEach(filter => combinations = filter.ApplyFilter(combinations));
var finalList = combinations.ToList();
```
The reasoning behind this was to allow the resulting `IEnumerable<T>` operation stack to potentially be optimised. I kicked it off at 10pm last night, and while the work was definitely being performed on separate threads, the whole process had not completed by 6am this morning (whereas the previous sequential run took approx 16 minutes).
However, after reading some further articles on LINQ performance and on how PLINQ can affect it, I think the next step in optimising this in a parallel programming pattern is to change my filters to run against single objects and return a `bool`, so my code will end up like:
```csharp
var combinations = combinator.Select(x => new SpecialContainer(x));
combinations = combinations.Distinct().AsParallel();
combinations = combinations.Where(x => PostFilters.All(y => y.ApplyFilter(x)));
var finalList = combinations.ToList();
```
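For that to work, the post-filter contract would change from sequence-to-sequence to a simple per-item predicate, something like:

```csharp
// Sketch of the per-item post-filter shape (names assumed, as above).
// Each filter tests a single combination in isolation, so the Where can
// run inside the parallel pipeline without intermediate materialisations.
public interface IPostFilter
{
    bool ApplyFilter(SpecialContainer combination);
}
```

Keeping each filter evaluation independent of the rest of the sequence should also give PLINQ the freedom to partition the work across threads.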