4

I'd like to know if there's a more efficient way to get an ordered list of groups by value from an initially unordered list, than using GroupBy() followed by OrderBy(), like this:

List<int> list = new List<int>();
IEnumerable<IEnumerable<int>> orderedGroups = list.GroupBy(x => x).OrderBy(x => x.Key);

For more detail, I have a large List<T> which I'd like to sort, however there are lots of duplicate values so I want to return the results as IEnumerable<IEnumerable<T>>, much as GroupBy() returns an IEnumerable of groups. If I use OrderBy(), I just get IEnumerable<T>, with no easy way to know whether the value has changed from one item to the next. I could group the list then sort the groups, but the list is large so this ends up being slow. Since OrderBy() returns an OrderedEnumerable which can then be sorted on a secondary field using ThenBy(), it must internally distinguish between adjacent items with the same or different values.

Is there any way I can make use of the fact that OrderedEnumerable<T> must internally group its results by value (in order to facilitate ThenBy()), or otherwise what's the most efficient way to use LINQ to get an ordered list of groups?

Jon G
  • 4,083
  • 22
  • 27
  • 6
    Posting your code would help. – Yuval Itzchakov May 14 '15 at 07:18
  • Added an example implementation which hopefully can be improved upon – Jon G May 14 '15 at 07:25
  • 5
    Can you use .Distinct() ? – Mairaj Ahmad May 14 '15 at 07:27
  • Can you provide some sample data and the desired result? – Ham3d May 14 '15 at 07:30
  • Mairaj, Distinct() is really just a special case of GroupBy(), I'm not sure how it would help here – Jon G May 14 '15 at 07:33
  • Delay incuring the sort penalty until you need it, just group it first, will not work? have you try sort it first then group it and see if it is faster than grouping then sorting? `OrderBy` and `ThenBy` will just basically create a field which concatenate the value in order of appearance and link it the actual object reference. It won't sort it until you execute the query internally I think (at least that how I'd implement that). Sort happen on the temporary index and value get copied of using the reference into the right position in the new enumerable. – Jimmy Chandra May 14 '15 at 07:34
  • Ham3d, try plugging a list of a million or so random 5 digit ints into the sample code above – Jon G May 14 '15 at 07:34
  • Jimmy, I think pulling the first value from the result necessarily incurs the penalty of the sort, since it must be the first member of the group with the smallest value. – Jon G May 14 '15 at 07:40
  • try reorder your code. list.OrderBy(x => x.Key).GroupBy(x => x).ToDictionary(x=> x.Key, y=> y.ToList()); – JSJ May 14 '15 at 08:28
  • There is no built-in. You could roll your own. – Martijn May 14 '15 at 08:58
  • I think your best option here is to use a `SortedDictioanry` – Magnus May 14 '15 at 09:22
  • @JonG I'm just going to say... try timing it. Because I tried with a list of 1000000 random five-digit integers and `GroupBy(i => i).OrderBy(g => g.Key)` was 25% faster than just `OrderBy(i => i)`. – Rawling May 14 '15 at 10:04
  • What you have is fine. A handrolled solution can be seen at https://ideone.com/HBWY02 – Martijn May 14 '15 at 10:27
  • @JonG: are you actually trying to do a GroupBy()? Because in your posted code, you're just grouping by the values in your collection, which is the same as `Distinct()`. – StriplingWarrior May 14 '15 at 23:17
  • StriplingWarrior - it's not quite the same, GroupBy() returns IEnumerable of group of T, whereas Distinct() returns IEnumerable of T – Jon G May 15 '15 at 06:26

3 Answers3

3
  • You can use ToLookup, which returns an IEnumerable<IGrouping<TKey, TElement> and then do OrderBy for values of each key on demand. This will be O(n) to create the lookup and O(h) to order elements under each group (values for a key) assuming h is the number of elements under a group

  • You can improve the performance to amortized O(n) by using IDictionary<TKey, IOrderedEnumerable<T>>. But if you want to order by multiple properties, it will again by O(h) on the group. See this answer for more info on IOrderedEnumerable. You can also use SortedList<TKey, TValue> instead of IOrderedEnumerable

[Update]:

Here is another answer which you can take a look. But again, it involves doing OrderBy on top of the result.

Further, you can come up with your own data structure as I don't see any data structure available on BCL meeting this requrement.

One possible implementation:

You can have a Binary Search Tree which does search/delete/insert in O(longN) on an average. And doing an in-order traversal will give you sorted keys. Each node on the tree will have an ordered collection for example, for the values.

node roughly looks like this:

public class MyNode
{
    prop string key;
    prop SortedCollection myCollection;
}

You can traverse over the initial collection once and create this special data structure which can be queried to get fast results.

[Update 2]: if you have possible keys below 100k, then I feel implementing your own data structure is an overkill. Generally an order by will return pretty fast and the time taken is tiny. Unless you have large data and you do order by multiple times, ToLookup should work fairly well.

Community
  • 1
  • 1
Amit
  • 25,106
  • 25
  • 75
  • 116
  • I really like ToLookup(), had never noticed that method before. However, in this case the benefit of key-based retrieval doesn't help with the need to sort the keys, so even if I replace GroupBy() with ToLookup(), I still have to sort the entire IEnumerable> by the key. – Jon G May 15 '15 at 06:34
  • there is no DataStructure available out of the box on BCL which can provide ordered keys and ordered values. if you use dictionary/group then order of keys are not guaranteed. you will have to do an order by on it. You can come up with your own DataStructure which stores ordered keys and ordered IList/IEnumerable as value value. – Amit May 15 '15 at 06:45
  • 1
    @Amit What makes you think he needs ordered values? – Rawling May 15 '15 at 07:16
  • @Rawling, looks like you are right, I read the OrderedEnumerable to be ordered values. – Amit May 15 '15 at 07:21
1

Honestly, you're not going to do much better than

items.GroupBy(i => i.KeyProperty).OrderBy(g => g.Key);

GroupBy is an O(n) operation. The OrderBy is then O(k log k) where k is the number of groups.

If you call OrderBy first... well, firstly, your O(n log n) is now in your number of items rather than your number of groups, so it's already slower than the above.

And secondly, an IOrderedEnumerable doesn't have the internal magic you think it does. It isn't an ordered sequence that contains groups of same-ordered items which can then by reordered with ThenBy; it's an unordered sequence with a list of sort keys which ThenBy adds to, which is eventually ordered by each of those keys when you iterate over it.

You may be able to eke out a little more speed by rolling your own "group and sort" loop, maybe manually adding to an SortedDictionary<TKey, IList<TItem>>, but I don't think you're going to get a better big O than what out-of-the-box LINQ gets you.LINQ

Rawling
  • 49,248
  • 7
  • 89
  • 127
  • I remember reading somewhere that OrderedDictionary is indexed based on insertion order and the name should have been IndexedDixtionary. Just something to be aware about the class. – Amit May 15 '15 at 08:26
  • @Amit Whoops. I meant `SortedDictionary`. Cheers for making me look it up! – Rawling May 15 '15 at 08:32
  • Having reflected on the classes involved, I can see you're right about OrderedEnumerable, the sort is performed using quicksort which doesn't explicitly group items of the same value – Jon G May 15 '15 at 11:19
0

I think iterating thru the list for(;;) as you populate Dictionary<T, int>, where value is count of repeated elements will be faster.

Riad Baghbanli
  • 3,105
  • 1
  • 12
  • 20