10

I am interested in the asymptotic complexity (big O) of the GroupBy operation on unindexed datasets. What is the complexity of the best known algorithm, and what is the complexity of the algorithms that SQL servers and LINQ use?

Jakub Šturc

3 Answers

5

Regarding LINQ, I assume you are asking about the LINQ-to-Objects GroupBy complexity (Enumerable.GroupBy).

Checking the implementation with ILSpy, it appears to be O(n) (in the .NET Framework 4 series).

It enumerates the source collection once. For each element, it computes the grouping key, checks whether that key is already present in a hashtable mapping keys to element lists (adding the key if it is missing), and then appends the element to the corresponding list.
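The steps above can be sketched roughly as follows (a simplified illustration of the hash-based approach, not the actual .NET implementation; the function and parameter names are my own):

```python
from collections import defaultdict

def group_by(source, key_selector):
    """Hash-based grouping: one pass over the source, expected O(n)."""
    groups = defaultdict(list)   # key -> list of elements
    order = []                   # keys in first-seen order, as Enumerable.GroupBy preserves
    for element in source:
        key = key_selector(element)      # compute the grouping key
        if key not in groups:            # expected O(1) hashtable lookup
            order.append(key)
        groups[key].append(element)      # expected O(1) insert
    return [(key, groups[key]) for key in order]

pairs = group_by([1, 2, 3, 4, 5, 6], lambda x: x % 2)
# groups appear in first-seen key order: odd (key 1), then even (key 0)
```

Since each element costs one key computation plus an expected-constant-time hashtable operation, the whole pass is expected O(n).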

Frédéric
  • +1, though worth noting that hashtable operations are only *expected* amortized O(1); worst-case is O(n) which makes GroupBy's worst case O(n^2), though unlikely in practice. Also worth noting some hash table implementations can access multiple elements on average while still being O(1) because the average number of elements accessed doesn't grow with n, though I think .NET's uses a load factor of 1 so actually only 1 element on average. – Kevin Feb 18 '21 at 08:08
5

Ignoring the base SQL that the GROUP BY is working on, when the data is presented to the GROUP BY operation itself, the complexity is just O(n), since the data is scanned row by row and aggregated in one pass. It scales linearly with n, the size of the dataset.

When GROUP BY is added to a complex query, the equation changes: O(n) becomes the upper bound that the GROUP BY adds to the overall cost. It can be less if the inner query is such that, in resolving the base query, the data already arrives sorted.
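The single-pass aggregation described here can be sketched as a streaming hash aggregate (an illustration only, not SQL Server's actual operator; the column names are made up):

```python
def hash_aggregate(rows, key_col, value_col):
    """One-pass hash aggregate: SUM(value_col) GROUP BY key_col, expected O(n)."""
    totals = {}
    for row in rows:                 # single scan over the data, no sort needed
        key = row[key_col]
        totals[key] = totals.get(key, 0) + row[value_col]
    return totals

rows = [{"dept": "a", "n": 1}, {"dept": "b", "n": 2}, {"dept": "a", "n": 3}]
hash_aggregate(rows, "dept", "n")   # {"a": 4, "b": 2}
```

Each row is touched exactly once, which is why the GROUP BY step itself contributes only O(n) on top of whatever the base query costs.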

RichardTheKiwi
  • And because there is no index, when the data is sorted, you've already spent O(n log n) sorting it. (nitpick: it scales linearly to n, i.e. to the size of the dataset, not to the size of n) – R. Martinho Fernandes Feb 03 '11 at 18:25
  • Sorry, but this is wrong. When you are iterating through the dataset, you have to decide into which group to put a given row/object. I cannot see how group selection can be done in constant time. – Jakub Šturc Feb 03 '11 at 18:40
  • @Jak - SQL Server's stream aggregate operates in near-linear time to n, so even if the overall time is O(n)+O(n log n log n), it is of O(n) complexity – RichardTheKiwi Feb 03 '11 at 18:50
  • O(n) is not constant time, it's linear time. O(1) is constant time. – user7116 Feb 03 '11 at 20:30
  • @sixlettervariables: I know. To perform GroupBy you have to go through all items (that's O(n)) and for each item decide to which group it belongs (that's not O(1)). – Jakub Šturc Feb 03 '11 at 20:36
  • Certainly if you design some group selection mechanism more complicated than O(1) it will add to the complexity. However, say the grouping is on integers, grouping is certainly O(1). If the groupings are string keys, it is O(k) where k is the maximum string length, which we would say is still O(1). Have I missed which part you're saying is > O(1)? – user7116 Feb 03 '11 at 20:51
  • @sixlettervariables: I am arguing that all group selection mechanisms are, in the worst case, more complex than O(1). I cannot see how grouping on integers can be done in constant time (and space). – Jakub Šturc Feb 04 '11 at 09:53
  • @sixlettervariables If I could thumb down that comment I would. You cannot be correct that a string comparison algorithm defined as O(k) is reducible to O(1). If that were the case, then radix sort would have a complexity of O(n) -- in reality, it's O(nk) because, in both string comparison and radix sorts, the length of the comparison key can vary across datasets (and this variation directly impacts computation time in a well-defined and predictable way). – Squirrelsama May 10 '12 at 22:31
  • @Legatou: and I agree...I should have said the comparisons are `O(k)` and `O(1)`. – user7116 May 10 '12 at 22:39
  • @Squirrelsama O(k) is reducible to O(1) in the context where k is a constant (e.g. it is really O(15) which is your field length) relative to your row count (which is n). – NetMage Feb 02 '22 at 22:16
4

Grouping can be done in one pass (O(n)) over sorted rows, and sorting costs O(n log n), so the overall complexity of GROUP BY is O(n log n), where n is the number of rows. If there is an index on each column used in the GROUP BY clause, the sorting is not necessary and the complexity is O(n).
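The sort-then-scan approach can be sketched like this (a simplified illustration of the technique, not any particular database engine's implementation):

```python
from itertools import groupby

def sort_group_by(rows, key):
    """Sort-based grouping: O(n log n) sort followed by an O(n) scan."""
    rows_sorted = sorted(rows, key=key)           # O(n log n); an index would make this free
    # itertools.groupby only merges *adjacent* equal keys, so it needs sorted input
    return [(k, list(g)) for k, g in groupby(rows_sorted, key=key)]

sort_group_by([3, 1, 4, 1, 5, 9, 2, 6], lambda x: x % 3)
```

The sort dominates, giving O(n log n) overall; with a suitable index the rows arrive pre-sorted and only the O(n) scan remains.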

JosefN