
I want to know why OrderBy consumes more memory than simply copying the list and sorting it.

void printMemoryUsage()
{
    // GC.GetTotalMemory(true) forces a full collection before sampling
    long memory = GC.GetTotalMemory(true);
    long mb = 1024 * 1024;
    Console.WriteLine("memory: " + memory / mb + " MB");
}

var r = new Random();
var list = Enumerable.Range(0, 20*1024*1024).OrderBy(x => r.Next()).ToList();

printMemoryUsage();
var listCopy = list.OrderBy(x => x);
foreach (var v in listCopy)
{
    printMemoryUsage();
    break;
}

Console.ReadKey();

The result I got is:

memory: 128 MB

memory: 288 MB

But copying the list and sorting it consumes less memory.

void printMemoryUsage()
{
    // GC.GetTotalMemory(true) forces a full collection before sampling
    long memory = GC.GetTotalMemory(true);
    long mb = 1024 * 1024;
    Console.WriteLine("memory: " + memory / mb + " MB");
}

var r = new Random();
var list = Enumerable.Range(0, 20*1024*1024).OrderBy(x => r.Next()).ToList();

printMemoryUsage();


var listCopy = list.ToList();
printMemoryUsage();
listCopy.Sort();
printMemoryUsage();

Console.ReadKey();

Results are:

memory: 128 MB

memory: 208 MB

memory: 208 MB

More testing shows that the memory consumed by OrderBy is twice the list size.

Mayank Garg
  • I cannot see where you printed memory usage for the starting point, so you can't really tell anything at all. There is also somewhat of an art to measuring things like that in an environment that is self-optimizing, has a debugger, JIT, etc. You need to measure one method multiple times, then repeat using the other method; whichever goes first often alters the metrics – Ňɏssa Pøngjǣrdenlarp Nov 20 '18 at 01:44

4 Answers


It's a bit unsurprising when you dive into how the two approaches are implemented internally. Take a look at the Reference Source for .NET.

In your second approach, where you call the Sort() method on the list, the internal array in the List object is passed to the TrySZSort method, which is implemented in native code outside of C#, meaning no work for the garbage collector.

private static extern bool TrySZSort(Array keys, Array items, int left, int right);

Now, in your first approach you're using LINQ to sort the enumerable. What really happens when you call .OrderBy() is that an OrderedEnumerable<T> object is constructed. Calling OrderBy alone doesn't sort the list; the sort only happens when the result is enumerated, i.e., when its GetEnumerator method is called. GetEnumerator is called implicitly behind the scenes when you call ToList or enumerate with a construct like foreach.
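You can see this deferral directly with the question's printMemoryUsage helper. A minimal sketch (the deferred variable name is illustrative, and it assumes the list from the question has already been built):

IOrderedEnumerable<int> deferred = list.OrderBy(x => x);
printMemoryUsage(); // roughly unchanged: only a small OrderedEnumerable wrapper was allocated

using (IEnumerator<int> e = deferred.GetEnumerator())
{
    e.MoveNext();       // the first MoveNext runs the iterator, which performs the full sort
    printMemoryUsage(); // the extra memory for the internal buffer and map shows up here
}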

You're actually sorting the list twice since you're enumerating the list once on this line:

var list = Enumerable.Range(0, 20*1024*1024).OrderBy(x => r.Next()).ToList();

and again when you enumerate it via foreach on these lines:

var listCopy = list.OrderBy(x => x);
foreach (var v in listCopy)

Since these LINQ methods don't use native code, they rely on the garbage collector to clean up after them. Each of these classes also creates a number of objects (e.g. OrderedEnumerable creates a Buffer<TElement> holding another copy of the array). All of these objects consume RAM.
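If you want to see those allocations directly rather than inferring them from GC.GetTotalMemory, here is a rough sketch using GC.GetAllocatedBytesForCurrentThread (available from .NET Core 3.0 onward). Note that it counts every byte allocated on the thread, including objects that have already been collected, so the number can be larger than the live-set delta you measured:

long before = GC.GetAllocatedBytesForCurrentThread();

foreach (int v in list.OrderBy(x => x))
{
    break; // the first MoveNext forces the full sort
}

long after = GC.GetAllocatedBytesForCurrentThread();
Console.WriteLine("allocated by OrderBy: " + (after - before) / (1024 * 1024) + " MB");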

Connor
  • Thanks for your response. I understand that I am creating the list twice. So the increase in memory after the first list creation should be close to the list size (around 80 MB), but the increase in memory is 160 MB. I didn't get that part. The difference in memory before and after the code `var listCopy = list.OrderBy(x => x); foreach (var v in listCopy)` should be around 80 MB. Why is it 160 MB? That is what I am trying to figure out. Thanks again – Mayank Garg Nov 20 '18 at 02:30
  • Not quite. There is a per-object byte overhead for each allocated object. See Jon Skeet's [answer here](https://stackoverflow.com/a/10655864/4403297) on this. You'd have to run a memory profiler to know for sure how many objects are allocated and where. My guess is that there are far more internal objects behind the scenes in the LINQ approach than you might expect; that, coupled with the per-object overhead, yields the ~160 MB you're seeing. – Connor Nov 20 '18 at 03:10

I had to do some research on this one and found some interesting information. The default List.Sort performs an in-place sort (no second copy), but it does so via a call to Array.Sort, which ultimately calls through to TrySZSort, a heavily optimized, native, unmanaged CLR function that selects the specific sort algorithm based on the input type. In most cases it performs what's called an introspective sort, which combines QuickSort, HeapSort, and insertion sort for maximum efficiency. This is done in unmanaged code, meaning it's generally faster and more efficient.

If you're interested in going down the rabbit hole, the Array.Sort source is here and the TrySZSort implementation is here. Ultimately, though, the use of unmanaged code means the garbage collector doesn't get involved, and thus less memory is used.

The implementation used by OrderBy is a standard quicksort, and the OrderedEnumerable actually creates a second copy of the keys used in the sort (in your case the key is the only field, though that wouldn't have to be the case if you had a larger class object with only a property or two used as the sort key). That leads to exactly what you observed: additional usage equal to the size of the collection for the second copy. Assuming you then wrote that out to a List or Array (rather than an OrderedEnumerable) and waited for or forced a garbage collection, you should recover most of that memory. The Enumerable.OrderBy method source is here if you want to dig into it.
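To illustrate that last point, a small sketch using the printMemoryUsage helper from the question (exact numbers will vary):

List<int> sorted = list.OrderBy(x => x).ToList(); // materialize the sort into a plain list
printMemoryUsage(); // the OrderedEnumerable's temporary buffer and map are garbage now;
                    // GC.GetTotalMemory(true) inside the helper collects them, so expect
                    // roughly the original list plus the sorted copy (~208 MB in the
                    // question's numbers), not the 288 MB seen mid-enumeration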

OperatorOverload
  • Thanks for your response. `int` size on my machine is 4 bytes, so we can safely assume that `list` size is around 80 MB. Now I understand that OrderedEnumerable will create another copy, so I should see an increase of 80 MB of memory after these two lines `var listCopy = list.OrderBy(x => x); foreach (var v in listCopy) {`. But I am getting a memory increase of 160 MB, which doesn't make sense to me. Before these two lines, memory usage was 128 MB, and after them it is 288 MB. It seems like it copied the list twice. – Mayank Garg Nov 20 '18 at 02:44

The source of the extra memory can be found in the implementation of OrderedEnumerable, which is created on this line:

IOrderedEnumerable<int> listCopy = list.OrderBy(x => x);

OrderedEnumerable is a generic implementation that sorts by any criteria you provide, which is distinctly different from the implementation of List.Sort, which sorts elements only by value. If you follow the code of OrderedEnumerable, you will find it creates a buffer into which your values are copied, accounting for an extra 80MB (4*20*1024*1024) of memory. The additional 40MB (2*20*1024*1024) is associated with the structures created to sort the list by the keys.

Another thing to note is that not only does OrderBy(x => x) result in more memory use, it also uses a lot more processing power: by my testing, calling Sort is about 6 times faster than using OrderBy(x => x). A rough comparison is sketched below.
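This is a sketch, not a rigorous benchmark (no warm-up, single run); timings will vary by machine and runtime, and it assumes the list from the question:

System.Diagnostics.Stopwatch sw = System.Diagnostics.Stopwatch.StartNew();
List<int> copy1 = list.ToList();
copy1.Sort();                     // in-place, native sort
sw.Stop();
Console.WriteLine("List.Sort: " + sw.ElapsedMilliseconds + " ms");

sw.Restart();
List<int> copy2 = list.OrderBy(x => x).ToList(); // LINQ sort, materialized
sw.Stop();
Console.WriteLine("OrderBy:   " + sw.ElapsedMilliseconds + " ms");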

The List.Sort() method is backed by a heavily optimised native implementation for sorting elements by their value, whereas the LINQ OrderBy method is far more versatile and consequently less optimised for simply sorting a list by value...

IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)

P.S. I would suggest you stop using var in place of the actual variable types, as it hides valuable information from the reader about how the code actually functions. I recommend developers use the var keyword only for anonymous types.

Mick
  • Thanks for your response. I will keep in mind not to use `var` anymore. Regarding your answer, there is a difference of 160 MB in memory; 80 MB goes to the buffer, which leaves 80 MB unaccounted for. After reading the other answers, I think it is consumed by the `map` in the GetEnumerator of OrderedEnumerable. – Mayank Garg Nov 20 '18 at 04:09
  • Your `var` suggestion is opinion-based, and there are no rules about when you should or shouldn't use it. For example, these guys have a different opinion: https://intellitect.com/when-to-use-and-not-use-var-in-c/ – Access Denied Nov 20 '18 at 04:10
  • To add to the `var` thing, it's fine to have your preferences and to share them with others, but you should acknowledge that it is a very subjective thing (like tabs vs spaces). Microsoft's C# conventions also have a different opinion on when it should or shouldn't be used: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/inside-a-program/coding-conventions#implicitly-typed-local-variables – Kirlac Nov 20 '18 at 04:13
  • @AccessDenied Yes, it's an opinion. I think the code submitted in this question highlights how problematic the use of var can be, with ToList() being called on variables which are already a list. – Mick Nov 20 '18 at 04:21
  • @Kirlac I did use the words "I would suggest" and "I recommend"; I think that implies subjectivity. And yes, I'm aware there are other preferences; JetBrains tooling actually flags using the type instead of var as something to be "fixed". I don't think var was intended as a lazy shortcut. And the argument that it stops code from breaking when the types of methods change is a falsehood; when types change in your application, I'd prefer to get a compile error and be made to think about the ramifications, rather than be unaware of the impact because var has been used everywhere. – Mick Nov 20 '18 at 04:27
  • It is a matter of opinion, but I think using the full type instead of var improves readability, especially when pasting code snippets. Thanks @Mick for the suggestion – Mayank Garg Nov 20 '18 at 04:27

Connor's answer gave a clue to what is happening here, and the implementation of OrderedEnumerable makes it clearer. The GetEnumerator of OrderedEnumerable is:

    public IEnumerator<TElement> GetEnumerator() {
        Buffer<TElement> buffer = new Buffer<TElement>(source);
        if (buffer.count > 0) {
            EnumerableSorter<TElement> sorter = GetEnumerableSorter(null);
            int[] map = sorter.Sort(buffer.items, buffer.count);
            sorter = null;
            for (int i = 0; i < buffer.count; i++) yield return buffer.items[map[i]];
        }
    }

Buffer is another copy of the original data, and map keeps the mapping of the sorted order. So, if the code is:

// memory_foot_print_1
var sortedList = originalList.OrderBy(v => v);
foreach (var v in sortedList)
{
    // memory_foot_print_2
    ...
}

Here, memory_foot_print_2 will be equal to memory_foot_print_1 + size_of(originalList) + size_of(new int[count_of(originalList)]) (assuming no GC)

Thus, if originalList is a list of ints of size 80 MB, memory_foot_print_2 - memory_foot_print_1 = 80 + 80 = 160 MB. And if originalList is a list of longs of size 80 MB, memory_foot_print_2 - memory_foot_print_1 = 80 + 40 (size of map) = 120 MB (assuming ints are 4 bytes and longs are 8 bytes), which is what I was observing.
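A sketch to check the long case concretely, reusing the printMemoryUsage helper from the question (10M longs is again ~80 MB, so the int map adds ~40 MB):

Random r = new Random();
List<long> longList = Enumerable.Range(0, 10 * 1024 * 1024)
                                .Select(x => (long)r.Next())
                                .ToList();        // 10M longs, ~80 MB

printMemoryUsage();                               // memory_foot_print_1
foreach (long v in longList.OrderBy(x => x))
{
    printMemoryUsage();                           // memory_foot_print_2: expect roughly +120 MB
    break;                                        // one element is enough; the sort already ran
}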

This leads to another question: whether it makes sense to use OrderBy for larger objects.

Mayank Garg