16

I'm really curious about the program below (yes, run in release mode without a debugger attached). The first loop assigns a new object to each element of the array and takes about a second to run.

So I was wondering which part was taking the most time: object creation or assignment. I created the second loop to test the time required to create the objects, and the third loop to test assignment time, and both run in just a few milliseconds. What's going on?

using System;
using System.Diagnostics;

static class Program
{
    const int Count = 10000000;

    static void Main()
    {
        var objects = new object[Count];
        var sw = new Stopwatch();
        sw.Restart();
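        // Loop 1: allocate a new object for each element; all of them stay reachable through the array.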
        for (var i = 0; i < Count; i++)
        {
            objects[i] = new object();
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds); // ~800 ms
        sw.Restart();
        object o = null;
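        // Loop 2: allocate the same number of objects, but each one is abandoned on the next iteration.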
        for (var i = 0; i < Count; i++)
        {
            o = new object();
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds); // ~ 40 ms
        sw.Restart();
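        // Loop 3: no allocation, just reference assignments into the array.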
        for (var i = 0; i < Count; i++)
        {
            objects[i] = o;
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds); // ~ 50 ms
    }
}
lobsterism
  • 3,469
  • 2
  • 22
  • 36
  • I assume you're also measuring the indexer of `object[]` which also checks if the given index is out of range. **Edit**: @lobsterism: Yes. – Tim Schmelter Aug 10 '13 at 18:27
  • @TimSchmelter wouldn't that happen in the third loop too though? – lobsterism Aug 10 '13 at 18:27
  • 4
    In some cases, the jitter can decide to completely elide the loops when they have no observable effect on the rest of your program. This may be one of those cases. You'd have to look at the disassembled code. – Mike Zboray Aug 10 '13 at 18:28
  • Almost certainly, the slow step is going to be garbage collection, which occurs upon allocation. – Mike Zboray Aug 10 '13 at 18:33
  • @mikez If I replace the 2nd loop with `o = o == null ? new object() : null;` it guarantees each iteration has an effect and surprisingly makes it *faster*. – lobsterism Aug 10 '13 at 18:36
  • 1
    Yes, the JIT is smart enough to see that your code does nothing in the second loop. – Sriram Sakthivel Aug 10 '13 at 18:36
  • 3
    @SriramSakthivel then it shouldn't take 40ms. Also the JIT is stupid as hell right now. It does not know that new object does nothing. – usr Aug 10 '13 at 18:48
  • It's probably optimizing `o`'s reference into a register/generating an array filled with references to `o`. – It'sNotALie. Aug 10 '13 at 18:50
  • 2
    @mike z the garbage collection will not necessarily happen immediately. – Tarik Aug 10 '13 at 18:51
  • 1
    The difference between these cases is that only in case 1 do you keep references to different objects; in case 2 you keep a reference to one object at a time, and in case 3 there is just one object. I don't think cases 2 and 3 add up to the same thing as case 1. – Joanna Derks Aug 10 '13 at 20:55
  • @JoannaTurban "keeping" references doesn't equate to CPU time though. Once a reference is made and assigned, the CPU doesn't have anything to do with it anymore until you actually use the reference. Anyway the solution turned out to be the GC. – lobsterism Aug 10 '13 at 23:31
  • @usr You're *almost* right; see my 2nd comment under supercat's answer. – lobsterism Aug 10 '13 at 23:35
  • @lobsterism Have you tried running your code on mono? – caldis Aug 11 '13 at 07:52

2 Answers

16

When an object which occupies less than 85,000 bytes of RAM and is not an array of double is created, it is placed in an area of memory called the Generation Zero heap. Every time the Gen0 heap grows to a certain size, every object in the Gen0 heap to which the system can find a live reference is copied to the Gen1 heap; the Gen0 heap is then bulk-erased so it has room for more new objects. If the Gen1 heap reaches a certain size, everything there to which a reference exists will be copied to the Gen2 heap, whereupon the Gen1 heap can be bulk-erased.
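As a rough way to watch this promotion happen, `GC.GetGeneration` reports which generation an object currently lives in, and `GC.Collect(0)` forces a Gen0 collection. A minimal sketch (the class name is illustrative, and the exact moments of promotion are up to the runtime, so treat the expected outputs as typical rather than guaranteed):

using System;

static class GenerationDemo
{
    static void Main()
    {
        var survivor = new object();
        Console.WriteLine(GC.GetGeneration(survivor)); // 0: new objects start in Gen0

        GC.Collect(0); // force a Gen0 collection; the still-referenced object is promoted
        Console.WriteLine(GC.GetGeneration(survivor)); // typically 1

        GC.Collect(); // a full collection promotes survivors further
        Console.WriteLine(GC.GetGeneration(survivor)); // typically 2
    }
}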

If many objects are created and immediately abandoned, the Gen0 heap will repeatedly fill up, but very few objects from the Gen0 heap will have to be copied to the Gen1 heap. Consequently, the Gen1 heap will be filled very slowly, if at all. By contrast, if most of the objects in the Gen0 heap are still referenced when the Gen0 heap gets full, the system will have to copy those objects to the Gen1 heap. This will force the system to spend time copying those objects, and may also cause the Gen1 heap to fill up enough that it will have to be scanned for live objects, and all the live objects from there will have to be copied again to the Gen2 heap. All this takes more time.

Another issue which slows things in your first test is that when trying to identify all live Gen0 objects, the system can ignore any Gen1 or Gen2 objects only if they haven't been touched since the last Gen0 collection. During the first loop, the objects array will be touched constantly; consequently, every Gen0 collection will have to spend time processing it. During the second loop, it's not touched at all, so even though there will be just as many Gen0 collections they won't take as long to perform. During the third loop, the array will be touched constantly, but no new heap objects are created, so no garbage-collection cycles will be necessary and it won't matter how long they would take.
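One way to check this against the question's program is to record `GC.CollectionCount(0)` (the total number of Gen0 collections so far) around each loop. A sketch along those lines, using the same `Count` as the question (the class name is illustrative, and the exact counts depend on the runtime and heap sizes); the expected pattern, echoed in the comments below, is that loops 1 and 2 trigger a similar number of Gen0 collections while loop 3 triggers none, so the cost difference comes from how expensive each collection is rather than how many occur:

using System;

static class CollectionCountDemo
{
    const int Count = 10000000;

    static void Main()
    {
        var objects = new object[Count];

        int before = GC.CollectionCount(0);
        for (var i = 0; i < Count; i++)
        {
            objects[i] = new object(); // loop 1: every object stays reachable via the array
        }
        Console.WriteLine(GC.CollectionCount(0) - before);

        object o = null;
        before = GC.CollectionCount(0);
        for (var i = 0; i < Count; i++)
        {
            o = new object(); // loop 2: each object is abandoned on the next iteration
        }
        Console.WriteLine(GC.CollectionCount(0) - before);

        before = GC.CollectionCount(0);
        for (var i = 0; i < Count; i++)
        {
            objects[i] = o; // loop 3: no allocation at all
        }
        Console.WriteLine(GC.CollectionCount(0) - before); // expected to be 0
    }
}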

If you were to add a fourth loop which created and abandoned an object on each pass, but which also stored into an array slot a reference to a pre-existing object, I would expect that it would take longer than the combined times of the second and third loops even though it would be performing the same operations. Not as much time as the first loop, perhaps, since very few of the newly-created objects would need to get copied out of the Gen0 heap, but longer than the second because of the extra work required to determine which objects were still live. If you want to probe things even further, it might be interesting to do a fifth test with a nested loop:

for (int ii = 0; ii < 1024; ii++)
    for (int i = ii; i < Count; i += 1024)
        objects[i] = new object(); // same allocate-and-store as the first loop, but strided across the array

I don't know the exact details, but .NET tries to avoid having to scan entire large arrays of which only a small part is touched by subdividing them into chunks. If a chunk of a large array is touched, all references within that chunk must be scanned, but references stored in chunks which haven't been touched since the last Gen0 collection may be ignored. Breaking up the loop as shown above might cause .NET to end up touching most of the chunks in the array between Gen0 collections, quite possibly yielding a slower time than the first loop.
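For reference, here is a sketch of the suggested fourth loop, written as if appended to the question's Main so it reuses the question's objects, Count and sw variables (the keeper local is a new illustrative name): it allocates and abandons an object on every pass while also storing a reference to one pre-existing object into the array.

// Assumes the objects array, Count constant and sw stopwatch from the question are in scope.
var keeper = new object();
sw.Restart();
for (var i = 0; i < Count; i++)
{
    new object();        // allocated and immediately abandoned, as in the second loop
    objects[i] = keeper; // array write of a pre-existing reference, as in the third loop
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);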

supercat
  • 77,689
  • 9
  • 166
  • 211
  • +1 Analysis with perfmon shows 80% for `% Time in GC` for the first loop, but 0% for the second, even though both have about the same number of collections. – lobsterism Aug 10 '13 at 20:35
  • A small factor in the 2nd loop also ended up being loop unrolling. http://stackoverflow.com/questions/18166501/for-loop-performance-oddity-in-net-x64-even-number-iteration-affinity It unrolls it to e.g. `o = new object(); o = new object(); o = new object(); o = new object()` and increments by 4. Then presumably optimizes out the first three assignments in each iteration. If you make `Count` odd, the optimization doesn't occur and the loop takes about 80 ms. Still the GC was by far the greater cause. – lobsterism Aug 10 '13 at 23:25

14

  1. You create 10 million objects and store them in separate locations in memory. Memory consumption is highest here.
  2. You create 10 million objects, but they are not stored anywhere, just discarded.
  3. You create 1 object and store 10 million references to it; memory consumption is minimal (see the measurement sketch after this list).
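To make the memory claim concrete, here is a minimal sketch using `GC.GetTotalMemory(true)`, which forces a collection and returns the bytes currently thought to be allocated. It is not the profiler used for the graphs below, just a quick way to see that case 1 keeps roughly Count objects alive while case 3 keeps only the array and one object (the class name is illustrative):

using System;

static class MemoryDemo
{
    const int Count = 10000000;

    static void Main()
    {
        long baseline = GC.GetTotalMemory(true);

        // Case 1: the array keeps all 10 million objects reachable.
        var objects = new object[Count];
        for (var i = 0; i < Count; i++)
        {
            objects[i] = new object();
        }
        Console.WriteLine(GC.GetTotalMemory(true) - baseline); // array plus 10 million live objects

        // Case 3: overwrite every slot with one shared object; the case 1 objects become garbage.
        var shared = new object();
        for (var i = 0; i < Count; i++)
        {
            objects[i] = shared;
        }
        Console.WriteLine(GC.GetTotalMemory(true) - baseline); // roughly just the array remains

        GC.KeepAlive(objects);
    }
}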

And yes, the performance analysis below is for only 10 thousand objects (10 million would take way too long).

Performance for ONLY 10 thousand objects

UPDATE: this diagram shows CPU work for memory allocation in the first case. Notice the JIT_New@@... function taking 80.5% of CPU time.

CPU performance case 1

UPDATE2: and for completeness CPU time for CaseTwo.

CPU performance case 2

UPDATE3: Just for completeness, third case

CPU performance case 3

Nenad
  • 24,809
  • 11
  • 75
  • 93
  • 2
    But what does this have to do with speed? – lobsterism Aug 10 '13 at 18:58
  • 1
    Seems logical to me. +1 – Saeed Neamati Aug 10 '13 at 19:04
  • @lobsterism: at the end of the 1st case the array references 10 million objects. In the 2nd case you have 10 million objects that are unused (garbage-collectable). In the 3rd case the array references 1 object 10 million times. 1 object consumes less memory than 10 million. – Nenad Aug 10 '13 at 19:10
  • Seems logical but I'd have preferred an answer with more evidence ensuring that this is really the explanation. Also, for case 1 there is the main point missing: the GC has to traverse millions of objects which is much slower than traversing one object. – usr Aug 10 '13 at 19:15
  • Storing 10 million objects onto the heap is done by the CPU. That's "work" and latency. – Nenad Aug 10 '13 at 19:18
  • 1
    The objects are created on the heap in loop 2 as well. Once they're on the heap, they don't require any extra "work" to stay there. This has nothing to do with execution speed. – lobsterism Aug 10 '13 at 19:32
  • 1
    @lobsterism It is likely that case 2 optimizes away the memory assignment; if that were not so, case 1 and case 2 would use the same amount of memory in the performance graphs posted in this answer. – Scott Chamberlain Aug 10 '13 at 19:34
  • @ScottChamberlain I assume the array itself is included in `CaseOne`, whereas `CaseTwo` only counts the objects. – lobsterism Aug 10 '13 at 19:36
  • Added one more diagram: CPU time for storing object to heap. – Nenad Aug 10 '13 at 19:37
  • 1
    @Nenad you should show the same image of case 2 for comparison. – Scott Chamberlain Aug 10 '13 at 19:38
  • @ScottChamberlain just did that. :) – Nenad Aug 10 '13 at 19:40
  • 1
    Microbenchmarks cannot use line-by-line tracing profiling because the tracing disturbs the measurements. I would use sampling profiling here, maybe using perfview. Also, multiple iterations are needed to even out GC actions. – usr Aug 10 '13 at 19:42
  • 1
    @usr: The point was to make clear where the work goes, not to profile millisecond-precise performance. The difference between the 1st case and the 2nd and 3rd is a factor of 20. That is good enough. – Nenad Aug 10 '13 at 19:47