
After answering that SO question and being downvoted, I'd like to check something with you.

To have a rough idea of the cost of the code I write, I tend to scale operations this way:

  • Heap allocation is around 1000 times slower than Stack allocation.
  • IO with the screen/console is around 1000 times slower than Heap allocation.
  • IO with the hard drive is around 1000 times slower than graphical IO on screen.

Do you think these are correct assumptions / orders of magnitude / estimates?

(And of course, nothing beats real profiling of the application :-))

EDIT: as a first conclusion from your answers and comments, one may say that my figure of 1000 is a large overestimate.
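
For reference, here is a rough sketch of the kind of micro-measurement one could run to check the first ratio (assuming a POSIX clock_gettime; the numbers depend heavily on compiler flags, the allocator and the cache, and the optimiser may hoist or remove the stack work entirely):

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Rough heap-vs-stack timing sketch; results vary widely with compiler
   flags, the allocator and cache behaviour. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    enum { N = 10000000 };
    volatile char sink = 0;              /* forces some observable work per iteration */

    double t0 = now_sec();
    for (int i = 0; i < N; i++) {
        char *p = malloc(1024);          /* heap: call into the allocator each time */
        p[0] = 1;
        sink += p[0];
        free(p);
    }
    double t1 = now_sec();
    for (int i = 0; i < N; i++) {
        char buf[1024];                  /* stack: typically just a stack-pointer adjustment */
        buf[0] = 1;
        sink += buf[0];
    }
    double t2 = now_sec();

    printf("heap:  %.3f s\nstack: %.3f s  (sink=%d)\n", t1 - t0, t2 - t1, (int)sink);
    return 0;
}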

Stephane Rolland
  • I seriously doubt that heap allocation is 1000 times slower than stack allocation. It is slower, of course, but not by that much. – driis Feb 12 '11 at 11:15
  • This is meaningless. What does it mean to compare the speed of heap allocation and console IO? – Oliver Charlesworth Feb 12 '11 at 11:15
  • @Oli: especially when such output is buffered (or so I hope) :/ – Matthieu M. Feb 12 '11 at 11:18
  • @Oli: it actually may make sense. For example, one might want to compare the time of passing an integer from one program part to another (1) through a heap-allocated structure, and (2) through a file (with or without a data flush). – Vlad Feb 12 '11 at 11:18
  • @Vlad: Tenuous (and probably not what the OP had in mind), but yes, you're right. – Oliver Charlesworth Feb 12 '11 at 11:19
  • Well, the safest way to test it is to benchmark: about 100M operations of each, checking start/end times. It'll be specific to your platform, but you'd get a rough idea. – Raven Feb 12 '11 at 11:24
  • @Oli: that's why I wrote "may" and not "does" :) – Vlad Feb 12 '11 at 11:24
  • You realize you've indirectly stated that hard drive IO is a billion times slower than stack "allocation"? If stack allocation takes a single CPU cycle on a 1 GHz machine, then you're saying that hard drive I/O would take a full second on the same machine? Do the math, you're insane. – SoapBox Feb 12 '11 at 12:14
  • I realize it was a good idea to ask this question, since it seems I am wrong :) – Stephane Rolland Feb 12 '11 at 12:50
  • @Oli. Just to be clearer, imagine one little chunk of data, one int. In my statement I was comparing the cost of one int on the stack, one int on the heap, one int to the graphical buffer (changing one pixel on the screen, for example), one int written to the disk. – Stephane Rolland Feb 12 '11 at 12:53
  • @Stephane, it would depend on the environment even more than console I/O. Changing one pixel in DOS is basically directly writing to video RAM. Doing the same thing with GDI will be very much different. Using DirectX you get yet another result. In X11 it may be yet another time. And these times could very well differ in orders of magnitude. And writing an int to a disk could be very different depending on whether the file is already opened or not. – Sergei Tachenov Feb 12 '11 at 13:39
  • *Heap allocation is around 1000 times slower than Stack allocation.* That's too abstract a view to understand how memory works. Accessing memory outside the cache is slow, and nowadays memory access "is the new hard disk". The cost is about how many cache misses occur; the more indirections, the higher the cost. Memory is just not random access. Microbenchmarks do lie, and they lie a lot if you don't know how/what to test. – bestsss Feb 12 '11 at 21:24

3 Answers


If you're going to make massive generalisations like that, you might want to think about having hard data to back them up.

I don't doubt that you're right about the relative efficiencies on most architectures (I say "most" simply because there may be some weird architectures out there that I don't know about), but the 1000x ratios are suspect without proof.

And actually, I'm not that certain about the relative efficiency of screen and disk I/O since it can be affected by buffering. I've often found that a program outputting thousands of lines to the screen runs faster when directing the output to a disk file.

For example, the following program:

#include <stdio.h>
int main (void) {
    int i;
    for (i = 100000; i > 0; i--)
        printf ("hello\n");
    return 0;
}

runs as:

pax$ time myprog
hello
hello
:
hello

real    0m12.861s
user    0m1.762s
sys     0m2.002s

pax$ time ./myprog >/tmp/qq

real    0m0.191s
user    0m0.160s
sys     0m0.050s

In other words, screen I/O in that environment (Cygwin under XP) takes 67 times as long in elapsed time and 17 times as long in CPU time (presumably because of all the window updates).
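
If you also want to rule the filesystem out, the same comparison can be made with the output discarded entirely (no timings shown, as that run wasn't measured here):

pax$ time ./myprog >/dev/null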

paxdiablo

Here's another quick and interesting, if not scientifically reliable or well-thought-out, test:

char *memory;
NSLog (@"Start heap allocs");
for (int allocations = 0;  allocations < 100000000;  allocations++)
{
    memory = malloc (1024);
    memory[0] = 1;
    memory[1023] = memory[0] + 1;
    free(memory);
}
NSLog (@"End heap allocs");
NSLog (@"Start stack allocs");
for (int allocations = 0;  allocations < 100000000;  allocations++)
{
    char memory2 [1024];
    memory2[0] = 1;
    memory2[1023] = memory2[0] + 1;
}
NSLog (@"End stack allocs");

and the output:

2011-02-12 11:46:54.078 Veg Met Chilli[4589:207] Start heap allocs
2011-02-12 11:47:06.759 Veg Met Chilli[4589:207] End heap allocs
2011-02-12 11:47:06.759 Veg Met Chilli[4589:207] Start stack allocs
2011-02-12 11:47:07.057 Veg Met Chilli[4589:207] End stack allocs

Do the maths yourself, but that makes heap allocs about 42 times longer. I must stress DO NOT QUOTE ME on that; there are bound to be flaws in it! Most notably, the relative time it takes to actually assign values into the data.

EDIT: New test data.

So now I'm simply calling a method for each heap and stack alloc, rather than having them immediately in the loop. Results:

2011-02-12 12:13:42.644 Veg Met Chilli[4678:207] Start heap allocs
2011-02-12 12:13:56.518 Veg Met Chilli[4678:207] End heap allocs
2011-02-12 12:13:56.519 Veg Met Chilli[4678:207] Start stack allocs
2011-02-12 12:13:57.842 Veg Met Chilli[4678:207] End stack allocs

This makes heap allocs only about 10 times as long as stack allocs. To make the results more accurate, I should also have a control method which does no memory allocation (but at least does something so as not to get optimised out), and take away that time. I'll do that next...

EDIT: Right... Now the code looks like this:

int control = 0;
NSLog (@"Start heap allocs");
for (int allocations = 0;  allocations < 100000000;  allocations++)
{
    control += [self HeapAlloc];
}
NSLog (@"End heap allocs");
NSLog (@"Start stack allocs");
for (int allocations = 0;  allocations < 100000000;  allocations++)
{
    control += [self StackAlloc];
}
NSLog (@"End stack allocs");
NSLog (@"Start no allocs");
for (int allocations = 0;  allocations < 100000000;  allocations++)
{
    control += [self NoAlloc];
}
NSLog (@"End no allocs");
NSLog (@"%d", control);


-(int) HeapAlloc
{
    int controlCalculation = rand();   // some observable work so the call can't be optimised away

    char *memory = malloc (1024);      // heap allocation: touch both ends, then free
    memory[0] = 1;
    memory[1023] = memory[0] + 1;
    free(memory);

    return controlCalculation;
}

-(int) StackAlloc
{
    int controlCalculation = rand();

    char memory [1024];                // stack allocation, touched the same way
    memory[0] = 1;
    memory[1023] = memory[0] + 1;

    return controlCalculation;
}

-(int) NoAlloc
{
    int controlCalculation = rand();   // control: same work, no allocation at all

    return controlCalculation;
}

and the results are:

2011-02-12 12:31:32.676 Veg Met Chilli[4816:207] Start heap allocs
2011-02-12 12:31:47.306 Veg Met Chilli[4816:207] End heap allocs
2011-02-12 12:31:47.306 Veg Met Chilli[4816:207] Start stack allocs
2011-02-12 12:31:49.458 Veg Met Chilli[4816:207] End stack allocs
2011-02-12 12:31:49.459 Veg Met Chilli[4816:207] Start no allocs
2011-02-12 12:31:51.325 Veg Met Chilli[4816:207] End no allocs

So the control time is 1.866 seconds. Taking that away from the alloc times gives: stack 0.286 seconds, heap 12.764 seconds.

So heap allocations take about 45 times as long as stack allocations.

Thank you and good night! :)

Dave
  • I'm not _entirely_ certain that that's a stack allocation each time through the loop. I've seen assembler code generated that decrements the stack for something like that _once_ at function start. That's certainly a valid optimisation - the only requirement is that the variable is _usable_ only within that block, not that the stack goes up and down by 1K every time through the loop. However, since the act of adding to the stack pointer is usually incredibly fast, your figures are probably close enough. – paxdiablo Feb 12 '11 at 11:58
  • Yeah, it's likely that compiler optimisations invalidate this test in a number of ways! – Dave Feb 12 '11 at 12:02
  • @paxdiablo, I just checked GCC assembly output with no optimizations, it indeed does allocate everything at the start of the function. Not that it matters though, as even updating the loop variable probably takes 10 times more time than changing the stack pointer register. – Sergei Tachenov Feb 12 '11 at 12:05
  • Maybe I should call a function which then does the allocs? – Dave Feb 12 '11 at 12:10
  • @Dave, that would be a good idea. I know from experience that stack frame setup and teardown can sometimes be significant. – paxdiablo Feb 12 '11 at 12:11
  • The answer to all questions is 42 – erikkallen Feb 12 '11 at 12:16
  • Heh-heh! I'm sure if I massaged the latest results, I could get it back down, because it's close enough! – Dave Feb 12 '11 at 12:45
  • Maybe compiler optimization biases that test, but these are good tests. Obviously my figure of 1000 was much too high... Many thanks for these tests. – Stephane Rolland Feb 12 '11 at 12:55
  • @Dave, the 1st heap allocation will be much slower than the 2nd; and if you then return the memory to the heap, you have all the memory in the CPU cache, so the test is **flawed**. Again, write a microbenchmark only if you understand what you're testing. Testing allocations that stay in the cache, in a single thread, is just not how a real application works. – bestsss Feb 12 '11 at 21:28
  • I agree, bestsss. I mentioned at least three times that this should not be taken as gospel. There's no substitute for just seeing how something affects your live code. – Dave Feb 12 '11 at 21:40

The 1st point depends on a lot of things, really. If you run out of memory, then allocating something on the heap may take literally minutes. The stack, on the other hand, may already be allocated at that point.

The 2nd point depends on the terminal being used. Outputting to a DOS screen is one thing, outputting to a Windows console window is another, and xterm is something entirely different from them too.

As for the 3rd point, I'd rather say it's the other way around for modern hard disks. They can easily handle megabytes per second; how could you output that much to any terminal in such a short time? For small amounts of data you may be right, though, as hard drive I/O may take some time to prepare.
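
To put a rough number on that, here is a small sketch (a quick illustration assuming a POSIX environment, not a careful benchmark) that pushes about 100 MiB through stdio to a file and reports the apparent throughput; most of what it measures is the OS write cache, which is exactly why buffered disk output is so hard for a terminal to match:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Rough disk-throughput sketch: write ~100 MiB through stdio.
   This mostly measures the OS write cache; add fsync() on the file
   descriptor if you want the physical write included. */
int main(void)
{
    static char block[1 << 20];                /* 1 MiB buffer */
    memset(block, 'x', sizeof block);

    FILE *f = fopen("throughput.tmp", "wb");
    if (!f) { perror("fopen"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 100; i++)
        fwrite(block, 1, sizeof block, f);
    fflush(f);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    fclose(f);
    remove("throughput.tmp");

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("wrote 100 MiB in %.3f s (~%.0f MiB/s)\n", secs, 100.0 / secs);
    return 0;
}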

Sergei Tachenov
  • I chose this one as the final answer, since it's the most balanced... although the other two are also interesting enough... and moderating... – Stephane Rolland Feb 16 '11 at 09:03