
This is for C, if the language matters. Down at the assembly level, setting something negative is done using two's complement. With the other version, you're storing the value "0" inside the int variable, and I'm not entirely sure what happens in that case.

I got: 1.90s user 0.01s system 99% cpu 1.928 total for the code below, and I'm guessing most of the runtime was spent incrementing the counter variables.

int main(void)
{
    int i;
    int n;

    i = 0;
    while (i < 999999999)
    {
        n = 0;
        i++;
        n++;
    }
    return (0);
}

I got: 4.56s user 0.02s system 99% cpu 4.613 total for the code below.

int main(void)
{
    int i;
    int n;

    i = 0;
    n = 5;
    while (i < 999999999)
    {
        n *= -1;
        i++;
        n++;
    }
    return (0);
}

I don't understand much about assembly, but it doesn't seem intuitive that a two's complement negation takes more time than setting one value to another. What's the underlying implementation that makes one faster than the other, and what's happening beneath the surface? Or is my test simply a bad one that doesn't accurately portray how fast the operation will actually be in practice?

If it seems pointless, the reason for it is that I can easily implement a "checklist" by multiplying an integer on a map by -1 to mean it has already been checked (but I need to keep the value, so when I do the check I can just negate whatever I'm comparing against). I was wondering whether that's too slow; I could instead make a separate boolean 2D array to record whether each value was checked, or change my data structure into an array of structures so each element could hold a 0/1 flag. I'm wondering what the best implementation is: doing the -1 operation a billion times already totals around 5 seconds, not counting the rest of my program, but making a separate billion-entry int array or a billion structs doesn't seem like the best way either.
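To make the idea concrete, here is a minimal sketch of that sign-flip scheme (the names mark_checked/is_checked and the flat map array are placeholders, not from my actual code). Note it only works if every stored value is strictly positive, since 0 and INT_MIN can't round-trip through negation:

/* Hypothetical sketch of the sign-flip "checklist": negate an entry to
 * mark it checked. Only valid if all stored values are strictly positive
 * (0 stays 0 when negated, and negating INT_MIN overflows). */
void mark_checked(int *map, long idx)
{
    if (map[idx] > 0)
        map[idx] = -map[idx];
}

int is_checked(const int *map, long idx)
{
    return map[idx] < 0;
}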

  • Apart from various other things, `n *= -1` is obviously expected to be slower since it's a read-modify-write while `n = 0` is just a write... Also, I can tell you did not compile with optimization enabled, so the whole comparison is pointless. – Jester Jan 20 '18 at 19:32
  • I'm not allowed to use optimization flags for my own code, so I figured testing without them would be more representative for my purposes. I think I was thinking about it too much in terms of "operations" and "setting one thing to another" instead of read/modify/write. – Jan 20 '18 at 19:34
  • *I don't particularly understand much about assembly, but it doesn't seem intuitive that using the two's complement operation takes more time than setting one thing to another* – well, setting one thing to another just copies the bits. Changing the sign of a two's-complement number is more complex than just flipping the MSB; it involves an addition too, hence it takes longer. – Pablo Jan 20 '18 at 19:38
  • There's negative (not even zero, as you may mislead others later) point in discussing the performance of C source without optimizations enabled. Whatever you are doing, you are doing it extremely wrong, which leads you to wrong measurements, which will lead you to wrong assumptions and wrong knowledge. You aren't even testing multiplication, as `*(-1)` is just negation, so you are doing `n = -n;` (that much is optimized by the compiler even when optimizations are disabled). – Ped7g Jan 20 '18 at 20:07
  • Unclear what you're asking – optimization flags are fundamental for performance. The "not allowed to use optimization flags" remark is a red flag and signals a misunderstanding that needs to be cleared up first. – MSalters Jan 21 '18 at 03:09

2 Answers


Assigning zero is very cheap.

But your microbenchmark tells you very little about what you should do for your large array. Memory bandwidth / cache-miss / cache footprint considerations will dominate there, and your microbench doesn't test that at all.

Using one bit of your integer values to represent checked / not-checked seems reasonable compared to having a separate bitmap. (Having a separate array of 0/1 32-bit integers would be totally silly, but a bitmap is worth considering, especially if you want to search quickly for the next unchecked or the next checked entry. It's not clear what you're doing with this, so I'll mostly just stick to explaining the observed performance in your microbenchmark.)
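For scale: a bitmap over 10^9 entries is about 125 MB, versus roughly 4 GB for a separate array of 32-bit ints. A minimal sketch of the usual set/test helpers (the names here are placeholders):

#include <stddef.h>
#include <stdint.h>

/* One bit per entry: (10^9 + 63) / 64 qwords is about 125 MB, zero-initialized. */
static uint64_t checked_bits[(1000000000ULL + 63) / 64];

static void set_checked(size_t i)
{
    checked_bits[i / 64] |= (uint64_t)1 << (i % 64);
}

static int test_checked(size_t i)
{
    return (checked_bits[i / 64] >> (i % 64)) & 1;
}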

And BTW, questions like this are a perfect example of why SO comments like "why don't you benchmark it yourself" are misguided: because you have to understand what you're testing in quite a lot of detail to write a useful microbenchmark.


You obviously compiled this in debug mode, e.g. gcc with the default -O0, which spills everything to memory after every C statement (so your program still works even if you modify variables with a debugger). Otherwise the loops would optimize away, because you didn't use volatile or an asm statement to limit optimization, and your loops are trivial to optimize.

Benchmarking with -O0 does not reflect reality (of compiling normally), and is a total waste of time (unless you're actually worried about the performance of debug builds of something like a game).
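If you do need to time a loop like this in an optimized build, one standard trick (a sketch, not the only option) is to make the variable volatile so the stores and loads can't be optimized away:

int main(void)
{
    volatile int n = 5;   /* volatile: every read and write of n must actually happen */
    for (int i = 0; i < 999999999; i++) {
        n *= -1;
        n++;
    }
    return 0;
}

The caveat: volatile forces n through memory, so this still measures store-forwarding round trips rather than the cost of the negation by itself; it only stops the compiler from deleting the loop.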


That said, your results are easy to explain, because -O0 compiles each C statement separately and predictably:

  • n = 0; is write-only, and breaks the dependency on the old value.

  • n *= -1; compiles the same as n = -n; with gcc (even with -O0). It has to read the old value from memory before writing the new value.

The store/reload between a write and a read of a C variable across statements costs about 5 cycles of store-forwarding latency on Intel Haswell for example (see http://agner.org/optimize and other links on the tag wiki). (You didn't say what CPU microarchitecture you tested on, but I'm assuming some kind of x86 because that's usually "the default"). But dependency analysis still works the same way in this case.

So the n *= -1 version has a loop-carried dependency chain involving n, with an n++ and a negate.

The n=0 version breaks that dependency every iteration by doing a store without reading the old value. The loop only bottlenecks on the 6-cycle loop-carried dependency of the i++ loop counter. The latency of the n=0; n++ chain doesn't matter, because each loop iteration starts a fresh chain, so multiple can be in flight at once. (Store forwarding provides a sort of memory renaming, like register renaming but for a memory location).


This is all unrealistic nonsense: With optimization enabled, the cost of a unary - totally depends on the surrounding code. You can't just add up the costs of separate operations to get a total, that's not how pipelined out-of-order CPUs work, and compiler optimization itself also makes that model bogus.

Peter Cordes

About the code itself

I compiled your pieces of code to x86_64 assembly using GCC 7.2 without any optimization. I also shortened each piece of code without changing its assembly output. Here are the results.
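The exact invocation isn't important, but to reproduce Intel-syntax output like this it was presumably something along these lines (the file name is a placeholder, and a viewer such as Compiler Explorer additionally filters out assembler directives like .cfi_*):

gcc -S -O0 -masm=intel code1.c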

Code 1:

// C
int main() {
    int n;
    for (int i = 0; i < 999999999; i++) {
        n = 0;
        n++;
    }
}

// assembly
main:
  push rbp
  mov rbp, rsp
  mov DWORD PTR [rbp-4], 0
  jmp .L2
.L3:
  mov DWORD PTR [rbp-8], 0
  add DWORD PTR [rbp-8], 1
  add DWORD PTR [rbp-4], 1
.L2:
  cmp DWORD PTR [rbp-4], 999999998
  jle .L3
  mov eax, 0
  pop rbp
  ret

Code 2:

// C
int main() {
    int n = 5;
    for (int i = 0; i < 999999999; i++) {
        n *= -1;
        n++;
    }
}

// assembly
main:
  push rbp
  mov rbp, rsp
  mov DWORD PTR [rbp-4], 5
  mov DWORD PTR [rbp-8], 0
  jmp .L2
.L3:
  neg DWORD PTR [rbp-4]
  add DWORD PTR [rbp-4], 1
  add DWORD PTR [rbp-8], 1
.L2:
  cmp DWORD PTR [rbp-8], 999999998
  jle .L3
  mov eax, 0
  pop rbp
  ret

In the assembly, the C statements inside the loop correspond to the instructions between the two labels (.L3: and .L2:). In both cases that's three instructions, of which only the first differs. In the first code it is a mov, corresponding to n = 0;. In the second code it is a neg, corresponding to n *= -1;.

According to Agner Fog's instruction tables (http://agner.org/optimize/instruction_tables.pdf), these two instructions have different execution speeds depending on the CPU: one can be faster than the other on one chip while being slower on another. Thanks to aschepler in the comments for the input.

This means, all the other instructions being identical, that you cannot tell which code will be faster in general. Therefore, trying to compare their performance is pointless.

About your intent

Your reason for asking about the performance of these short pieces of code is flawed. What you want is to implement a checklist structure, and you have two competing ideas for building it. One uses a special value (the flipped sign obtained by multiplying by -1) to attach extra meaning to the variables in your map. The other uses additional data, either an external boolean array or a boolean alongside each variable, to attach the same meaning without changing the purpose of the existing variables.

The choice you have to make should be a design decision rather than one motivated by unclear performance concerns. Personally, whenever I face this kind of choice between a special value and additional data with a precise meaning, I tend to prefer the latter option. That's mainly because I don't like dealing with special values, but that's only my opinion.
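As an illustration only (the type and field names are mine, not from the question), the "additional data" option could pair each value with its flag:

struct entry {
    int value;              /* the data itself, untouched by the bookkeeping */
    unsigned char checked;  /* 1 if already checked, 0 otherwise */
};

This keeps each value's meaning intact at the cost of extra memory per element (plus padding), whereas the special-value scheme costs no extra memory but reserves the sign bit.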

My advice would be to go with the solution you can maintain better, namely the one you are most comfortable with and that won't harm future code, and to ask about performance when it matters, or rather if it even matters.

Nelfeal
  • It is not the case that two different opcodes usually have similar performance. According to http://agner.org/optimize/instruction_tables.pdf , for some AMD chips, MOV to a computed memory address from a constant has latency 3, but NEG on a computed memory address has latency 7. On the other hand, on an Intel Pentium, that type of MOV takes one clock cycle and that type of NEG takes 1/3 clock cycle. Though the situation is much more complicated than just a ratio of those numbers. – aschepler Jan 20 '18 at 20:52
  • @aschepler Thanks for the input. I knew some complex instructions could take more time to execute than simple ones, but I thought two simple instructions like MOV and NEG would take about the same amount of time. Edited. – Nelfeal Jan 20 '18 at 22:08
  • MOV is write-only, NEG is read-modify-write! That's a *huge* difference. Repeated MOV to the same address has no dependency on the old one, so can run at one per clock. Repeated NEG has ALU (1 cycle) + store-forwarding latency (5 cycles on many CPUs). Of course, un-optimized code that keeps the loop counter in memory has the same store-forwarding bottleneck limiting the loop to about one iteration per 6 cycles, which is why [benchmarking un-optimized code is a waste of time in the first place](https://stackoverflow.com/questions/32000917/c-loop-optimization-help-for-final-assignment). – Peter Cordes Jan 21 '18 at 08:25