
Theoretically, which is faster on modern CPUs:

  • fetching the NOT result from a lookup table,
  • or computing it with the `~` operator (in C)?

Presuming the whole table fits in L1 cache.

Bitwise not:

uint8_t bitwise_not(uint8_t arg) { return ~arg; }

Table not:

// precalculating the table (once)
uint8_t table[0x100];

void init_table() {
    for (int i = 0; i < 0x100; ++i) { table[i] = ~static_cast<uint8_t>(i); }
}

// function
uint8_t table_not(uint8_t arg) { return table[arg]; }

// xor_not:
uint8_t xor_not(uint8_t arg) { return arg ^ 0xff; }

Over not a single operation but several billion operations, is reading from the L1 cache faster than any logical operation, or not? (I think L1 is faster, but I cannot prove it.)

Practically, how to measure it?

Peter Cordes
vladon
  • Did you attempt to benchmark the code? It would probably be easier for someone to point out an error in your benchmarking technique, rather than writing a benchmark for you. – Jonny Henly Dec 15 '15 at 07:17
  • @JonnyHenly How to benchmark it correctly? It always measures as 1 ns for both functions. Counting processor cycles (`rdtsc`) is not a correct method of measurement, because the cycle count says nothing about execution time (in 2015). – vladon Dec 15 '15 at 07:20
  • A single operation of this type is too fast to easily benchmark, so you'll have to repeat the operation millions or billions of times to get a time that is possible to measure. – Thomas Padron-McCarthy Dec 15 '15 at 07:37
  • If I measure the loop (`get_start_point; for (i = 0; i < some_n; ++i) { ... }; get_end_point;`) it will also measure the `for` loop instructions, which makes no sense... – vladon Dec 15 '15 at 07:44
  • Yes, you'll measure the time for the loop instructions too, but you can see the _difference_ between a loop with the table method and a loop with the `~` method. – Thomas Padron-McCarthy Dec 15 '15 at 12:43
  • On the hardware level, boolean operations like `xor` or `not` are significantly simpler and faster than the arithmetic addition required to calculate an offset. So if the CPU is designed properly, boolean operations should be no slower in any case. – Slava Aug 24 '16 at 14:10

1 Answer


Neither. Just use the `~` operator inline in your code. It's one machine instruction; a function call or a table lookup takes several. There is no way either can possibly be faster.

I can't account for your strange belief that L1 cache is faster than registers.

user207421
  • Assume that the functions are inlined by compiler. I'm not asking which to use, it's more question of theory. – vladon Dec 15 '15 at 07:25
  • My answer remains the same. – user207421 Dec 15 '15 at 07:28
  • Yes, this is the obvious answer, so +1. Assuming, of course, that the particular processor _has_ a machine instruction for that, which, I am told, might not necessarily be true for small embedded systems. For a similar discussion, look at the comments here: http://stackoverflow.com/a/33910543/15727 -- so I guess the longer answer would start with "it depends". But that discussion was about reverse and not inverse, so perhaps all processors have a bitwise not instruction? – Thomas Padron-McCarthy Dec 15 '15 at 07:30
  • @ThomasPadron-McCarthy It will either have a NOT instruction or at worst you can do `XOR 0xff`, both of which will be faster than the alternatives given. – user207421 Dec 15 '15 at 07:33
  • @EJP I understand that on architectures where `NOT` exists, it is likely to be faster than the table method. But when you do `arg XOR 0xFF`, why do you think it is faster than `table[arg]` (which, with the table in L1 cache, is equivalent to reading from a CPU register)? I measured it (using `std::chrono::high_resolution_clock`) on x86_64 (for UINT_MAX iterations); it is nearly the same (difference < 0.01%), sometimes one wins, sometimes the other (maybe a precision error of the clock). – vladon Dec 15 '15 at 07:42
  • @vladon - You have to compute `table[arg]` to get an address, and then read. That cannot be faster than computing `arg op constant` and then *not* reading from memory (or cache). – Bo Persson Dec 15 '15 at 08:19
  • @BoPersson You think it cannot be, but it is nearly the same when benchmarking. – vladon Dec 15 '15 at 08:20
  • @vladon - I said it cannot be faster, not that it *had* to be slower. :-) When executing multiple instructions out of order, some extra code could possibly use execution units that would otherwise be unused. It depends. – Bo Persson Dec 15 '15 at 08:40
  • It is 'nearly the same when benchmarking' because you don't have enough iterations, so you have no precision in your measurement. NB `table[arg]` is *not* 'equivalent to reading from a CPU register'. It requires two loads, an addition, and another load, which the instruction set will probably let you do in fewer than four instructions, but not all in registers. Any L1 cache that is as fast as a CPU register would suggest that the architect got his registers and L1 cache back to front. It is the purpose of CPU registers to be the fastest. – user207421 Dec 16 '15 at 00:31
  • To be slightly fair, `table[arg]` only requires 1 or 2 instructions on a machine like x86-64 (https://godbolt.org/z/Y9oW356M1) that has indexed addressing modes that can include a register and a static address. The load execution unit handles address-generation so it just costs extra latency, not throughput. But yeah, for other ISAs like AArch64 that can't embed a 32-bit address into one load instruction, it takes multiple. And `~` or `^ 0xff` always compile to one simple fast ALU instruction that can never cache miss, not wasting 4 cache lines of L1d. Also being able to vectorize... – Peter Cordes Mar 17 '22 at 17:45