
I need to map the int values 1, 2, 3 to the chars 'C', 'M', 'A'.

What's the fastest way (this will be called hundreds of times per second, 24/7)?

A macro, an inline function with a bunch of ?: operators or ifs, a switch, or an array?

Boppity Bop
    `int map[4] = { 0, 'C', 'M', 'A' };` – WhozCraig Mar 30 '22 at 16:41
  • yeah you beat me to the edit! :) if you are sure - write it as an answer – Boppity Bop Mar 30 '22 at 16:42
  • 100s of times per second is not very many times per second. You have a computer that can probably do it 1000000000 times per second. Per core. – user253751 Mar 30 '22 at 16:43
  • I think C64 Basic was fast enough for that... Implement it any way you can think of. Then try. Use your favorite search engine on "premature optimisation". – Yunnosch Mar 30 '22 at 16:43
  • There are a lot of other things going on in each call. Optimisation is a must in my case. – Boppity Bop Mar 30 '22 at 16:45
  • If that translation is really static like that, your lookup table can be declared constexpr as well. – WhozCraig Mar 30 '22 at 16:45
  • Where do the numbers 1, 2, 3 come from? – Ted Lyngmo Mar 30 '22 at 16:46
  • Your compiler will optimize such trivialities. If it does not, and you have performance issue, then start profiling and optimizing. – YSC Mar 30 '22 at 16:47
  • "there are a lot of other things go in each call". If that fills up your second, then "optimisation" ... elsewhere... "is a must" . – Yunnosch Mar 30 '22 at 16:49
  • @BoppityBop your computer can do this a billion times a second. And you say it's only doing this 100s of times a second (let's be generous and say it's 999 times). So what is the computer doing for the other 99.9999001% of the time? Maybe you should optimize that thing instead – user253751 Mar 30 '22 at 16:59
  • `static_cast<char>(35+(43-11*x)*x)` (though a lookup table or switch+case is probably going to be faster) – Artyer Mar 30 '22 at 17:05
  • @Artyer -- a lookup table or switch+case also doesn't hard-wire assumptions about the character encoding. – Pete Becker Mar 30 '22 at 17:17
  • @user253751 Yes, but it says 100s times per second 24/7. So if it runs long enough you can put a price tag on the additional energy wasted on a slow computation. – bitmask Mar 30 '22 at 18:21
  • @bitmask no matter how long you run it, the other part of the program uses at least a million times more energy – user253751 Mar 30 '22 at 18:33

3 Answers


A lookup-table seems the most obvious approach as it is also branch-free:

constexpr char map(std::size_t i) {
  constexpr char table[] = "0CMA";
  // if in doubt add bounds checking for i, but it will cost performance
  return table[i];
}

Observe that with optimisation, the lookup table boils down to an integer constant.

Edit: You can shave off an additional instruction if you are less lazy than me and specify the lookup table thusly:

  constexpr char table[] = {0, 'C', 'M', 'A'};
bitmask
  • @user253751 I don't know. If in doubt one would have to benchmark. It *does* get compiled to the same number of instructions, though. And I would argue that magic constants are less maintainable, so there is that. – bitmask Mar 30 '22 at 18:39

My proposal: (char)(0x414D4300 >> (i * 8))

Instead of 0x414D4300 you could write (('C' << 8) | ('M' << 16) | ('A' << 24)).

Only actual testing will tell you whether this is faster than bitmask's answer.

If this runs only 100 times per second, you're chasing a red herring, anyway. Even if you write this section in the dumbest way possible, it seems to be a million times faster than the rest of your program.

user253751
  • How to obfuscate your code for no benefit at all 101 – Aykhan Hagverdili Mar 30 '22 at 19:35
  • If you look at the code that `bitmask` linked, you'd see the same (or very similar) constant produced by compiler. – Vlad Feinstein Mar 31 '22 at 01:01
  • @VladFeinstein I also see a word store and byte load instead of a bit-shift. – user253751 Mar 31 '22 at 08:20
  • luckily, it sounds like [store forwarding does work in such cases](https://stackoverflow.com/questions/46135766/can-modern-x86-implementations-store-forward-from-more-than-one-prior-store), which I was concerned about. If the processor did not implement store forwarding in this scenario, the load could waste 10 to 50 clock cycles. – user253751 Mar 31 '22 at 08:26
  • _"Only actual testing will tell you whether this is faster than bitmask's answer"_ - [Benchmark](https://quick-bench.com/q/KvV5dv8RLjl731n9P85WWUaoZMc) – Ted Lyngmo Apr 04 '22 at 06:48
  • @TedLyngmo Well that is interesting! The memory access method is *faster* than the bitshift method. Edit: although in the bitshift method, a greater percentage of time is counted in the benchmark overhead. Perhaps both are so fast that out-of-order execution executes them in parallel with the benchmark overhead and nothing is really measured. I note that the %3 was optimized in a way that involves a bit-shift so perhaps having two shift instructions instead of one occupies the shifter unit for longer. – user253751 Apr 04 '22 at 08:46
  • Yes, and I got _some_ results where the bitshifts were faster when I ran the benchmark so I can't say for sure which is faster. There's also some diff between g++ and clang++. – Ted Lyngmo Apr 04 '22 at 08:51

You're optimizing your code prematurely. Let me prove it. Here's the most naive, slowest approach that is still reasonable:

#include <map>

char map(int i) {
    std::map<int, char> table;
    table[1] = 'C';
    table[2] = 'M';
    table[3] = 'A';

    return table[i];
}

It generates hundreds of lines of assembly instructions.

And let's time it for a million elements:

#include <iostream>
#include <map>
#include <chrono>
#include <cstdlib>
#include <vector>

char map(int i) {
    std::map<int, char> table;
    table[1] = 'C';
    table[2] = 'M';
    table[3] = 'A';

    return table[i];
}

int main() {
    int const count = 1'000'000;
    std::vector<int> elms(count);
    
    std::srand(0); // I know that rand is no good, but it's OK for this example
    for (int i = 0; i < count; ++i) {
        elms[i] = std::rand() % 3 + 1;
    }

    auto beg = std::chrono::high_resolution_clock::now();
    volatile int do_not_optimize;
    for (int i = 0; i < count; ++i) {
        do_not_optimize = map(elms[i]);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - beg);
    std::cout << "Duration: " << duration.count() << " ms" << std::endl;
}

Here's the output on my computer with lots of other programs running alongside:

Duration: 246 ms

This terrible implementation under terrible conditions still meets your needs 40'650x over. I bet you have way more to worry about elsewhere in your codebase. Remember to make it right, then make it fast only if it's not already fast enough.
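Even if you do stick with std::map, one easy improvement over the snippet above (still not the fastest option, but a sketch of the obvious fix) is to construct the table once rather than on every call:

```cpp
#include <map>

// The table is built once on first call instead of on every call.
// find() avoids operator[]'s side effect of inserting missing keys.
char map_static(int i) {
    static const std::map<int, char> table{{1, 'C'}, {2, 'M'}, {3, 'A'}};
    auto it = table.find(i);
    return it != table.end() ? it->second : 0;
}
```

This removes the per-call construction and destruction of the tree, which dominates the timing above.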

Aykhan Hagverdili