
I'm trying to convert a hex char to integer as fast as possible.

This is only one line: int x = atoi(hex.c_str());

Is there a faster way?

Here, I have tried a more dynamic approach, and it's slightly faster.

int hextoint(char number) {
    if (number == '0') {
        return 0;
    }
    if (number == '1') {
        return 1;
    }
    if (number == '2') {
        return 2;
    }
    /*
     *  3 through 8
     */
    if (number == '9') {
        return 9;
    }
    if (number == 'a') {
        return 10;
    }
    if (number == 'b') {
        return 11;
    }
    if (number == 'c') {
        return 12;
    }
    if (number == 'd') {
        return 13;
    }
    if (number == 'e') {
        return 14;
    }
    if (number == 'f') {
        return 15;
    }
    return -1;
}
54 69 6D
    how about you benchmark? – Mitch Wheat Dec 19 '15 at 00:03
  • Your approach converts only one digit and is ugly. And we have [`std::stoi`](http://en.cppreference.com/w/cpp/string/basic_string/stol) now. – LogicStuff Dec 19 '15 at 00:03
  • @Mitch Wheat I did time the two above, I'm wondering if there is a better method I haven't thought of. – 54 69 6D Dec 19 '15 at 00:05
  • [http://en.cppreference.com/w/cpp/string/basic_string/stol](http://en.cppreference.com/w/cpp/string/basic_string/stol). – spencer7593 Dec 19 '15 at 00:05
  • Does `stoi` work with hex? – 54 69 6D Dec 19 '15 at 00:06
  • Read the documentation. – LogicStuff Dec 19 '15 at 00:07
  • Do benchmarks: "Don't make statements about "efficiency" of code without first doing time measurements. Guesses about performance are most unreliable." – Bjarne Stroustrup – jensa Dec 19 '15 at 00:08
  • if the above method is your actual bottleneck (and I doubt it) then you must be doing something very specialised. And yes, there is a much faster way.... – Mitch Wheat Dec 19 '15 at 00:12
  • @MitchWheat: What's faster than a lookup table? You know a decent compiler will generate the same code for this as it would a `switch`, right? – Lightness Races in Orbit Dec 19 '15 at 00:32
  • You're comparing the number again and again, which is slow (although the compiler might optimize it) and ugly. And how did you time the function? Microbenchmarking is not easy. – phuclv Dec 19 '15 at 02:06
  • @Lightness Races in Orbit: where did I imply I was not referring to a lookup table? – Mitch Wheat Dec 19 '15 at 03:52
  • For OP's very specialized "convert a character (known to be a hex digit) to binary", I'd expect this code to be fast: int x=c-'a'; ... (x<0?c-'0':x)... One subtract, one (predicted) branch, possible additional subtract, no memory touches. If the branch isn't predictable, a predicated version would avoid the branch. – Ira Baxter Dec 19 '15 at 21:55
  • @IraBaxter: probably my answer already. – g24l Dec 21 '15 at 22:05

6 Answers


Proposed solutions that run faster than the OP's if-else:

  • Unordered Map Lookup Table

Provided that your input strings are always hex numbers you could define a lookup table as an unordered_map:

std::unordered_map<char, int> table {
{'0', 0}, {'1', 1}, {'2', 2},
{'3', 3}, {'4', 4}, {'5', 5},
{'6', 6}, {'7', 7}, {'8', 8},
{'9', 9}, {'a', 10}, {'A', 10},
{'b', 11}, {'B', 11}, {'c', 12},
{'C', 12}, {'d', 13}, {'D', 13},
{'e', 14}, {'E', 14}, {'f', 15},
{'F', 15}, {'x', 0}, {'X', 0}};

int hextoint(char number) {
  return table[number];
}
  • Lookup Table as constexpr literal type (C++14)

Or, if you want something faster than an unordered_map, you can use C++14's relaxed constexpr rules to define your table as a literal type, built at compile time:

struct Table {
  long long tab[128];
  constexpr Table() : tab {} {
    tab['1'] = 1;
    tab['2'] = 2;
    tab['3'] = 3;
    tab['4'] = 4;
    tab['5'] = 5;
    tab['6'] = 6;
    tab['7'] = 7;
    tab['8'] = 8;
    tab['9'] = 9;
    tab['a'] = 10;
    tab['A'] = 10;
    tab['b'] = 11;
    tab['B'] = 11;
    tab['c'] = 12;
    tab['C'] = 12;
    tab['d'] = 13;
    tab['D'] = 13;
    tab['e'] = 14;
    tab['E'] = 14;
    tab['f'] = 15;
    tab['F'] = 15;
  }
  constexpr long long operator[](char const idx) const { return tab[(std::size_t) idx]; } 
} constexpr table;

constexpr int hextoint(char number) {
  return table[(std::size_t)number];
}

Live Demo

Benchmarks:

I ran benchmarks with the code written by Nikos Athanasiou that was posted recently on isocpp.org as a proposed method for C++ micro-benchmarking.

The algorithms that were compared are:

1. OP's original if-else:

long long hextoint3(char number) {
  if(number == '0') return 0;
  if(number == '1') return 1;
  if(number == '2') return 2;
  if(number == '3') return 3;
  if(number == '4') return 4;
  if(number == '5') return 5;
  if(number == '6') return 6;
  if(number == '7') return 7;
  if(number == '8') return 8;
  if(number == '9') return 9;
  if(number == 'a' || number == 'A') return 10;
  if(number == 'b' || number == 'B') return 11;
  if(number == 'c' || number == 'C') return 12;
  if(number == 'd' || number == 'D') return 13;
  if(number == 'e' || number == 'E') return 14;
  if(number == 'f' || number == 'F') return 15;
  return 0;
}

2. Compact if-else, proposed by Christophe:

long long hextoint(char number) {
  if (number >= '0' && number <= '9') return number - '0';
  else if (number >= 'a' && number <= 'f') return number - 'a' + 0x0a;
  else if (number >= 'A' && number <= 'F') return number - 'A' + 0X0a;
  else return 0;
}

3. Corrected ternary operator version that also handles capital-letter inputs, proposed by g24l:

long long hextoint(char in) {
  int const x = in;
  return (x <= 57)? x - 48 : (x <= 70)? (x - 65) + 0x0a : (x - 97) + 0x0a;
}

4. Lookup Table (unordered_map):

long long hextoint(char number) {
  return table[(std::size_t)number];
}

where table is the unordered map shown previously.

5. Lookup Table (user constexpr literal):

long long hextoint(char number) {
  return table[(std::size_t)number];
}

where table is the user-defined literal type shown above.

Experimental Settings

I defined a function that transforms an input hex string to an integer:

long long hexstrtoint(std::string const &str, long long(*f)(char)) {
  long long ret = 0;
  for(int j(1), i(str.size() - 1); i >= 0; --i, j *= 16) {
    ret += (j * f(str[i]));
  }
  return ret;
}

I also defined a function that populates a vector of strings with random hex strings:

std::vector<std::string>
populate_vec(int const N) {
  random_device rd;
  mt19937 eng{ rd() };
  uniform_int_distribution<long long> distr(0, std::numeric_limits<long long>::max() - 1);
  std::vector<std::string> out(N);
  for(int i(0); i < N; ++i) {
    out[i] = int_to_hex(distr(eng));
  }
  return out;
}

I created vectors populated with 50000, 100000, 150000, 200000 and 250000 random hex strings respectively. Then, for each algorithm, I ran 100 experiments and averaged the time results.

Compiler was GCC version 5.2 with optimization option -O3.

Results:

(benchmark plots omitted: mean conversion time vs. number of strings, one plot per method comparison)

Discussion

From the results we can conclude that, for these experimental settings, the proposed table method outperforms all the other methods. The if-else method is by far the worst, whereas the unordered_map, although it beats the if-else method, is significantly slower than the other proposed methods.

CODE

Edit:

Results for method proposed by stgatilov, with bitwise operations:

long long hextoint(char x) {
    int b = uint8_t(x);
    int maskLetter = (('9' - b) >> 31);
    int maskSmall = (('Z' - b) >> 31);
    int offset = '0' + (maskLetter & int('A' - '0' - 10)) + (maskSmall & int('a' - 'A'));
    return b - offset;
}

(benchmark plot omitted: stgatilov's bitwise method vs. the table method)

Edit:

I also tested the original code from g24l against the table method:

long long hextoint(char in) {
  long long const x = in;
  return x < 58? x - 48 : x - 87;
}

Note that this method doesn't handle capital letters A, B, C, D, E and F.

Results:

(benchmark plot omitted: g24l's original method vs. the table method)

The table method is still faster.

101010
  • @101010: No explanation why you think this is the *fastest*, or in *what* circumstances you *think* it's the fastest. Example: if this gets called rarely, and you have to wait for memory, it's going to be slower than any solution based on logic+math. – Karoly Horvath Dec 19 '15 at 20:53
  • @KarolyHorvath I guess you didn't read the OP's comment "The literal type table works best for me" – 101010 Dec 19 '15 at 20:55
  • @101010: Yeah, thanks, I missed that. Whether I should trust his assessment is another matter though... – Karoly Horvath Dec 19 '15 at 20:57
  • @KarolyHorvath Fair enough, we are scientists after all, I'll try to add some benchmarks :) – 101010 Dec 19 '15 at 20:58
  • @101010: You'll probably win in a microbenchmark (hot cache). Doing realistic benchmarks is quite tough, even if we know the scenario in which this is going to be used. Which we don't... – Karoly Horvath Dec 19 '15 at 21:01
  • @KarolyHorvath If this is the case, which it is, you would have to down-vote all the answers posted ;) – 101010 Dec 19 '15 at 21:08
  • @101010: Maybe. TBH, I like the current look of the page. There's an accepted answer but with a lower score, which will *hopefully* force the reader to check the details/comments. If you feel it's unfair, feel free to downvote and continue the avalanche I've started... – Karoly Horvath Dec 19 '15 at 21:15
  • Could you please check if your compiler generates branches in ternary operator version? I'm not surprised "compact if-else" version is very slow on random tests, but well-done ternary operator versions must be significantly faster. – stgatilov Dec 21 '15 at 04:27
  • In the so-called corrected version, you are making extra comparisons. It is assumed that the input is hex. If it has illegal chars, that is another issue. The point is that the lookup table has to define the extra elements. – g24l Dec 21 '15 at 10:59
  • So why is there a 14 ms at 450K and a 16 ms at 400K? Please explain. – g24l Dec 22 '15 at 18:35
  • Apart from correcting the discrepancies in your tests by hand to align them to your liking, making claims such as "solution X does not handle errors in the input" is just invalid. Returning 0 when the input is invalid is just as good as returning any other number, since 0 is within the set of valid values. – g24l Dec 23 '15 at 04:58
  • I edited your answer so that there are references to the original work of others. Apart from having written a page that spans 4-5 screens, you are not permitted to include the work of others and make it your own. Especially when your testing method is dubious. – g24l Dec 23 '15 at 13:04
  • @g24l As far as I understand [this help article](http://stackoverflow.com/help/licensing) 101010 *is indeed* allowed to include any content from SO including your work. – TobiMcNamobi Dec 23 '15 at 14:12
  • @TobiMcNamobi "But our cc-wiki licensing, while intentionally permissive, does require attribution." pasting a derived version of code is not proper attribution as stated; a hyperlink is. – g24l Dec 23 '15 at 14:20
  • All of these comparisons and not a single benchmark against [`std::stoi`](http://en.cppreference.com/w/cpp/string/basic_string/stol) – Zac Howland Jan 10 '16 at 20:51
  • Even proposing `unordered_map` for such a simple problem is overkill (unless you were curious to test your unordered_map implementation). Why not simply use `char tab[256] = { -1, -1, ... 0, 1, 2, 3, 4, ... -1, -1, ... 10, 11, 12...}`? Also, the proposed example solutions aren't equivalent: some won't handle unexpected data, and some will. Benchmark results on ARM (mobiles) would most likely be different also. – Pavel P Jun 18 '18 at 22:20
  • `h=(c>>6);h|=(h<<3);h+=(c&15);` Explanation + overflow-free flags: https://gist.github.com/jcdickinson/9a4205287ae107e9f4e5f6764f432db9 – Jonathan Dickinson Oct 04 '18 at 06:12
  • Note that ISO C, at least (not sure about C++), requires `'0'`...`'9'` to be consecutive, but not `'A'`...`'Z'`, so method #2 is non-standard, even if it works on most systems. Method #3 is worse, as it uses numeric ASCII codes. – Alex Shpilkin Jan 28 '19 at 14:06
  • You want `unsigned char number` so it zero-extends to a `size_t` in the 0..255 range, not -128UL to `+127UL`. Or cast like `(size_t)(unsigned char)number`. Unless `table` is a pointer to the middle of an array, or you're fine with UB on high-ASCII characters... – Peter Cordes Mar 28 '20 at 20:09
  • The table solution is a little cheating, because every other method gives us input control (meaning we can check every symbol and throw an exception or whatever), while the table does not; so the table is not preferable if you need to validate input chars. – goldstar Feb 05 '21 at 13:42

This question may evidently have different answers on different systems, and in this sense it is ill-posed from the start. For example, an i486 has no pipeline and a Pentium has no SSE.

The correct question to ask would be: "What is the fastest way to convert a single hex char to dec on system X, e.g. i686?"

Among the approaches herein, the answer is actually the same, or very nearly the same, on a system with a multi-stage pipeline. Any system without a pipeline will bend towards the lookup-table method (LUT), but if memory access is slow the conditional method (CEV) or the bitwise-evaluation method (BEV) may profit, depending on the speed of xor vs. load for the given CPU.

(CEV) decomposes into two load-effective-address instructions, a comparison, and a conditional move from registers, which is not prone to misprediction. All of these instructions are pairable in the Pentium pipeline, so they effectively go in one cycle.

  8d 57 d0                lea    -0x30(%rdi),%edx
  83 ff 39                cmp    $0x39,%edi
  8d 47 a9                lea    -0x57(%rdi),%eax
  0f 4e c2                cmovle %edx,%eax

The (LUT) decomposes into a mov between registers and a mov from a data-dependent memory location, plus some nops for alignment, and should take a minimum of one cycle. As before, there are only data dependencies.

  48 63 ff                movslq %edi,%rdi
  8b 04 bd 00 1d 40 00    mov    0x401d00(,%rdi,4),%eax

The (BEV) is a different beast, as it actually requires two movs, two xors, an and, and a conditional mov. These can also be nicely pipelined.

  89 fa                   mov    %edi,%edx
  89 f8                   mov    %edi,%eax
  83 f2 57                xor    $0x57,%edx
  83 f0 30                xor    $0x30,%eax
  83 e7 40                and    $0x40,%edi
  0f 45 c2                cmovne %edx,%eax

Of course, it is a very rare occasion that converting just a single char is application-critical (maybe the Mars Pathfinder is a candidate). Instead, one would expect to convert a larger string by making a loop and calling that function.

Thus, in such a scenario the code that vectorizes better is the winner. The LUT does not vectorize, while BEV and CEV have better behaviour. In general, such micro-optimization does not get you anywhere; write your code and let the compiler do the optimizing.

So I have actually built some tests in this sense, which are easily reproducible on any system with a C++11 compiler and a random-device source, such as any *nix system. If vectorization is not permitted (-O2), CEV and LUT are almost equal, but once -O3 is set, the advantage of writing more decomposable code shows the difference.

To summarize: if you have an old compiler, use LUT; if your system is low-end or old, consider BEV; otherwise the compiler will outsmart you, and you should use CEV.


Problem: the question is to convert from the set of chars {0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f} to the set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}. There are no capital letters under consideration.

The idea is to take advantage of the linearity of the ASCII table in segments.

[Simple and easy]: Conditional evaluation -> CEV

int decfromhex(int const x)
{
    return x < 58 ? x - 48 : x - 87;
}

[Dirty and complex]: Bitwise evaluation -> BEV

int decfromhex(int const x)
{
    return 9 * (x >> 6) + (x & 0xf);
}

[compile time]: Template conditional evaluation -> TCV

template<char n> int decfromhex()
{
  int constexpr x = n;
  return x<58 ? x-48 : x -87;
}

[Lookup table]: Lookup table -> LUT

int decfromhex(char n)
{
    static int constexpr x[256] = {
        // fill everything with an invalid value, e.g. -1, except
        // positions 48-57 ('0'-'9') and 97-102 ('a'-'f'),
        // which hold 0..15
    };
    return x[(unsigned char)n];
}

Among all of these, the last seems to be the fastest at first look. The template version works only at compile time on constant expressions.

[RESULT] (please verify): BEV is the fastest among all and handles both lower- and upper-case letters, but it is only marginally faster than CEV, which does not handle capital letters. LUT becomes slower than both CEV and BEV as the size of the string increases.

An exemplary result for string sizes 16-12384 can be found below (lower is better).

(bubble plot omitted: average time vs. string size for CEV, BEV, and LUT)

The average time (over 100 runs) is shown. The size of each bubble is the error.

The script for running the tests is available.


Tests have been performed for the conditional (CEV), the bitwise (BEV) and the lookup-table (LUT) methods on a set of randomly generated strings. The tests are fairly simple, and from the test source code the following are verifiable:

  1. A local copy of the input string is placed in a local buffer every time.
  2. A local copy of the results is kept and then copied to the heap for every string test.
  3. Only the time spent operating on the string is extracted.
  4. Uniform approach: there is no complicated machinery or wraparound code fitted to other cases.
  5. No sampling; the entire timing sequence is used.
  6. CPU preheating is performed.
  7. Sleeps between tests occur so that one test does not take advantage of the previous test.
  8. Compilation is performed with g++ -std=c++11 -O3 -march=native dectohex.cpp -o d2h
  9. Launched with taskset -c 0 d2h
  10. No thread dependencies or multithreading.
  11. The results are actually used, to avoid any kind of loop optimization.

As a side note, I have seen in practice version 3 be much faster with older C++98 compilers.

[BOTTOM LINE]: use CEV without fear, unless you know your variables at compile time, in which case you could use TCV. The LUT should only be used after significant per-use-case performance evaluation, and probably with older compilers. Another case is when your set is larger, i.e. {0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f,A,B,C,D,E,F}; this can be handled as well. Finally, if you are performance hungry, use BEV.

The results with the unordered_map have been removed, since it was just too slow to compare; at best it may be as fast as the LUT solution.

Results from my personal PC on strings of size 12384/256 and for 100 strings:

 g++ -DS=2 -DSTR_SIZE=256 -DSET_SIZE=100 -DUNITS=nanoseconds -O3 -std=c++11 -march=native dectohex.cpp -o d2h && taskset -c 0 ./d2h
sign: -2709
-------------------------------------------------------------------
(CEV) Total: 185568 nanoseconds - mean: 323.98 nanoseconds  error: 88.2699 nanoseconds
(BEV) Total: 185568 nanoseconds - mean: 337.68 nanoseconds  error: 113.784 nanoseconds
(LUT) Total: 229612 nanoseconds - mean: 667.89 nanoseconds  error: 441.824 nanoseconds
-------------------------------------------------------------------


g++ -DS=2 -DSTR_SIZE=12384 -DSET_SIZE=100 -DUNITS=nanoseconds -O3 -std=c++11 -march=native hextodec.cpp -o d2h && taskset -c 0 ./d2h

-------------------------------------------------------------------
(CEV) Total: 5539902 nanoseconds - mean: 6229.1 nanoseconds error: 1052.45 nanoseconds
(BEV) Total: 5539902 nanoseconds - mean: 5911.64 nanoseconds    error: 1547.27 nanoseconds
(LUT) Total: 6346209 nanoseconds - mean: 14384.6 nanoseconds    error: 1795.71 nanoseconds
-------------------------------------------------------------------
Precision: 1 ns

The results from a system with GCC 4.9.3, compiled to the metal, without the system being loaded, on strings of size 256/12384 and for 100 strings:

g++ -DS=2 -DSTR_SIZE=256 -DSET_SIZE=100 -DUNITS=nanoseconds -O3 -std=c++11 -march=native dectohex.cpp -o d2h && taskset -c 0 ./d2h
sign: -2882
-------------------------------------------------------------------
(CEV) Total: 237449 nanoseconds - mean: 444.17 nanoseconds  error: 117.337 nanoseconds
(BEV) Total: 237449 nanoseconds - mean: 413.59 nanoseconds  error: 109.973 nanoseconds
(LUT) Total: 262469 nanoseconds - mean: 731.61 nanoseconds  error: 11.7507 nanoseconds
-------------------------------------------------------------------
Precision: 1 ns


g++ -DS=2 -DSTR_SIZE=12384 -DSET_SIZE=100 -DUNITS=nanoseconds -O3 -std=c++11 -march=native dectohex.cpp -o d2h && taskset -c 0 ./d2h
sign: -137532
-------------------------------------------------------------------
(CEV) Total: 6834796 nanoseconds - mean: 9138.93 nanoseconds    error: 144.134 nanoseconds
(BEV) Total: 6834796 nanoseconds - mean: 8588.37 nanoseconds    error: 4479.47 nanoseconds
(LUT) Total: 8395700 nanoseconds - mean: 24171.1 nanoseconds    error: 1600.46 nanoseconds
-------------------------------------------------------------------
Precision: 1 ns

[HOW TO READ THE RESULTS]

The mean is the time, in the units shown, required to compute one string of the given size.

The total time for each test is given. The mean is computed as the total of the timings divided by the number of strings (no other code is in the timed region, though it may be vectorized, and that's OK). The error is the standard deviation of the timings.

The mean tells us what to expect on average, and the error how closely the timings follow a normal distribution. In this case it is a fair measure of error only when it is small (otherwise we should use something appropriate for positive distributions). One should usually expect high errors in the case of cache misses, processor scheduling, and many other factors.


The code defines a single macro to run the tests, permits compile-time variables to set up the tests, and prints complete information, such as:

g++ -DS=2 -DSTR_SIZE=64 -DSET_SIZE=1000 -DUNITS=nanoseconds -O3 -std=c++11 -march=native dectohex.cpp -o d2h && taskset -c 0 ./d2h
sign: -6935
-------------------------------------------------------------------
(CEV) Total: 947378 nanoseconds - mean: 300.871 nanoseconds error: 442.644 nanoseconds
(BEV) Total: 947378 nanoseconds - mean: 277.866 nanoseconds error: 43.7235 nanoseconds
(LUT) Total: 1040307 nanoseconds - mean: 375.877 nanoseconds    error: 14.5706 nanoseconds
-------------------------------------------------------------------

For example, to run the test with a 2-second pause, on strings of size 256, for a total of 10000 different strings, with timings in double precision counted in nanoseconds, the following command compiles and runs the test:

g++ -DS=2 -DSTR_SIZE=256 -DSET_SIZE=10000 -DUTYPE=double -DUNITS=nanoseconds -O3 -std=c++11 -march=native dectohex.cpp -o d2h && taskset -c 0 ./d2h
g24l
  • I wonder why in the first case result is 2280 us, and in the second case it is 2050 us, although the codes are exactly the same. I'm afraid you have some issues in your benchmarking method. Unfortunately, I cannot test it locally as is (MinGW has no high-resolution timer). If I put each benchmarked part in yet another loop with 1000 iterations, I get values 539031 and 1314075 on my Ivy Bridge machine with GCC 4.8.3 and same compiler keys. – stgatilov Dec 21 '15 at 07:52
  • @stgatilov , you can test yourself and decide. That is what the answer says here. The code is available, you can get a very nice g++ compiler by installing linux on a different bootable partition. – g24l Dec 21 '15 at 11:08
  • @stgatilov the difference is processor scheduling probably. – g24l Dec 21 '15 at 13:03
  • Is your BEV correct, or should it be `9 * (x >> 6) + (x & 0xf)`? Your pastebin also uses a ternary for decfromhex2. – ZachB Sep 09 '17 at 06:49
  • You probably want `unsigned char n` so it indexes the 0..255 range, not -128 to +127. – Peter Cordes Mar 28 '20 at 20:05

Well, that's a weird question. Converting a single hex char into an integer is so fast that it is really hard to tell which method is faster: all of them are almost certainly faster than the code you write in order to use them =)

I'll assume the following things:

  1. We have a modern x86(64) CPU.
  2. The input character's ASCII code is stored in a general purpose register, e.g. in eax.
  3. The output integer must be obtained in a general purpose register.
  4. The input character is guaranteed to be a valid hex digit (one of 16 cases).

Solution

Now here are several methods for solving the problem: the first is based on a lookup, two are based on the ternary operator, and the last is based on bit operations:

int hextoint_lut(char x) {
    static char lut[256] = {???};
    return lut[uint8_t(x)];
}

int hextoint_cond(char x) {
    uint32_t dig = x - '0';
    uint32_t alp = dig + ('0' - 'a' + 10);
    return dig <= 9U ? dig : alp;
}
int hextoint_cond2(char x) {
    uint32_t offset = (uint8_t(x) <= uint8_t('9') ? '0' : 'a' - 10);
    return uint8_t(x) - offset;
}

int hextoint_bit(char x) {
    int b = uint8_t(x);
    int mask = (('9' - b) >> 31);
    int offset = '0' + (mask & int('a' - '0' - 10));
    return b - offset;
}

Here are the corresponding assembly listings generated (only the relevant parts are shown):

;hextoint_lut;
movsx   eax, BYTE PTR [rax+rcx]   ; just load the byte =)

;hextoint_cond;
sub edx, 48                       ; subtract '0'
cmp edx, 9                        ; compare to '9'
lea eax, DWORD PTR [rdx-39]       ; add ('0' - 'a' + 10)
cmovbe  eax, edx                  ; choose between two cases in branchless way

;hextoint_cond2;                  ; (modified slightly)
mov eax, 48                       
mov edx, 87                       ; set two offsets to registers
cmp ecx, 57                       ; compare with '9'
cmovbe  edx, eax                  ; choose one offset
sub ecx, edx                      ; subtract the offset

;hextoint_bit;
mov ecx, 57                       ; load '9'
sub ecx, eax                      ; get '9' - x
sar ecx, 31                       ; convert to mask if negative
and ecx, 39                       ; set to 39 (for x > '9')
sub eax, ecx                      ; subtract 39 or 0
sub eax, 48                       ; subtract '0'

Analysis

I'll try to estimate the number of cycles taken by each approach in the throughput sense, which is essentially the time spent per input number when a lot of numbers are processed at once. Consider the Sandy Bridge architecture as an example.

The hextoint_lut function consists of a single memory load, which takes 1 uop on port 2 or 3. Both of these ports are dedicated to memory loads, and they also have address calculation inside, capable of doing rax+rcx at no additional cost. There are two such ports, each able to do one uop per cycle. So supposedly this version takes 0.5 cycles per value. If we have to load the input number from memory, that requires one more memory load per value, so the total cost is 1 cycle.

The hextoint_cond version has 4 instructions, but cmov is broken into two separate uops. So there are 5 uops in total, each of which can be processed on any of the three arithmetic ports 0, 1, and 5. So supposedly it would take 5/3 cycles per value. Note that the memory load ports are free, so the time would not increase even if you have to load the input value from memory.

The hextoint_cond2 version has 5 instructions. But in a tight loop the constants can be preloaded to registers, so there would be only comparison, cmov and subtraction. They are 4 uops in total, giving 4/3 cycles per value (even with memory read).

The hextoint_bit version is a solution guaranteed to have no branches and no lookup, which is handy if you do not want to always check whether your compiler generated a cmov instruction. The first mov is free, since the constant can be preloaded in a tight loop. The rest are 5 arithmetic instructions, which are 5 uops on ports 0, 1, and 5. So it should take 5/3 cycles (even with a memory read).

Benchmark

I have performed a benchmark of the C++ functions described above. In the benchmark, 64 KB of random data is generated, then each function is run many times on this data. All results are added to a checksum to ensure that the compiler does not remove the code. Manual 8x unrolling is used. I have tested on an Ivy Bridge 3.4 GHz core, which is very similar to Sandy Bridge. Each line of output contains: the name of the function, the total time taken by the benchmark, the number of cycles per input value, and the sum of all outputs.

Benchmark code

MSVC2013 x64 /O2:
hextoint_lut: 0.741 sec, 1.2 cycles  (check: -1022918656)
hextoint_cond: 1.925 sec, 3.0 cycles  (check: -1022918656)
hextoint_cond2: 1.660 sec, 2.6 cycles  (check: -1022918656)
hextoint_bit: 1.400 sec, 2.2 cycles  (check: -1022918656)

GCC 4.8.3 x64 -O3 -fno-tree-vectorize
hextoint_lut: 0.702 sec, 1.1 cycles  (check: -1114112000)
hextoint_cond: 1.513 sec, 2.4 cycles  (check: -1114112000)
hextoint_cond2: 2.543 sec, 4.0 cycles  (check: -1114112000)
hextoint_bit: 1.544 sec, 2.4 cycles  (check: -1114112000)

GCC 4.8.3 x64 -O3
hextoint_lut: 0.702 sec, 1.1 cycles  (check: -1114112000)
hextoint_cond: 0.717 sec, 1.1 cycles  (check: -1114112000)
hextoint_cond2: 0.468 sec, 0.7 cycles  (check: -1114112000)
hextoint_bit: 0.577 sec, 0.9 cycles  (check: -1114112000)

Clearly, the LUT approach takes one cycle per value (as predicted). The other approaches normally take from 2.2 to 2.6 cycles per value. In the case of GCC, hextoint_cond2 is slow because the compiler uses cmp+sbb+and magic instead of the desired cmov instructions. Also note that by default GCC vectorizes most of the approaches (last listing), which, as expected, gives faster results than the unvectorizable LUT approach. Note that manual vectorization would give a significantly greater boost.

Discussion

Note that hextoint_cond with ordinary conditional jump instead of cmov would have a branch. Assuming random input hex digits, it will be mispredicted almost always. So performance would be terrible, I think.

I have analysed throughput performance. But if we have to process tons of input values, then we should definitely vectorize the conversion to get better speed. hextoint_cond can be vectorized with SSE in a pretty straightforward way. It processes 16 characters at a time using only 4 instructions, taking about 2 cycles I suppose.

Note that in order to see any performance difference, you must ensure that all the input values fit into cache (L1 is the best case). If you read the input data from main memory, even std::atoi is just as fast as the considered methods =)

Also, you should unroll your main loop 4x or even 8x for maximum performance (to remove looping overhead). As you might have already noticed, the speed of these methods depends highly on which operations surround the code. E.g. adding a memory load doubles the time taken by the first approach, but does not influence the others.

P.S. Most likely you don't really need to optimize this.

stgatilov
  • In your conditional version you're doing one subtraction unnecessarily. This reduces to 3 uops. By the way, I generated random strings and still no problem. How was your assembly generated? – g24l Dec 21 '15 at 03:12
  • @g24l: I used MSVC2013 to compile the functions. I guess you mean we can do a ternary operator on 39 and 87 and subtract either of them? In such case one subtraction will be replaced with mov with immediate. Seems to be 4 uops (with mov being free), why 3? – stgatilov Dec 21 '15 at 04:20
  • @g24l: I have added a version with single subtraction. And a version with bit magic. Bottom line is clear: approaches without LUT do about 5 uops, which should be about 5/3 cycles throughput. – stgatilov Dec 21 '15 at 05:16
  • Your analysis is very interesting indeed. I'll try to write a few tests. I also had thought about the bitmagic version (nice trick indeed), but I think it would not make much of a difference. As you said, the code surrounding may be more important. – g24l Dec 21 '15 at 13:46
  • In the first version, `lea` before `sub` would shorten the critical path latency: `lea`, `sub`, and `cmp` could all run in parallel to feed `cmov`. Also, Broadwell/Skylake would benefit if you used `cmovb` or `cmovae` (1 uop) instead of `cmovbe` (2 uops because it needs to read ZF and CF which are renamed separately: recent Intel avoids flag merging by having instructions only read the parts of FLAGS they need.) https://uops.info/ has counts. Anyway, adjust your `cmp` constant by one, unless that's compiler output which you can't fix manually. – Peter Cordes Mar 28 '20 at 20:01
6

This is my favorite hex-to-int code:

inline int htoi(int x) {
    return 9 * (x >> 6) + (x & 017);
}

It is case-insensitive for letters, i.e. it will return the correct result for both "a" and "A".

olegarch
  • 3,670
  • 1
  • 20
  • 19
3

Assuming that your function is called with a valid hex digit, it will cost on average at least 8 comparison operations (and perhaps 7 jumps). Pretty expensive.

An alternative would be the more compact:

if (number >= '0' && number <= '9')
    return number - '0';
else if (number >= 'a' && number <= 'f')
    return number - 'a' + 0x0a;
else
    return -1;

Yet another alternative would be to use a lookup table (trade space against speed), that you would initialise only once, and then access directly:

if (number >= 0)
    return mytable[number];
else
    return -1;

If you want to convert more than one digit at a time, you could have a look at this question.

Edit: benchmark

Following Ike's observations, I've written a small informal benchmark (available online here) that you can run on your favourite compiler.

Conclusions:

  • The lookup table is always the winner
  • The switch does better than the if-chain.
  • With msvc2015 (release), the second best is my compact version, surprisingly followed closely by the map version of 101010.
  • With gcc on ideone, the second is the switch version, followed by the compact version.
Community
  • 1
  • 1
Christophe
  • 68,716
  • 7
  • 72
  • 138
  • 1
    Why would any compiler emit code that required all those conditionals? `number` obviously doesn't change. I would expect a lookup table to be created. Of course, actually writing a `switch` would be best. – Lightness Races in Orbit Dec 19 '15 at 00:33
  • 1
    @LightnessRacesinOrbit Just out of curiosity, I tried these: (bunch of ifs: https://goo.gl/FJWn0Y) (switch: https://goo.gl/xUPJaF) (range check: https://goo.gl/gIeK2y). Also if anyone wants to use these in an answer (ideally with a benchmark), feel free. –  Dec 19 '15 at 00:48
  • 1
    @Ike Interesting to see that the switch generated the code equivalent to the table approach ! I'd expected that it would have generated a jump table but with a movl and a ret for each entry. – Christophe Dec 19 '15 at 01:15
  • 1
    @Christophe That one most surprised me too that it turned it into a compact LUT. I would guess it's the most performant one among these, though I'm not computer architecture-savvy enough to predict with total confidence without at least profiling it against a careful benchmark. –  Dec 19 '15 at 01:18
  • 1
    Every time a programmer assumes ASCII encoding, god kills a kitten. ;-) – DevSolar Dec 20 '15 at 22:15
  • @DevSolar yes, and even worse when some assume that char is always unsigned ;-) But you're fundamentally right: I felt very bad every time I typed 128 in my naive benchmark; I wanted to spare an unsigned conversion... Nevertheless, in the `char` world, EBCDIC is not so common anymore and the code works with ANSI, ISO 8859-X and UTF-8. – Christophe Dec 20 '15 at 22:42
  • Speaking of the fast approaches (the last three), they are much faster than you show them. I suppose you have actually measured performance of Mersenne Twister random number generator, judging from your code. Move random number generation out of time measurement, you'll get much lower times. – stgatilov Dec 21 '15 at 07:38
  • Also, you pass `number = (char)0` as input value, because `uniform_int_distribution` is inclusive from 0 to 16. Is it intentional? – stgatilov Dec 21 '15 at 07:44
  • @stgatilov the goal was to compare the speeds, so adding the same twister overhead to all doesn't change the conclusion. To get the raw speed, minimizing measurement overhead, I would have built a huge array of precalculated randoms, taking care that the OS doesn't swap in and out too much. – Christophe Dec 21 '15 at 08:11
2

In case you (or someone else) are actually converting an array of values, I made an AVX2 SIMD encoder and decoder that benchmarks ~12x faster than the fastest scalar implementation: https://github.com/zbjornson/fast-hex

The 16 hex values conveniently fit (twice) into a YMM register, so you can use PSHUFB to do a parallelized lookup. Decoding is a bit harder and is based on bit-wise ops.

ZachB
  • 13,051
  • 4
  • 61
  • 89