
Here is my current code:

// Input:  hex string, e.g. 1234ABCDEEFF0505DDCC ...
// Output: BYTE stream
void HexString2Hex(/*IN*/ char* hexstring, /*OUT*/ BYTE* hexBuff)
{
    for (int i = 0; i < strlen(hexstring); i += 2)
    {
        BYTE val = 0;
        if (hexstring[i] < 'A')
            val += 0x10 * (hexstring[i] - '0');
        else
            val += 0xA0 + 0x10 * (hexstring[i] - 'A');

        if (hexstring[i+1] < 'A')
            val += hexstring[i + 1] - '0';
        else
            val += 0xA + hexstring[i + 1] - 'A';

        hexBuff[i / 2] = val;
    }
}

The problem is: when the input hex string is very big (such as 1000000 in length), this function takes hundreds of seconds, which is unacceptable for me. (CPU: i7-8700, 3.2 GHz; memory: 32 GB.)

So, are there any alternative algorithms that can do the work more quickly?

Thank you guys

Edit 1: Thanks to paddy's comment. I was too careless to notice that strlen (itself O(n)) was executed on every iteration, so my original function was O(n*n), which is terrible.

The updated code is below:

int len=strlen(hexstring);
for (int i = 0; i < len; i += 2)

As for Emanuel P's suggestion, I tried it, but it didn't seem to help. Below is my code:

map<string, BYTE> by_map;
//init table (map here)
const char* xx1 = "0123456789ABCDEF";
char _tmp[3] = { 0 };  // key buffer; this declaration was missing from the original snippet
    for (int i = 0; i < 16; i++)
    {
        for (int j = 0; j < 16; j++)
        {
            _tmp[0] = xx1[i];
            _tmp[1] = xx1[j];

            BYTE val = 0;
            if (xx1[i] < 'A')
                val += 0x10 * (xx1[i] - '0');
            else
                val += 0xA0 + 0x10 * (xx1[i] - 'A');

            if (xx1[j] < 'A')
                val += xx1[j] - '0';
            else
                val += 0xA + xx1[j] - 'A';

            by_map.insert(map<string, BYTE>::value_type(_tmp, val));
        }
    }
//search map
void HexString2Hex2(char* hexstring, BYTE* hexBuff)
{
    char _tmp[3] = { 0 };
    size_t len = strlen(hexstring);  // hoisted out of the loop, per Edit 1
    for (size_t i = 0; i < len; i += 2)
    {
        _tmp[0] = hexstring[i];
        _tmp[1] = hexstring[i + 1];
        //DWORD dw = 0;
        //sscanf(_tmp, "%02X", &dw);
        hexBuff[i / 2] = by_map[_tmp];
    }
}

Edit 2: In fact, my problem was solved once I fixed the strlen bug. Below is my final code:

void HexString2Bytes(/*IN*/ char* hexstr, /*OUT*/  BYTE* dst)
{
    static uint_fast8_t LOOKUP[256];
    for (int i = 0; i < 10; i++)
    {
        LOOKUP['0' + i] = i;
    }
    for (int i = 0; i < 6; i++)
    {
        LOOKUP['A' + i] = 0xA + i;
    }

    for (size_t i = 0; hexstr[i] != '\0'; i += 2)
    {
        *dst = LOOKUP[hexstr[i]] << 4 |
            LOOKUP[hexstr[i + 1]];
        dst++;
    }
}

Btw, sincere thanks to you guys. You are awesome, real researchers!

dAEMON9527
  • You could start by not calling `strlen` every single iteration. – paddy Apr 12 '21 at 07:27
  • Are you wanting to use Intel SIMD intrinsics like `_mm_cmpgt_epi8` or `_mm_shuffle_epi8` to process 16 or 32 bytes at once? That's [very effective for int->hex string](https://stackoverflow.com/questions/53823756/how-to-convert-a-binary-integer-number-to-a-hex-string/66518284#66518284), and can be effective for the other direction. – Peter Cordes Apr 12 '21 at 07:30
  • THANKS!!!!!! I was too careless @paddy – dAEMON9527 Apr 12 '21 at 07:31
  • In addition to paddy's comment, you may also want to consider using a lookup table for the characters instead of conditional branching. – Emanuel P Apr 12 '21 at 07:36
  • After hoisting `strlen`, the amount of work per iteration (per 2 hex digits) is still pretty significant here, and GCC -O3 output (https://godbolt.org/z/8Tnvdsoc4) includes a conditional branch on `hexstring[i + 1]` being alphabetic or not. So it can even mispredict, which is probably slower than branchless if your data is uniformly distributed. (And BTW, compilers can't hoist `strlen` because `BYTE *` is probably `unsigned char*` and thus can alias anything, as can the input pointer. With `restrict`, GCC might hoist it for you... but doesn't https://godbolt.org/z/7ETnKW77z) – Peter Cordes Apr 12 '21 at 07:37
  • Also, don't call your output `hexBuff`. It's *not* pairs of base 16 digits anymore, that's the whole point. It's packed *binary*, or a byte stream. That's just even more confusing than usual. – Peter Cordes Apr 12 '21 at 07:39
  • You don't need `val += 0x10 * ...` You can just plug in the hex digit in the byte stream at the same position as it was in the char string. – Paul Ogilvie Apr 12 '21 at 07:43
  • The `std::map` idea is nice, but not very fast either. Constructing a `std::string` and then doing an alphabetic search is a tad slow. The lookup table really ought to be a `char nibbles[UCHAR_MAX]`. CPUs are very good at that sort of pointer arithmetic – MSalters Apr 12 '21 at 07:51
  • The standard implementation of `std::map<>` is a Red-Black tree, so searching that is already going to take multiple compares. Did you maybe want `std::unordered_map`? Also, equally or more importantly, `map<string, BYTE>` is pretty crazy. If you want 2-byte pairs of chars as keys, use `uint16_t` or maybe `std::pair<char, char>`, not a variable-length container like `std::string` that has to do a lot of work to handle the possibility of comparing different-length strings. – Peter Cordes Apr 12 '21 at 08:05
  • @MSalters. Yes, that's what I meant. The array should just contain the hex values indexed by ASCII character. And then you shift two chars at once to get a byte, e.g. `byte b = (nibbles[char0] << 4) | nibbles[char1];` – Emanuel P Apr 12 '21 at 08:07
  • @dAEMON9527 Once you removed the obvious, it may pay off to play with the optimization and target architecture options of the compiler you are using. – dxiv Apr 12 '21 at 08:11
  • I would use `strtoull` to process 16 input bytes at a time, producing 8 output bytes per call (a sketch of this idea appears after these comments). – user3386109 Apr 12 '21 at 08:16
  • So I posted a C answer but now the C tag is gone. Oh well, the most efficient C code is usually also the most efficient C++ code... except C++ might go muppet if you use designated initializers. – Lundin Apr 12 '21 at 08:23
  • @Lundin: That was my doing; since the question included `std::map` it wasn't a C question anymore. But since that invalidated an answer, I put it back, since I didn't end up tagging any more tags, like performance or micro-optimization. – Peter Cordes Apr 12 '21 at 09:30
  • x86 SIMD hex->binary: https://github.com/zbjornson/fast-hex, also SO [Converting a hex string to a byte array](https://stackoverflow.com/q/17261798) – Peter Cordes Apr 12 '21 at 09:30
  • @dAEMON9527: please don't edit an answer into the question. If you want to post your version, post it as an *answer*. (Where people can comment to point out that running the init code every time it's invoked makes it pointless to use a `static` table. Pick one or the other, not the worst of both worlds.) – Peter Cordes Apr 12 '21 at 18:15
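
A minimal sketch of user3386109's `strtoull` idea from the comments above, assuming valid hex input; the function name and the big-endian store loop are my own additions:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Convert 16 hex digits per strtoull() call, emitting 8 output bytes each time.
// Any trailing chunk of fewer than 16 digits would still need scalar handling.
static void HexString2BytesU64(const char* hexstr, size_t len, uint8_t* dst)
{
    char chunk[17];
    chunk[16] = '\0';
    for (size_t i = 0; i + 16 <= len; i += 16)
    {
        memcpy(chunk, hexstr + i, 16);
        uint64_t v = strtoull(chunk, NULL, 16);
        // Store big-endian so the output byte order matches the digit order.
        for (int b = 0; b < 8; b++)
            dst[i / 2 + b] = (uint8_t)(v >> (56 - 8 * b));
    }
}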

5 Answers

5

The standard way to create the most efficient code possible (at the cost of RAM/ROM) is to use look-up tables. Something like this:

static const uint_fast8_t LOOKUP [256] =
{
  ['0'] = 0x0, ['1'] = 0x1, ['2'] = 0x2, ['3'] = 0x3,
  ['4'] = 0x4, ['5'] = 0x5, ['6'] = 0x6, ['7'] = 0x7,
  ['8'] = 0x8, ['9'] = 0x9, ['A'] = 0xA, ['B'] = 0xB,
  ['C'] = 0xC, ['D'] = 0xD, ['E'] = 0xE, ['F'] = 0xF,
};

This sacrifices 256 bytes of read-only memory and in turn we don't have to do any form of arithmetic. The uint_fast8_t lets the compiler pick a larger type if it thinks that will help performance.

The full code would then be something like this:

void hexstr_to_bytes (const char* restrict hexstr, uint8_t* restrict dst)
{
  static const uint_fast8_t LOOKUP [256] =
  {
    ['0'] = 0x0, ['1'] = 0x1, ['2'] = 0x2, ['3'] = 0x3,
    ['4'] = 0x4, ['5'] = 0x5, ['6'] = 0x6, ['7'] = 0x7,
    ['8'] = 0x8, ['9'] = 0x9, ['A'] = 0xA, ['B'] = 0xB,
    ['C'] = 0xC, ['D'] = 0xD, ['E'] = 0xE, ['F'] = 0xF,
  };
  
  for(size_t i=0; hexstr[i]!='\0'; i+=2)
  {
    *dst = LOOKUP[ hexstr[i  ] ] << 4 |
           LOOKUP[ hexstr[i+1] ];
    dst++;
  }
}

This boils down to some ~10 instructions when tested on x86_64 (Godbolt). Branch-free apart from the loop condition. Notably there's no error checking whatsoever, so you'd have to ensure that the data is OK (and contains an even number of nibbles) elsewhere.

Test code:

#include <stdio.h>
#include <stdint.h>

void hexstr_to_bytes (const char* restrict hexstr, uint8_t* restrict dst)
{
  static const uint_fast8_t LOOKUP [256] =
  {
    ['0'] = 0x0, ['1'] = 0x1, ['2'] = 0x2, ['3'] = 0x3,
    ['4'] = 0x4, ['5'] = 0x5, ['6'] = 0x6, ['7'] = 0x7,
    ['8'] = 0x8, ['9'] = 0x9, ['A'] = 0xA, ['B'] = 0xB,
    ['C'] = 0xC, ['D'] = 0xD, ['E'] = 0xE, ['F'] = 0xF,
  };
  
  for(size_t i=0; hexstr[i]!='\0'; i+=2)
  {
    *dst = LOOKUP[ hexstr[i  ] ] << 4 |
           LOOKUP[ hexstr[i+1] ];
    dst++;
  }
}

int main (void)
{
  const char hexstr[] = "DEADBEEFC0FFEE";
  uint8_t bytes [(sizeof hexstr - 1)/2];
  hexstr_to_bytes(hexstr, bytes);
  
  for(size_t i=0; i<sizeof bytes; i++)
  {
    printf("%.2X ", bytes[i]);
  }
}
Lundin
  • When using look-up tables, he could also use a look-up table of 64K size and treat the string as an array of `uint16_t` (each `uint16_t` element representing two hex digits; a rough sketch appears after these comments). – Martin Rosenau Apr 12 '21 at 08:08
  • Another alternative would be to use two arrays; one with the value already shifted. This would spare a shift in the loop without using 64K of memory. – Emanuel P Apr 12 '21 at 08:11
  • @MartinRosenau Yeah, but then you have to generate that huge lookup table in advance, depending on your CPU endianness. 256 bytes is a reasonable compromise between speed and memory too. 64 KiB of memory sacrificed just to save a few CPU ticks might be questionable. – Lundin Apr 12 '21 at 08:12
  • @MartinRosenau: Yes you could, but would that be faster? Possibly; you'd have 16 or 32 "hot" cache lines, scattered with a 256-byte stride between them. Expensive to warm up, but if used for a huge buffer then would probably stay hot in cache enough to be a win for a big 1MiB buffer. Still, if you're going to that level of effort, you might just write SIMD intrinsics implementations for some ISAs you care about like x86 and ARM, and go even faster still, producing better than 1 byte per cycle of output, maybe 4B / c with SSE4 on a modern x86 (haven't fully thought through a design though.) – Peter Cordes Apr 12 '21 at 08:15
  • In one case you have to initialize 32 of 512 bytes in the table (if you want to save the multiplication by 16), and perform two look-ups and an addition for each output byte; in the other case you have to initialize 256 of 64K bytes. The OP seems to assume that there are no "invalid" characters in the input, so the corresponding bytes may be left uninitialized. And of course the last element in the look-up table being accessed is `17*'F'`, so the look-up table needs only to be `17*'F'+1` bytes long. If I understood the OP correctly, memory is not a problem. – Martin Rosenau Apr 12 '21 at 08:17
  • @PeterCordes That's interesting, I didn't consider cache lines much, but if you know that level of detail of the target system, then maybe it would pay to size the lookup table to suit a cache line? This could get cut down to 128 bytes, or even 70 bytes. And the `uint_fast8_t` could be ditched in favour of the exact-width `uint8_t`. – Lundin Apr 12 '21 at 08:18
  • On the other end of the CPU spectrum, you have low-end microcontrollers where this code will work fine too. They don't care much about alignment and 256 bytes of flash is OK to sacrifice in most cases. Mid-range MCUs might have flash wait states and then it may or may not be more efficient to upload the table into RAM first, depending on amount of data. – Lundin Apr 12 '21 at 08:19
  • @PeterCordes This is what I was also thinking about. However, because x86 PCs are not like home computers in the 1980s (every "Commodore 64" had exactly the same CPU, the same RAM, the same circuits...), the OP will have to test different approaches to check which one is the fastest one on his computer. My idea using `uint16_t` is just one more approach to be tested. – Martin Rosenau Apr 12 '21 at 08:22
  • Most CPUs have decently fast byte-load instructions, and most C libraries choose `uint_fast8_t` to be 1 byte. (Unlike glibc's silly choice of `uint_fast32_t` = `uintptr_t` on x86-64...). For performance, the size of the table only matters while you init it. Lines you never touch can stay cold, not polluting anything. One should normally tune for the no-error case, so that means only worrying about `0..9` and `A..F`. `'f' - '0'` is 54, so `c -= '0'` to shift the base of the table would put all the used entries in one 64B line. (instead of split between lo/hi 64 http://www.asciitable.com/) – Peter Cordes Apr 12 '21 at 08:23
  • @MartinRosenau: That's the thing, though: microbenchmarking *will* probably find the 65536 byte table faster because you use few enough lines from it for them to stay hot. But does that mean it's good for *part* of a larger program? Maybe not; getting the work done with fewer cache evictions (from touching less table memory) might be better. OTOH if you're looping over a 1MiB input and writing 512k output, you're blowing away L2 and L1d anyway, and about 32 more lines of L3 might not matter. Still, faster in a microbenchmark isn't a good test when you're trading off cache footprint vs. ALU. – Peter Cordes Apr 12 '21 at 08:28
  • @PeterCordes Yeah, though the OP didn't mention if it needs to handle lower case hex also. The table size can get fine-tuned a bit, though if we have to subtract an offset from the data we lose a tiny bit of performance. – Lundin Apr 12 '21 at 08:28
  • On a low-end microcontroller, you could store only `tableX[23]="0123...CDEF";`, which is only 23 bytes of flash. Then you use a linker symbol `table = tableX - N`, choosing N so that `&(table['0']) == &(tableX[0])`. This will cost you only 23 bytes. Using the `uint16_t` approach you would need 392 bytes if I didn't make a mistake. – Martin Rosenau Apr 12 '21 at 08:31
  • If you don't actually range-check, the subtraction only costs code-size in the addressing mode (for x86 at least). Or not at all if it gets folded in at the point where the static table address is calculated, i.e. load the address `lut-'0'` into a register, or in x86 use it directly as the 32-bit displacement part of an addressing mode. – Peter Cordes Apr 12 '21 at 08:33
  • @Lundin: you commented on Martin's now-deleted answer that const-correctness would let the compiler hoist `strlen`. That's not the case: https://godbolt.org/z/bE7dr3M9b. `const char *in` doesn't mean it's illegal for the caller to pass a `BYTE *out` that overlaps with it. It just means you'd have to cast away `const` if you wanted to use a pointer derived from `in` to store. What should matter is C99 `restrict`, which I also used, but to no avail. That *should* make it legal for GCC to hoist strlen, but it misses that optimization. – Peter Cordes Apr 12 '21 at 08:36
  • @MartinRosenau: How do you get only 392 bytes of LUT space for a pair of bytes? `"FF"` is `0x4646`, and `"00"` is `0x3030`. Those offsets are 5654 bytes apart, so that's your minimum table size handling only *upper*-case letters. `"ff"` is `0x6666`, for a table size of over 13.5 kiB. (Or are you still talking about which 64-byte chunks of that are actually hot in cache?) – Peter Cordes Apr 12 '21 at 08:44
  • @PeterCordes You are right. I did a simple calculation mistake. It takes 5654 bytes of Flash, not 392... – Martin Rosenau Apr 12 '21 at 09:12
  • Without a `(unsigned char)` cast, `LOOKUP[ hexstr[i] ]` risks UB when `hexstr[i] < 0` (see the hardened sketch after these comments). – chux - Reinstate Monica Apr 12 '21 at 10:25
  • `for(size_t i=0; hexstr[i]!='\0'; i+=2)` is UB with an odd-length string. (Now see the note in the text, but not in the code.) – chux - Reinstate Monica Apr 12 '21 at 10:27
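
A minimal sketch addressing both of chux's points, keeping the answer's table but casting the index through unsigned char and stopping before an odd trailing digit (which is ignored rather than read past):

#include <stddef.h>
#include <stdint.h>

void hexstr_to_bytes_safe (const char* restrict hexstr, uint8_t* restrict dst)
{
  static const uint8_t LOOKUP [256] =
  {
    ['0'] = 0x0, ['1'] = 0x1, ['2'] = 0x2, ['3'] = 0x3,
    ['4'] = 0x4, ['5'] = 0x5, ['6'] = 0x6, ['7'] = 0x7,
    ['8'] = 0x8, ['9'] = 0x9, ['A'] = 0xA, ['B'] = 0xB,
    ['C'] = 0xC, ['D'] = 0xD, ['E'] = 0xE, ['F'] = 0xF,
  };

  for(size_t i=0; hexstr[i]!='\0' && hexstr[i+1]!='\0'; i+=2)
  {
    // The unsigned char cast keeps a negative char from indexing out of bounds.
    *dst = LOOKUP[ (unsigned char)hexstr[i  ] ] << 4 |
           LOOKUP[ (unsigned char)hexstr[i+1] ];
    dst++;
  }
}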
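
Separately, a rough sketch of Martin Rosenau's 64 KiB-table idea from this thread, assuming a little-endian CPU and even-length, valid, upper-case input; all names here are mine, and init_wide_lookup must run once before any conversion:

#include <stdint.h>
#include <string.h>

static uint8_t WIDE_LOOKUP[1u << 16];  // 64 KiB, indexed by two raw input chars

static void init_wide_lookup(void)
{
    const char digits[] = "0123456789ABCDEF";
    for (int hi = 0; hi < 16; hi++)
        for (int lo = 0; lo < 16; lo++)
        {
            // Little-endian: the first character lands in the low byte of the key.
            uint16_t key = (uint16_t)((unsigned char)digits[lo] << 8 |
                                      (unsigned char)digits[hi]);
            WIDE_LOOKUP[key] = (uint8_t)(hi << 4 | lo);
        }
}

static void hexstr_to_bytes_wide(const char* hexstr, size_t len, uint8_t* dst)
{
    for (size_t i = 0; i < len; i += 2)
    {
        uint16_t key;
        memcpy(&key, hexstr + i, 2);  // avoids an unaligned uint16_t* access
        dst[i / 2] = WIDE_LOOKUP[key];
    }
}

Whether this beats the 256-byte table depends on cache behaviour, as the comments above discuss.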
3

when the input hex string is very big (such as 1000000 in length)

Actually, 1 meg isn't all that long for today's computers.

If you need to be able to handle bigger strings (think 10s of gigabytes), or even just a LOT of 1 meg strings, you can play with the SSE functions. While it will work for more modest requirements, the added complexity may not be worth the performance gain.

I'm on Windows, so I'm building with MSVC 2019: x64, optimizations enabled, and /arch:AVX2.

#define _CRT_SECURE_NO_WARNINGS
typedef unsigned char BYTE;

#include <stdio.h>
#include <memory.h>
#include <intrin.h>
#include <immintrin.h>
#include <stdint.h>

static const uint_fast8_t LOOKUP[256] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f };

void HexString2Bytes(/*IN*/ const char* hexstr, /*OUT*/  BYTE* dst)
{
    for (size_t i = 0; hexstr[i] != '\0'; i += 2)
    {
        *dst = LOOKUP[hexstr[i]] << 4 |
            LOOKUP[hexstr[i + 1]];
        dst++;
    }
}

void HexString2BytesSSE(const char* ptrin, char *ptrout, size_t bytes)
{
    register const __m256i mmZeros = _mm256_set1_epi64x(0x3030303030303030ll);
    register const __m256i mmNines = _mm256_set1_epi64x(0x0909090909090909ll);
    register const __m256i mmSevens = _mm256_set1_epi64x(0x0707070707070707ll);
    register const __m256i mmShuffle = _mm256_set_epi64x(-1, 0x0f0d0b0907050301, -1, 0x0f0d0b0907050301);

    //============

    const __m256i* in = (const __m256i*)ptrin;
    __m128i* out = (__m128i *)ptrout;
    size_t lines = bytes / 32;

    for (size_t x = 0; x < lines; x++)
    {
        // Read 32 bytes
        __m256i AllBytes = _mm256_load_si256(in);

        // subtract '0' from every byte
        AllBytes = _mm256_sub_epi8(AllBytes, mmZeros);

        // Look for bytes that are 'A' or greater
        const __m256i mask = _mm256_cmpgt_epi8(AllBytes, mmNines);

        // Assign 7 to every byte greater than 'A'
        const __m256i maskedvalues = _mm256_and_si256(mask, mmSevens);

        // Subtract 7 from every byte greater than 'A'
        AllBytes = _mm256_sub_epi8(AllBytes, maskedvalues);

        // At this point, every byte in AllBytes represents a nibble, with
        // the even bytes being the upper nibble.

        // Make a copy and shift it left 4 bits to shift the nibble, plus
        // 8 bits to align the nibbles.
        __m256i UpperNibbles = _mm256_slli_epi64(AllBytes, 4 + 8);

        // Combine the nibbles
        AllBytes = _mm256_or_si256(AllBytes, UpperNibbles);

        // At this point, the odd numbered bytes in AllBytes is the output we want.

        // Move the bytes to be contiguous.  Note that you can only move
        // bytes within their 128bit lane.
        const __m256i ymm1 = _mm256_shuffle_epi8(AllBytes, mmShuffle);

        // Move the bytes from the upper lane down next to the lower.
        const __m256i ymm2 = _mm256_permute4x64_epi64(ymm1, 8);

        // Pull out the lowest 16 bytes
        *out = _mm256_extracti128_si256(ymm2, 0);

        in++;
        out++;
    }
}

int main()
{
    FILE* f = fopen("test.txt", "rb");

    fseek(f, 0, SEEK_END);
    size_t fsize = _ftelli64(f);
    rewind(f);

    // HexString2Bytes requires trailing null
    char* InBuff = (char* )_aligned_malloc(fsize + 1, 32);

    size_t t = fread(InBuff, 1, fsize, f);
    fclose(f);

    InBuff[fsize] = 0;

    char* OutBuff = (char*)malloc(fsize / 2);
    char* OutBuff2 = nullptr;

    putchar('A');

    for (int x = 0; x < 16; x++)
    {
        HexString2BytesSSE(InBuff, OutBuff, fsize);
#if 0
        if (OutBuff2 == nullptr)
        {
            OutBuff2 = (char*)malloc(fsize / 2);
        }
        HexString2Bytes(InBuff, (BYTE*)OutBuff2);
        if (memcmp(OutBuff, OutBuff2, fsize / 32) != 0)
            printf("oops\n");
        putchar('.');
#endif
    }

    putchar('B');

    if (OutBuff2 != nullptr)
        free(OutBuff2);
    free(OutBuff);
    _aligned_free(InBuff);
}

A couple of things to notice:

  • There is no error handling here. I don't check for out of memory, or file read errors. I don't even check for invalid characters in the input stream or lower case hex digits.
  • This code assumes that the size of the string is available without having to walk the string (_ftelli64 in this case). If you need to walk the string byte-by-byte to get its length (a la strlen), you've probably lost the benefit here.
  • I've kept HexString2Bytes, so you can compare the outputs from my code vs yours to makes sure I'm converting correctly.
  • HexString2BytesSSE assumes the number of bytes in the string is evenly divisible by 32 (a questionable assumption). However, reworking it to call HexString2Bytes for the last (at most) 31 bytes is pretty trivial (see the sketch after this list), and isn't going to impact performance much.
  • My test.txt is 2 gigs long, and this code runs it 16 times. That's about what I need for the differences to become readily visible.
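
One possible shape for that tail handling, as a sketch only (the wrapper name is mine; it assumes the input is NUL-terminated, as in the test code above):

// A sketch only: SIMD for the largest multiple-of-32 prefix of hex chars,
// then the scalar routine for the (at most 31-char) NUL-terminated tail.
void HexString2BytesFull(const char* in, char* out, size_t bytes)
{
    size_t bulk = bytes & ~(size_t)31;
    HexString2BytesSSE(in, out, bulk);
    if (bytes > bulk)
        HexString2Bytes(in + bulk, (BYTE*)(out + bulk / 2));
}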

For people who want to kibitz (because of course you do), here's the assembler output for the innermost loop along with some comments:

10F0  lea         rax,[rax+10h]   ; Output pointer
10F4  vmovdqu     ymm0,ymmword ptr [rcx] ; Input data
10F8  lea         rcx,[rcx+20h]

; Convert characters to nibbles
10FC  vpsubb      ymm2,ymm0,ymm4  ; Subtract 0x30 from all characters
1100  vpcmpgtb    ymm1,ymm2,ymm5  ; Find all characters 'A' and greater
1104  vpand       ymm0,ymm1,ymm6  ; Prepare to subtract 7 from all the 'A' 
1108  vpsubb      ymm2,ymm2,ymm0  ; Adjust all the 'A'

; Combine the nibbles to form bytes
110C  vpsllq      ymm1,ymm2,0Ch   ; Shift nibble up + align nibbles
1111  vpor        ymm0,ymm1,ymm2  ; Combine lower and upper nibbles

; Coalesce the odd numbered bytes
1115  vpshufb     ymm2,ymm0,ymm7

; Since vpshufb can't cross lanes, use vpermq to
; put all 16 bytes together
111A  vpermq      ymm3,ymm2,8

1120  vmovdqu     xmmword ptr [rax-10h],xmm3
1125  sub         rdx,1
1129  jne         main+0F0h (10F0h)

While your final code is almost certainly sufficient for your needs, I thought this might be interesting for you (or future SO users).

David Wohlferd
  • Another SIMD implementation (https://github.com/zbjornson/fast-hex), uses `vpmaddubsw` (`_mm256_maddubs_epi16`) to horizontally combine nibbles from adjacent bytes, instead of shift/OR. That might nicely set up for `vpackusswb` + `vpermq` to combine two 32-byte vectors down to one. Or after your `vpshufb`, you could use `vpunpcklqdq` + `vpermq` to create a 32-byte vector of output with 4 total shuffles. (Hrm, that's still 2 shuffles per 128-bit of output, only saving 1 store.) – Peter Cordes Apr 19 '21 at 21:36
  • Repeated runs over a 1MiB buffer should benefit even more from SIMD than larger buffers. Fitting in L3 cache, or even L2, can keep SIMD instructions fed with enough data to let them run at full speed, showing off the full advantage over scalar byte-at-a-time lookup tables. – Peter Cordes Apr 19 '21 at 21:38
  • Oh actually, ZachB's fast-hex is using `pmaddubsw` to implement decimal value = `9 * (ascii >> 6) + (ascii & 0xf);` per each hex digit. That's a clever mapping function for scalar, but it's probably not worth the cost of unpacking bytes to words, vs. what you're doing with a normal compare. What I suggested in my first comment would be a drop-in replacement for your way that should be better, saving 1 uop at the cost of latency which OoO exec can hide. – Peter Cordes Apr 19 '21 at 21:50
  • @PeterCordes - So I tried pulling zbjornson's code into my sample. While it's giving me the same answer, it's a (tiny) bit slower than mine (yay!). As for replacing vpsllq+vpor with maddubs, it is indeed a (very tiny) bit faster yet. I tried playing with `_mm256_unpacklo_epi64(ymm1, ymm1)`, but I'm not seeing how this would work. – David Wohlferd Apr 20 '21 at 04:34
  • You might see bigger differences for more repeat-loop iterations around a smaller conversion (that fits in L2 or even L1d). Re: `_mm256_unpacklo_epi64` - not with the same arg twice, I meant with two *different* args, as a way to combine low qwords from 2 vectors, then `vpermq` to make a full 32-byte vector. But like I said, that needs a `vpshufb` for each input first so it's worse than using `vpackuswb` + `vpermq` to combine vectors, especially if there isn't high garbage in each word that needs clearing after `vpmaddubsw` – Peter Cordes Apr 20 '21 at 04:39
  • @PeterCordes "L2" - Yeah, but that'd just be silly(er). I don't see any way to get 2 vectors here while maintaining the sequence, so I'm still not seeing a way to use unpack. But: What about using _mm256_maskstore_epi64 (combining `vpermq` + `vmovdqu` to 1 instruction)? If I shuffle the bytes so that they're stored in the *middle* 128 bits, then I can set the mask to skip writing the first and last i64. It does mean passing a pointer that is *before* the start of the buffer, then setting the mask says not to write there. The Intel docs say this should be legal, but it feels like UB. – David Wohlferd Apr 20 '21 at 07:39
  • Real use-cases can certainly include converting 32-hex-digit MD5 hashes or other relatively short things; it's hardly silly to consider performance for functions that get called *many* times for shorter buffers over the course of a program. – Peter Cordes Apr 20 '21 at 07:44
  • I was suggesting shuffling together 2 vectors with an in-lane shuffle (because AVX1/2 suck and don't have AVX-512's `vpermt2q`), then you use `vpermq` on the result to get all the 8-byte chunks into the right final order. Yeah, a masked store is a neat idea; it's technically UB in ISO C to even form a pointer that doesn't point within (or to one-past-end) of an array, but in practice real-world implementations that support Intel intrinsics de-facto define the behaviour. Fault-suppression of masked stores makes it safe, although it can be slow if fault-suppression is actually needed. – Peter Cordes Apr 20 '21 at 07:49
  • Unfortunately `vpmaskmovq` stores are quite slow on Zen/Zen2, so not a great plan for portable performance. (Search for AVX2: `vpmaskmov (M` on https://uops.info/)). On SKX / ICL they cost 3 uops for the front-end: p0 ALU, plus store-address and store-data uops (which don't micro-fuse). Only AVX-512 has single-uop masked stores. It avoids the port 5 shuffle bottleneck, but is worse than shuffling for front-end throughput. (And even then, if your output pointer was at the start of a page, and the prev page unmapped, then hundreds of cycles microcode assist IIRC.) – Peter Cordes Apr 20 '21 at 07:53
  • I'm on Kaby Lake myself, and vpmaskmovq works (slightly) better than vpermq + vmovdqu. Per the Intel docs: *For example, no faults will be detected if the mask bits are all zero*, so there aren't supposed to be any faults if used correctly, even with "out of bounds" addresses. It's not clear what OP's HW requirements are (probably just a school project), so I may just leave what's in the answer alone. If some future user needs even better performance, look at replacing vpsllq+vpor with maddubs, and vpermq + vmovdqu with vpmaskmovq. – David Wohlferd Apr 20 '21 at 08:11
1

Maybe a switch is (marginally) faster

switch (hexchar) {
    default: /* error */; break;
    case '0': nibble = 0; break;
    case '1': nibble = 1; break;
    //...
    case 'F': case 'f': nibble = 15; break;
}
pmg
  • Not my downvote, but this is quite possibly slower. Modern CPUs (i.e. this century) are sensitive to branching, so much so that good compilers will try to spot switches like these and replace them with lookup tables. But why rely on compilers if you can just write that lookup table? – MSalters Apr 12 '21 at 07:46
  • @MSalters: Indeed, this would be garbage if it compiled to a jump table like I was afraid it would. Fortunately GCC `-O2` *is* smart enough to compile it to a lookup table (with a range-check first), at least in the version I wrote with only upper-case letters handled like the source, although that probably just makes the table smaller. (Sometime between GCC 4.1 and GCC 4.4, according to Godbolt. https://godbolt.org/z/rfb9x51Gf). For code-style, yeah, I'd agree writing a LUT manually would be better, like `c -= '0'` / `if (c <= 'f' - '0') ...`, especially if you make ASCII assumptions. – Peter Cordes Apr 12 '21 at 07:52
  • If you try to avoid ASCII assumptions, you might use an array initializer using designated-initializer syntax like `uint8_t lut[] = { 0, ['1' - '0'] = 1, ..., ['f' - '0'] = 15 };` (a fuller sketch appears after these comments). Which is maybe even more involved than a switch, so this is really not that bad *if* your compiler is smart enough; otherwise expect a branch mispredict on most characters unless the data has simple patterns. – Peter Cordes Apr 12 '21 at 07:56
  • BTW, previous Godbolt link was using `unsigned int` output: I forgot to put it back to `unsigned char` after testing something. https://godbolt.org/z/an3hvecT4 is GCC10.3 -O3 with this switch turned into a data lookup table. – Peter Cordes Apr 12 '21 at 08:39
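
A fuller sketch of the rebased table Peter Cordes describes above, assuming ASCII; the names are mine, and non-hex characters that happen to fall inside the range decode as 0, just like the 256-entry tables elsewhere on this page:

#include <stdint.h>

/* Rebased on '0' so every used entry fits in one 64-byte cache line
   ('f' - '0' is 54). */
static const uint8_t LUT['f' - '0' + 1] = {
    ['0' - '0'] = 0,  ['1' - '0'] = 1,  ['2' - '0'] = 2,  ['3' - '0'] = 3,
    ['4' - '0'] = 4,  ['5' - '0'] = 5,  ['6' - '0'] = 6,  ['7' - '0'] = 7,
    ['8' - '0'] = 8,  ['9' - '0'] = 9,
    ['A' - '0'] = 10, ['B' - '0'] = 11, ['C' - '0'] = 12, ['D' - '0'] = 13,
    ['E' - '0'] = 14, ['F' - '0'] = 15,
    ['a' - '0'] = 10, ['b' - '0'] = 11, ['c' - '0'] = 12, ['d' - '0'] = 13,
    ['e' - '0'] = 14, ['f' - '0'] = 15,
};

/* Returns 0-15, or -1 for characters outside the table's range. */
static int nibble(char c)
{
    unsigned idx = (unsigned char)c - '0';
    return idx <= 'f' - '0' ? LUT[idx] : -1;
}
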
1

Boost already has an unhex algorithm implementation; you may compare its benchmark result as a baseline:

unhex ( "616263646566", out )  --> "abcdef"
unhex ( "3332", out )          --> "32"

If your string is very huge, then you may consider a parallel approach (using thread-based frameworks like OpenMP, or the parallel STL).
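
For reference, a minimal usage sketch (the exception names are per my reading of Boost.Algorithm's hex.hpp; treat them as an assumption):

#include <boost/algorithm/hex.hpp>
#include <cstdio>
#include <iterator>
#include <string>

int main()
{
    std::string hexstr = "1234ABCDEEFF0505DDCC";
    std::string bytes;                 // packed binary output
    bytes.reserve(hexstr.size() / 2);

    // Throws boost::algorithm::non_hex_input on a bad character and
    // boost::algorithm::not_enough_input on an odd-length string.
    boost::algorithm::unhex(hexstr, std::back_inserter(bytes));

    for (unsigned char b : bytes)
        printf("%02X ", b);
}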

prehistoricpenguin
  • The OP's string is only about 1MiB in length; barely worth starting threads for unless you have lots of this to do (on multiple buffers / repeatedly). A well-tuned unhex, especially with SIMD intrinsics, should produce multiple bytes per clock cycle of output. – Peter Cordes Apr 12 '21 at 08:07
  • @PeterCordes This is true; for string sizes around 1 MB it's not worth using threads, which is why I said for very huge strings. SIMD is always the first choice for string processing. – prehistoricpenguin Apr 12 '21 at 08:11
0

Direct answer: I do not know the perfect algorithm. For x86 asm, per the Intel performance guide:

  • Unwind the loop.
  • Try the XLAT instruction (two different tables needed); it eliminates conditional branches.
  • Modify the call interface to include an explicit block length, since the caller should know the string length (this eliminates strlen()). Change the return type from void to int so you can pass error/success codes and the length processed, and handle null-length input. (A sketch of these interface changes follows this list.)
  • Test that the output array is large enough. Minor bug: remember that an odd length divided by two is rounded down, so for an odd-length source, initialize the last byte of output (only).
  • The advantage of working in blocks is that the practical limit becomes the OS file-size limit.
  • Try setting thread affinity.
  • I suspect the limit on performance is ultimately the RAM-to-CPU bus, depending. If so, try to do data fetches and stores at the largest bit width the RAM supports.
  • Bench-test with no optimization and at higher levels if coding in C or C++.
  • Test validity by doing the reverse process followed by a byte-for-byte compare (there is a non-zero chance that CRC-32 misses).
  • Possible problem with PBYTE: use the native C unsigned char type.
  • There is a to-be-tested trade-off between code size and the number of L1 cache misses vs. how much the loop is unwound.
  • In asm, use cx/ecx/rcx to count down (rather than the usual count up and compare).
  • SIMD is also possible, assuming CPU support.
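
A minimal sketch, under my own naming, of the interface changes suggested above (explicit length, an int-style return for error codes, and defined odd-length handling); the digit decoder uses plain range checks rather than a table:

#include <stddef.h>

typedef unsigned char BYTE;

/* Decode one hex digit, or return -1 if c is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Returns the number of bytes written, or -1 on invalid input.
   The caller passes the length explicitly, so there is no strlen() pass. */
long HexString2Bytes_checked(const char* src, size_t len, BYTE* dst)
{
    if (len == 0) return 0;               /* handle null-length input */
    size_t out = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
    {
        int hi = hex_digit(src[i]);
        int lo = hex_digit(src[i + 1]);
        if (hi < 0 || lo < 0) return -1;  /* error code instead of void */
        dst[out++] = (BYTE)(hi << 4 | lo);
    }
    if (len % 2)                          /* odd length rounds down... */
    {
        int hi = hex_digit(src[len - 1]);
        if (hi < 0) return -1;
        dst[out++] = (BYTE)(hi << 4);     /* ...so set the last byte explicitly */
    }
    return (long)out;
}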

sys101
  • You're unlikely to bottleneck on memory bandwidth with scalar byte-at-a-time code, even using a lookup table. You're not going to go faster than 2 bytes of input plus 1 byte of output per 2 clock cycles, so that's only about 1.5 times clock speed GB/sec, or 6GB/s for a 4GHz CPU. That might be close to the single-core bandwidth available on a big Xeon (especially while other cores are also busy), but if it's the only thing on a modern desktop, memory bandwidth is more like 40GB/s. – Peter Cordes Apr 16 '21 at 04:16
  • `xlat` is useless here; it costs 2 uops on AMD (including Zen), 3 on Intel Sandybridge-family (https://uops.info/, https://agner.org/optimize/), so it's no cheaper than `movzx edx, al` / `mov al, [rbx + rdx]` (or whatever other temp reg instead of RDX). And doing it manually, you can often avoid that initial `movzx` by arranging for your byte value to already be zero-extended, e.g. from a `movzx` load from memory, or a right shift. – Peter Cordes Apr 16 '21 at 04:18
  • All true, EXCEPT the x86 tag implies everything from the 8088 through whatever is current from Intel, AMD, Cyrix, so XLAT may be a useful suggestion for the OP. As for memory bandwidth, the box I am writing this on limits at 1.33 GT/s x 16 bytes = ~20 GB/s. But what if you fetch and store a byte at a time? The rest of my answer is directly from the Intel performance manual, except the bit about how much loop to unwind. Unfortunately all this is mostly beyond the OP's control if using an optimizing C or C++ compiler. Also, 3 of the 6 modes (5 documented + "Unreal") are still 8088, and we don't know which the OP is targeting... – sys101 Apr 17 '21 at 08:05
  • The querent mentions their CPU is an i7-8700,3.2GHz, so the implication is that they're talking about modern x86. That makes xlat non-useful in any mode. And unless they say otherwise (e.g. no [x86-16] tag), a normal person should assume they mean 64-bit mode with a modern C or C++ compiler, possibly with portability to 32-bit mode. Also, "when the input hex string is very big (such as 1000000 length)" also rules out real mode (and thus 8086 / 8088 / 186), because that would fill the entire 1MiB physical address space with just the input. – Peter Cordes Apr 17 '21 at 08:23
  • *But if fetch store byte at a time???* - Then cache does its job, and lets you do up to 2 loads + 1 store per clock to L1d cache (since Haswell or Zen2 when AGU limits are included), only requiring hardware prefetch to keep up with touching a new 64-byte line, for a total bandwidth of maybe something like 1.5x clock speed. – Peter Cordes Apr 17 '21 at 08:25
  • There's some good stuff in your answer, that's why I didn't downvote. It would be good to [edit] to turn it into paragraphs, and if you *want* to say something about tuning for 8086 / 286 then put that in a separate section with a `### heading` that points that out. – Peter Cordes Apr 17 '21 at 08:28