
I am calculating the Shannon entropy of an array of bytes, which requires computing log2 of some values.

Here is the relevant part of the code:

    for (i = 0; i < ENTROPY_ARRAY_SIZE; i++) {
        if (entropy[i] == 0)
            continue;
        double p = (double)entropy[i] / data_len;
        H -= p * log2(p);
    }

We know that Intel supports two x87 assembly instructions, FYL2X and FYL2XP1, to perform the log2 operation faster. The problem is that even with GCC optimizations enabled, the generated assembly for log2 is not based on these Intel instructions. Is there a way to force GCC to use specific instructions for this part of the code? (I checked, and the CPU supports these instructions.)

phuclv
hpirlo
  • With [GCC](https://gcc.gnu.org/) you can use [extended `asm`](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html). You might be interested by [GNU autoconf](https://www.gnu.org/software/autoconf/). You should check that using `FYL2X` is faster than what GCC generates with `gcc -march=native -O3 -ffast-math` and spend time reading about [GCC command options](https://gcc.gnu.org/onlinedocs/gcc/Invoking-GCC.html) – Basile Starynkevitch Jun 26 '22 at 08:40
  • I guess that `log2` is part of `libm`. See [log2(3)](https://man7.org/linux/man-pages/man3/log2.3.html). It could happen that [GNU libc](https://www.gnu.org/software/libc/) or [musl libc](https://musl.libc.org) uses that machine instruction – Basile Starynkevitch Jun 26 '22 at 08:45
  • Try using `__builtin_log2`. I believe that the machine instruction could be slower than the hand-tuned routine in libm – Basile Starynkevitch Jun 26 '22 at 08:53
  • Who tells you that `FYL2X` and `FYL2XP1` are faster? The Intel math library doesn't even use them and instead has customized math functions using SIMD, so calling math functions on an array is much faster than scalar instructions like that – phuclv Jun 26 '22 at 10:22
  • @phuclv, what is Intel math library? – Kolodez Jun 26 '22 at 11:15
  • Probably this: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html – Shawn Jun 26 '22 at 11:17
  • FYL2X is not Intel specific, and it's one of the slowest instructions, up to hundreds of cycles on some implementations – harold Jun 26 '22 at 13:20
  • I should say that I tested with `-mavx2 -O3 -ffast-math -ftree-vectorize`, and some other parts of the code were vectorized with SIMD instructions, but `log2` is still not compiled to SIMD or `FYL2X`. – hpirlo Jun 26 '22 at 13:23
  • @hpirlo that's because the glibc `libm` doesn't have vectorized math for log. You have to use another vectorized math library like `libmvec` or Intel SVML. The Intel libraries are the most efficient ones, but they're not free; you have to buy them – phuclv Jun 26 '22 at 13:32
  • @Kolodez [Intel SVML](https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-short-vector-math-library-ops.html), specifically [`_mm256_log2_pd`](https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-short-vector-math-library-ops/intrinsics-for-logarithmic-operations/mm-log2-pd-mm256-log2-pd.html) and `_mm512_logb_pd` – phuclv Jun 26 '22 at 13:37
  • @Shawn that's one of the special math libraries from Intel. I don't remember if it has vectorized math or not, but Intel SVML definitely has. For more vectorized math libraries see https://stackoverflow.com/a/36637424/995714, [SIMD math libraries for SSE and AVX](https://stackoverflow.com/q/15723995/995714), [Mathematical functions for SIMD registers](https://stackoverflow.com/q/40475140/995714), [Is it possible to get multiple sines in AVX/SSE?](https://stackoverflow.com/q/27933355/995714) – phuclv Jun 26 '22 at 13:47
  • see also [Efficient implementation of `log2(__m256d)` in AVX2](https://stackoverflow.com/q/45770089/995714) – phuclv Jun 26 '22 at 13:49
  • It's somewhat counterintuitive, but just because there is a dedicated instruction for an operation, does not mean it is the fastest way to perform that operation. The x87 transcendental functions are a notorious example; they are not used much by modern software, so CPU vendors do not try to optimize them, so software has even less reason to use them, and the vicious cycle continues. They still have to exist for compatibility, but may very well be much slower than a software implementation. – Nate Eldredge Jun 26 '22 at 15:26
  • @phuclv I don't think I'd ever considered thinking of compiler intrinsics as a library. (And the MKL most definitely has vectorized routines; that's its whole point.) – Shawn Jun 26 '22 at 18:26
  • @Shawn: Intrinsics like `_mm_add_pd` are true intrinsics, wrapping the operation of a single CPU instruction. Things like `_mm256_log2_pd` and `_mm256_sin_pd` are really just library function calls that take a `__m256d` arg. Intel conflates things by listing SVML functions in their intrinsics guide. Another definition for "intrinsic" is something the compiler knows about and can do constant-propagation through, so if that's the case for `_mm256_sin_pd`, it's not totally crazy to call it an intrinsic. Otherwise just a vector math library function. – Peter Cordes Jun 26 '22 at 19:27

0 Answers