Do Intel CPUs support TBM (Trailing Bit Manipulation) instructions?

I am trying to use `bextr` with an immediate operand on an Intel CPU, and I get a SIGILL even though the CPUID bit for TBM is set.

Does this mean that Intel CPUs do not support TBM?

What is the proper way to check for TBM support? Should I only check this bit if the vendor ID is `AuthenticAMD`?

gnzlbg
  • Even AMD has dropped support – harold Nov 07 '17 at 15:20
  • `bextr imm` is AMD-only. Intel only supports BMI/BMI2, which has some overlap with TBM. – Peter Cordes Nov 07 '17 at 15:43
  • Oops, [TBM is totally separate from BMI1](https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#TBM_.28Trailing_Bit_Manipulation.29). AMD describes Piledriver and later BD-family CPUs as supporting BMI1 and TBM. And for completeness, [`bextr r,r/m,r`](http://felixcloutier.com/x86/BEXTR.html) is part of BMI1. But unfortunately current Intel CPUs run it as 2 uops, so it's usually not worth using even in loops where you could set up a register constant. `rorx` / `and` can usually get the same job done. Too bad x86 doesn't have [`rlwinm`](https://stackoverflow.com/q/30896622/224132) – Peter Cordes Nov 07 '17 at 17:48
  • @peter - `pext` can get the job done in 1 uop if you can set up the extraction mask outside the loop. The downside is 3 cycles latency and it only runs on one port. It's interesting that such a general instruction runs at fewer uops than the much more constrained `bextr`. – BeeOnRope Nov 07 '17 at 20:57
  • @BeeOnRope: Oh right, I think you pointed this out once before. But still, PowerPC's `rlwinm` is totally the swiss-army knife of bitfield insert/extract instructions. Rotate and then mask off all bits outside a range (which doesn't have to be at the bottom of the register), when the rotate count and start/end of the bit range are all immediate operands. (Of course, if Intel CPUs implement `bextr` as 2 uops, they would probably also implement `rlwinm` as `rorx` + `and`.) – Peter Cordes Nov 07 '17 at 21:01
  • @PeterCordes yeah, I probably did: after all it's tied with its twin `pdep` for my favorite x86 scalar instruction. About what's more general, I still think `pext` takes the cake. Yes, it can't get results except aligned to the bottom of the register, but it can extract _multiple_ fields at once and align them in a lot of different ways, and if you throw in a following `pdep` you can align them relatively _any_ way you want as long as their relative order is preserved. Another way to look at it is how hard it is to emulate one with the other... – BeeOnRope Nov 08 '17 at 05:56
  • @BeeOnRope: Yeah, `pext`/`pdep` are pretty fantastic. But they're so powerful that they're very slow on CPUs without dedicated HW for them (AMD Ryzen). And even on Intel they're 3c latency, and limited throughput. Surprisingly, though, KNL has fast `pdep`/`pext` (same perf as Skylake). I guess it needs that kind of HW for `vcompress*`, but the width / granularity are different. – Peter Cordes Nov 08 '17 at 08:34
  • @PeterCordes - you can read about the hardware design of such instructions [here](http://palms.ee.princeton.edu/PALMSopen/hilewitz06FastBitCompression.pdf). Indeed the most expensive part seems to be computing the control lines implied by the mask using a parallel prefix sum and so that part is actually worse for a 64-bit bit-granular compress compared to say a 64-byte wide `WORD`-granular operation. Of course the butterfly network itself needs much less space, I guess. Worth noting that Intel implemented the hardest variant, called "dynamic" in the paper. – BeeOnRope Nov 08 '17 at 19:21
  • It is definitely too bad AMD didn't implement them "in hardware" so to speak. The terrible performance makes them essentially unusable if you are targeting AMD: you just skip them or make a separate AMD-only codepath (or just let AMD use the non-BMI2 codepath if you have one). I guess AMD's hand was a bit forced: they implement BMI2 very efficiently for the remainder of the instructions (notably better than Intel in general: most instructions execute 4/cycle), and there is no separate flag for `pdep` or `pext` so not supporting them wasn't really an option. – BeeOnRope Nov 08 '17 at 19:25
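
For readers following the `bextr` / `pext` discussion in the comments above, here is a rough portable C sketch of what the two operations compute. It is an illustration only, not how any CPU implements them, and the names `extract_bits`, `pext_soft`, and `check_examples` are hypothetical. `extract_bits` does the shift-and-mask job of `bextr` (or of `rorx` followed by `and`); `pext_soft` gathers the bits selected by a mask into the low bits of the result, one mask bit per loop iteration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: return `len` bits of `src` starting at bit `start`,
   i.e. the result a bextr-style field extract would produce. */
uint64_t extract_bits(uint64_t src, unsigned start, unsigned len) {
    if (start >= 64 || len == 0) return 0;
    uint64_t shifted = src >> start;
    if (len >= 64) return shifted;            /* avoid an undefined 64-bit shift */
    return shifted & ((UINT64_C(1) << len) - 1);
}

/* Hypothetical software stand-in for pext: pack the bits of `src` that are
   selected by `mask` into the low bits of the result, keeping their order. */
uint64_t pext_soft(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (unsigned out = 0; mask != 0; out++) {
        uint64_t lowest = mask & (~mask + 1); /* isolate the lowest set mask bit */
        if (src & lowest) result |= UINT64_C(1) << out;
        mask &= mask - 1;                     /* clear that mask bit */
    }
    return result;
}

/* Small usage example: a contiguous mask makes pext behave like a field extract. */
void check_examples(void) {
    assert(extract_bits(0xABCD, 4, 8) == 0xBC);
    assert(pext_soft(0xABCD, 0x0FF0) == 0xBC);
    assert(pext_soft(0xB2, 0xC3) == 0xA);     /* two separate 2-bit fields at once */
}
```

The last assertion shows the point made about `pext` being more general: a single call extracts several non-adjacent fields, which a single shift-and-mask cannot do.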

1 Answer

The Intel instruction set reference (October 2017 version) does not list a form of `bextr` with an immediate operand, and it makes no mention of a TBM flag in CPUID. In the AMD specification, TBM is bit 21 of ECX for CPUID function 0x80000001, which Intel lists as reserved. So it looks like you will indeed have to check the vendor ID.

Out of curiosity, which Intel CPU did you try this on that returns a 1 for the reserved TBM bit in CPUID?
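
A minimal sketch of such a check in C, assuming GCC or Clang on x86 (`<cpuid.h>` and `__get_cpuid` are compiler-provided helpers, not something from the question). The function names `ecx_has_tbm`, `vendor_is_amd`, `cpu_has_tbm`, and `check_decoding` are made up for illustration. Note that the TBM flag lives in the extended leaf 0x80000001; bit 21 of ECX in the basic leaf 1 means x2APIC instead, so reading the wrong leaf can report a false positive on Intel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* assumption: GCC/Clang; MSVC would use __cpuid from <intrin.h> */
#endif

/* TBM flag: CPUID leaf 0x80000001, ECX bit 21 (per the AMD specification). */
#define TBM_BIT (UINT32_C(1) << 21)

/* Decode the TBM flag from the ECX value returned by leaf 0x80000001. */
bool ecx_has_tbm(uint32_t ext_ecx) {
    return (ext_ecx & TBM_BIT) != 0;
}

/* Leaf 0 returns the vendor string in EBX, EDX, ECX (in that order). */
bool vendor_is_amd(uint32_t ebx, uint32_t edx, uint32_t ecx) {
    char vendor[13];
    memcpy(vendor, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
    return strcmp(vendor, "AuthenticAMD") == 0;
}

/* Query the running CPU; reports false on non-x86 builds or if a leaf is missing. */
bool cpu_has_tbm(void) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx)) return false;
    if (!vendor_is_amd(ebx, edx, ecx)) return false;
    if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) return false;
    return ecx_has_tbm(ecx);
#else
    return false;
#endif
}

/* Usage example for the decoding helpers with fixed register values. */
void check_decoding(void) {
    assert(!ecx_has_tbm(0x121));              /* ECX value quoted in the comments for an i7-6700k */
    assert(ecx_has_tbm(UINT32_C(1) << 21));
    assert(vendor_is_amd(0x68747541, 0x69746e65, 0x444d4163)); /* "Auth" "enti" "cAMD" */
}
```

Keeping the register decoding separate from the `cpuid` call makes the logic testable on any machine. Only `cpu_has_tbm()` depends on the hardware it runs on.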

Jester
  • An `Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz`. – gnzlbg Nov 07 '17 at 15:00
  • @gnzlbg: On an i7-6700k, for `eax=0x80000001 / CPUID` I get `ecx=0x121` (which doesn't have bit 21 set). In order from low to high bits, [that indicates LAHF in long mode, LZCNT, and PREFETCHW.](http://www.sandpile.org/x86/cpuid.htm#level_8000_0001h). I'm surprised you're getting bit 21 set. Sandpile doesn't mention it being used for something else on Intel. – Peter Cordes Nov 07 '17 at 18:14
  • @gnzlbg: Are you sure you're using CPUID correctly? Maybe try a stand-alone CPUID-dumping program like http://www.etallen.com/cpuid.html to double-check. e.g. run `cpuid -l 0x80000001 -r` to get a raw dump. – Peter Cordes Nov 07 '17 at 18:17
  • @PeterCordes indeed, I was checking processor info instead of extended processor info... – gnzlbg Nov 09 '17 at 11:48