3

I watched a youtube video on the Top 10 Craziest Assembly Language Instructions and some of these instructions have no obvious application to me. What's the point of something like PEXT, which takes only the bits from the second argument which match indices of 1s in the first argument? How would the compiler know when to use this instruction? Same/similar questions about carry-less multiplication.

Disclaimer: I know little to nothing about assembly language. Maybe I should read up on it!

I hope this question is stackoverflow-appropriate.

phuclv
wesmlr
  • When you searched the internet for uses of `pext` and carry-less multiplication, what did you find? – njuffa Nov 14 '21 at 19:26
  • There are a number of different applications. A popular one is to compute bishop attacks in chess programming. I used it as a part of a perfect hash function for combinatorial search problems. – fuz Nov 14 '21 at 19:30
  • Mostly I found descriptions of what they are and similar implementations of bitwise manipulation. – wesmlr Nov 14 '21 at 19:45
  • A couple uses I've run across include [How to unset N right-most set bits](https://stackoverflow.com/q/65817459), and multiple parts of [AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/q/36932240) where I used pext as a bit version of left-packing, and pdep to expand a bitmask to bytes. Google `site:stackoverflow.com pext pdep` for more. IDK if there are any cases where a compiler would use it for you; normally you'd use it via intrinsics, i.e. `_pdep_u64` – Peter Cordes Nov 14 '21 at 20:15
  • Carryless multiplication is equivalent to multiplication of polynomials with coefficients in the integers mod 2, see https://en.wikipedia.org/wiki/Carry-less_product. Also known as Galois field multiplication. It shows up in cryptography algorithms like AES and in error-correcting codes. So there are definite uses for this. – Nate Eldredge Nov 14 '21 at 23:09
  • This is gonna sound kinda stupid, but forgive me. When would one have to manipulate bits so finely like that? Definitely a "lack of experience" question. It seems very useful for the compiler, I mean more for the programmer. Sorry if I'm misunderstanding something here! – wesmlr Nov 15 '21 at 06:00
  • Unfortunately(?) it's the other way around: compilers are relatively clueless when it comes to these instructions, but humans can use them in various inventive ways. BTW a neat use of PEXT is in [calculating the indexes of set bits](https://branchfree.org/2018/05/22/bits-to-indexes-in-bmi2-and-avx-512/) – harold Nov 15 '21 at 19:14
  • My impression of `pdep/pext` is that they are sort of a "because we can" instruction. It has various special cases that could be useful (packing/unpacking values scattered across bitfields, interleaving bits for [Z-ordering](https://en.wikipedia.org/wiki/Z-order_curve), etc) and so perhaps they just went ahead and implemented a generic version to allow for programmers to be creative. It's one of those things, like popcount or lzcount, that can be done many times more efficiently in hardware with a dedicated instruction than in software using existing instructions. – Nate Eldredge Nov 15 '21 at 20:53

3 Answers

10

You can find some applications listed in the paper describing the hardware unit for PDEP/PEXT:

There are many emerging applications, such as cryptography, imaging and biometrics, where more advanced bit manipulation operations are needed. While these can be built from the simpler logical and shift operations, the applications using these advanced bit manipulation operations are significantly sped up if the processor can support more powerful bit manipulation instructions. Such operations include arbitrary bit permutations, performing multiple bit-field extract operations in parallel, and performing multiple bit-field deposit operations in parallel. We call these permutation (perm), parallel extract (pex) or bit gather, and parallel deposit (pdep) or bit scatter operations, respectively.

Performing Advanced Bit Manipulations Efficiently in General-Purpose Processors

Bit permutation is extremely common in bitboards, for example reversing bytes/words or mirroring bit arrays. Many bitboard algorithms require extensive bit manipulation, and people had to get creative to do that before the era of PEXT/PDEP. Later many card game engines also use that technique to deal with a single game set in just one or a few registers.

PDEP/PEXT are also used to greatly improve bit-interleaving performance, which is common in algorithms like Morton code.

The multiplication technique invented for bitboards is also commonly used in many algorithms in Bit Twiddling Hacks, for example interleaving bits with a 64-bit multiply. This technique is no longer needed when PDEP/PEXT are available.

You can find more detailed information in Bit permutations and Hacker's Delight.

Another use for PDEP/PEXT is extracting/combining fields whose bits are not in contiguous positions, for example when disassembling RISC-V instructions. There the immediate bits are scattered around to simplify the hardware design, but that makes them a bit messier to work with in software without PDEP/PEXT.

Some other applications:

I think the pext / pdep instructions have HUGE implications to 4-coloring problem, 3-SAT, Constraint Solvers, etc. etc. More researchers probably should look into those two instructions.

Just look at Binary Decision Diagrams, and other such combinatorial data structures, and you can definitely see the potential uses of PEXT / PDEP all over the place.

https://news.ycombinator.com/item?id=19137260


How would the compiler know when to use this instruction?

Compilers can recognize common patterns and optimize the instruction sequence, but for advanced things like this, programmers usually need to explicitly call intrinsics from high-level code.

Sep Roland
phuclv
  • You write that "Later many card game engines also use that technique to deal with a single game set in just one or a few registers" Do you have any reference or info where I can read more about this? – Petter T Nov 28 '22 at 11:00
  • @PetterT I don't remember where I read that, it was long ago. However some people definitely use it, especially in games that only care about the value and not the suits [More efficient way to store Playing Cards in bits?](https://stackoverflow.com/q/31852353/995714) – phuclv Nov 29 '22 at 13:42
4

PDEP (Parallel Deposit) and PEXT (Parallel Extract) are meant to be a convenient way to extract and deposit bit fields. I'd bet there are good low-level use cases for them.

For actual uses: I wrote a Sudoku solver that used PEXT in a couple of functions to extract bit values. Thanks to PEXT I was able to extract 4 elements in a single instruction (vs. one at a time with the normal approach). It was really convenient. If you really want, I could put up a code snippet on Compiler Explorer to show the difference.

Dom324
0

The following isn't directly related to the usage of PDEP / PEXT since it's about their performance, but performance affects whether using them makes sense. I've got a Zen 2 Ryzen Threadripper 3990X CPU, and I tested the throughput of PDEP and PEXT with the intrinsics of MSVC++ and Intel C++ under Windows 11, and clang++ and g++ under Linux. Here's the code:

#include <iostream>
#include <vector>
#include <chrono>
#include <random>
#include <cstdint>
#include <atomic>
#if defined(_MSC_VER)
    #include <intrin.h>
#elif defined(__GNUC__) || defined(__llvm__)
    #include <immintrin.h>
#endif

using namespace std;
using namespace chrono;

atomic_uint64_t aSum( 0 );

int main()
{
    constexpr size_t
        N = 0x1000,
        ROUNDS = 10'000;
    vector<uint64_t> data( N, 0 );
    mt19937_64 mt;
    uniform_int_distribution<uint64_t> uid( 0, -1 );
    for( uint64_t &d : data )
        d = uid( mt );
    auto pdep = []( uint64_t data, uint64_t mask ) -> uint64_t { return _pdep_u64( data, mask ); };
    auto pext = []( uint64_t data, uint64_t mask ) -> uint64_t { return _pext_u64( data, mask ); };
    auto bench = [&]<typename Permute>( Permute permute ) -> double
    {
        uint64_t sum = 0;
        auto start = high_resolution_clock::now();
        constexpr uint64_t MASK = 0x5555555555555555u;
        for( size_t r = ROUNDS; r--; )
            for( uint64_t d : data )
                sum += permute( d, MASK );
        double ns = (double)(int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / ((double)N * ROUNDS);
        ::aSum = sum;
        return ns;
    };
    cout << bench( pdep ) << endl;
    cout << bench( pext ) << endl;
}

According to the data on agner.org, PDEP / PEXT should have a latency and throughput of slightly below 20 clock cycles on my Zen 2 CPU. On Intel CPUs since Haswell, the latency is only 3 clock cycles and the throughput is a whopping one instruction per clock cycle.
But according to my measurements each instruction takes about 35 ns, i.e. about 150 clock cycles on my CPU. There's no measurement error, and the disassembly I checked matches what you'd write in assembly. So I'm curious about the data of other CPUs; maybe you'll report it here. It would help to assess whether using PDEP or PEXT makes sense.

phuclv
Bonita Montero
  • I've seen it said that PEXT/PDEP microcode on AMD before Zen 3 has data-dependent performance, so some inputs are faster than others. Maybe Agner only checked with one input value which happened to be simple, like all-zero? – Peter Cordes May 14 '22 at 15:05
  • Anyway, this is clearly not an answer to this SO question, but rather a question you should post separately, with a title like "How do PEXT/PDEP perform on Zen 2 and earlier?" The question body can be mostly this, your experiment and results. Details on exactly how slow it is before Zen 3 added HW support would be nice to have; it'd be a good question to have on SO. (We can edit this question and/or one of the answers to mention it being slow on many existing AMD CPUs, since that's good info for future readers to beware of. But it doesn't belong as a full answer here.) – Peter Cordes May 14 '22 at 15:06
  • @PeterCordes: I know that my question only partially fits here, but when I google for "PDEP PEXT Stack Overflow" this is the only article about PDEP and PEXT here. So it is not the worst idea to post it here. And you're right, Peter, the results vary according to the mask. If I have a mask of 0, i.e. extracting or compressing no bits, the results are according to Agner's data. Should I inform him? Better not, I've already mailed him for another reason and got a dismissive reply. – Bonita Montero May 14 '22 at 16:18
  • This isn't an "article", it's a *question*, about something else. Finding a very different question involving the same function or instruction never makes it OK to post a new question as an answer on Stack Overflow. (Also, my Google search for `site:stackoverflow.com pext pdep` found 137 hits. When I typoed PEXT as PEX, I only got 2, including this and a typoed comment on the AVX2 left-packing Q&A.) I guess I can see why you thought it would make sense to post a performance note on a question about using it, so it's not as totally obvious as usual, but you wrote this like a question. – Peter Cordes May 14 '22 at 16:39
  • Re: emailing Agner Fog or posting on his blog forum: if you have specific facts to share, like that PEXT's data-dependent performance on Zen 2 has 20 cycles as the *best* case, with the worst being way worse, I think he'd find that useful. IDK what you emailed him about before, but this is fully relevant. – Peter Cordes May 14 '22 at 16:44
  • OTOH, Agner Fog's tables are no longer the best-maintained source for such info; https://uops.info/ automated the testing process enough to avoid human error in updating a spreadsheet, and to explore more corner cases. Andreas Abel maintains it, and is occasionally active on Stack Overflow. The https://uops.info/ tables do try to cover some fast vs. slow cases to give a throughput and latency range for `div` IIRC, so he might be interested in adding that for PEXT/PDEP. (He might be aware and just not have gotten around to it, but maybe not, or maybe he didn't realize how variable it can be.) – Peter Cordes May 14 '22 at 16:47
  • The https://uops.info/ tables don't actually show the throughput or uops range for variable instructions like `idiv` on Conroe, for example, only for latency. But slow vs. fast division is in the throughput test results: https://uops.info/html-tp/CON/IDIV_R64-Measurements.html – Peter Cordes May 14 '22 at 16:51
  • IDIV / DIV performance is obvious since iterative subtract-and-shift is the only working solution. But the performance of PDEP and PEXT is not necessarily so low. – Bonita Montero May 14 '22 at 17:41
  • My point was that uops.info does already have some limited support in its UI for displaying variable latency, and in its database for at least having the slow vs. fast throughput details available for applicable instructions, even if that part isn't shown in the table. So the infrastructure is there to make the web site do something with PDEP/PEXT, once you tell the relevant human about it. (And BTW, no, iterative subtract and shift isn't the only working algorithm. My understanding is that modern dividers use a table for an initial approximation and then use pipelined Newton-Raphson.) – Peter Cordes May 14 '22 at 18:36