Intel's description in their "future extensions" instruction set reference manual has the usual Operation section, which fully specifies which bits go where.
The Intrinsics Guide also reproduces the Operation section, which is a nice change from some other poorly-documented entries there. Or maybe it's a recent addition. It does still leave out the tables and diagrams. I normally find the insn ref manual more useful, except when searching for instructions I might not have thought of or don't know about.
The Operation section for this instruction is long and hard to grok, and the English text description is only a rough summary:
Perform fix-up of quad-word elements encoded in double-precision
floating-point format in the first source operand (the second operand)
using a 32-bit, two-level look-up table specified in the corresponding
quadword element of the second source operand (the third operand) with
exception reporting specifier imm8
...
The two-level look-up table perform a fix-up of each DP FP input data
in the first source operand by decoding the input data encoding into 8
token types. A response table is defined for each token type that
converts the input encoding in the first source operand with one of 16
response actions.
The intended use-case is:
- dest=result of vrcp14pd (or similar) + Newton-Raphson iteration
- src=input to the approximation + refinement
- table=fixup table. Can be a broadcast memory operand, so only 64 or 32 bits of memory are needed in the common case where you want the same table for every element of a vector. (The table is only 32 bits for both single and double precision, but the DP version's broadcast option is m64bcst. It's ok for the upper 32 bits of that to be garbage, but not for the 64-bit load to cross a page boundary into an unmapped page: that will probably fault.)
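To see why the refined result needs a fix-up at all, here is a scalar sketch of one Newton-Raphson step for a reciprocal, y1 = y0 * (2 - x*y0). The function names are made up for illustration, and an exact divide stands in for the hardware approximation. For x = 0 or x = Inf the approximation itself is fine (Inf or 0), but the refinement step multiplies 0 by Inf and turns the result into NaN, which is the kind of case the fix-up table is meant to patch.

```c
#include <math.h>

/* One Newton-Raphson refinement step for y ~= 1/x.
 * y0 is the initial approximation (hypothetical stand-in for the
 * result of a hardware reciprocal-approximation instruction). */
double refine_recip(double x, double y0)
{
    return y0 * (2.0 - x * y0);
}

/* Returns nonzero if the refined reciprocal of x came out as NaN.
 * Assumption: an exact divide plays the role of the approximation. */
int refinement_breaks(double x)
{
    double y0 = 1.0 / x;
    return isnan(refine_recip(x, y0));
}
```

With this sketch, refinement_breaks(0.0) and refinement_breaks(INFINITY) both report a NaN, while an ordinary input such as 4.0 refines cleanly.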
Perhaps a more detailed English description would be useful to bridge the gap between that very rough summary and the full pseudocode:
For each src element:
tsrc = src, with denormals flushed to zero if MXCSR.DAZ is set. The original src is not used at all after this: there's no dest=src action, only dest=tsrc.
Categorize tsrc as one of eight "token" types (QNAN, SNAN, zero, +1, -Inf, +Inf, negative value, positive value). The imm8 selects exception reporting: each set bit causes an exception to be flagged when a token of the matching type is found.
Use that category token to look up an action in the corresponding element of the 3rd operand (which is a table of eight 4-bit codes, one for each token).
The action can be one of dest=dest, dest=tsrc, dest=NaN, dest=+/-Inf, dest=Inf with the sign of tsrc, dest=+/-0, dest=+/-1, dest=1/2, dest=90.0, dest=pi/2, or dest=+/-MAX_FLOAT (the largest-magnitude finite value). See Intel's docs for which code maps to which action.
This process is done separately for each vector element.
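The per-element steps above can be sketched as a scalar C model. This is a rough software model, not Intel's pseudocode: the names are made up, the action-code numbering and the placement of token j's code in table bits [4j+3:4j] are written down from memory of the Operation section and should be checked against the manual, and the imm8 exception reporting is left out entirely.

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Token indices, in the order the text lists the eight types. */
enum { TOK_QNAN, TOK_SNAN, TOK_ZERO, TOK_POS_ONE,
       TOK_NEG_INF, TOK_POS_INF, TOK_NEG, TOK_POS };

static uint64_t bits_of(double d) { uint64_t u; memcpy(&u, &d, sizeof u); return u; }
static double double_of(uint64_t u) { double d; memcpy(&d, &u, sizeof d); return d; }

/* Classify a double (given as raw bits) into one of the eight tokens. */
static int classify(uint64_t u)
{
    uint64_t exp = (u >> 52) & 0x7FF, man = u & 0x000FFFFFFFFFFFFFULL;
    int neg = (int)(u >> 63);
    if (exp == 0x7FF && man != 0) return (man >> 51) ? TOK_QNAN : TOK_SNAN;
    if (exp == 0x7FF) return neg ? TOK_NEG_INF : TOK_POS_INF;
    if (exp == 0 && man == 0) return TOK_ZERO;           /* +0 or -0 */
    if (u == 0x3FF0000000000000ULL) return TOK_POS_ONE;  /* exactly +1.0 */
    return neg ? TOK_NEG : TOK_POS;
}

/* Action codes: assumed numbering, verify against Intel's docs. */
static double apply_action(unsigned code, double dest, double tsrc)
{
    switch (code & 0xF) {
    case 0x0: return dest;                          /* dest=dest */
    case 0x1: return tsrc;                          /* dest=tsrc */
    case 0x2: /* quieted tsrc; result for a non-NaN tsrc is an assumption */
        return isnan(tsrc) ? double_of(bits_of(tsrc) | 0x0008000000000000ULL)
                           : double_of(0x7FF8000000000000ULL);
    case 0x3: return double_of(0xFFF8000000000000ULL); /* default QNaN */
    case 0x4: return -INFINITY;
    case 0x5: return INFINITY;
    case 0x6: return signbit(tsrc) ? -INFINITY : INFINITY;
    case 0x7: return -0.0;
    case 0x8: return 0.0;
    case 0x9: return -1.0;
    case 0xA: return 1.0;
    case 0xB: return 0.5;
    case 0xC: return 90.0;
    case 0xD: return 1.5707963267948966;            /* pi/2 */
    case 0xE: return DBL_MAX;
    default:  return -DBL_MAX;
    }
}

/* Model of the fix-up for one element; daz stands in for MXCSR.DAZ. */
double fixup_model(double dest, double src, uint32_t table, int daz)
{
    uint64_t u = bits_of(src);
    if (daz && ((u >> 52) & 0x7FF) == 0)
        u &= 0x8000000000000000ULL;  /* flush denormal; keeping the sign is an assumption */
    int tok = classify(u);
    assert(tok >= 0 && tok <= 7);
    unsigned code = (table >> (4 * tok)) & 0xF;
    return apply_action(code, dest, double_of(u));
}
```

For example, fixup_model(42.0, -0.0, 0x6u << 8, 0) classifies the input as the zero token, finds code 6 in the third nibble, and returns -Inf, while any input whose token maps to code 0 just returns the 42.0 passed in as dest.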
A typical use would use the dest=dest code for all the cases where the results we're fixing up will already be correct. Note that dest is also an input operand, even with no writemask, because of the dest=dest action.
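As a sketch of what such a table could look like, here is a hypothetical one for a refined reciprocal: everything stays dest=dest except zero (which should give Inf with the sign of the input) and +/-Inf (which should give +/-0). The helper names are invented, and both the action-code values and the nibble layout (token j in bits [4j+3:4j]) are assumptions written from memory of Intel's Operation section, so check them against the manual before relying on the constant.

```c
#include <assert.h>
#include <stdint.h>

/* Token indices, in the order the text lists the eight types. */
enum { TOK_QNAN, TOK_SNAN, TOK_ZERO, TOK_POS_ONE,
       TOK_NEG_INF, TOK_POS_INF, TOK_NEG, TOK_POS, TOK_COUNT };

/* Action codes used here (assumed numbering, verify against Intel's docs). */
enum { ACT_KEEP_DEST        = 0x0,   /* dest=dest */
       ACT_INF_SIGN_OF_TSRC = 0x6,   /* dest=Inf with the sign of tsrc */
       ACT_NEG_ZERO         = 0x7,   /* dest=-0 */
       ACT_POS_ZERO         = 0x8 }; /* dest=+0 */

/* Pack eight 4-bit action codes into the 32-bit table. */
uint32_t pack_fixup_table(const uint8_t codes[TOK_COUNT])
{
    uint32_t table = 0;
    for (int j = 0; j < TOK_COUNT; j++) {
        assert(codes[j] < 16);
        table |= (uint32_t)codes[j] << (4 * j);
    }
    return table;
}

/* Illustrative table for fixing up a refined reciprocal. */
uint32_t reciprocal_fixup_table(void)
{
    uint8_t codes[TOK_COUNT] = {0};           /* default: dest=dest */
    codes[TOK_ZERO]    = ACT_INF_SIGN_OF_TSRC; /* 1/+-0   -> +-Inf */
    codes[TOK_NEG_INF] = ACT_NEG_ZERO;         /* 1/-Inf  -> -0    */
    codes[TOK_POS_INF] = ACT_POS_ZERO;         /* 1/+Inf  -> +0    */
    return pack_fixup_table(codes);
}
```

Under those assumptions the packed constant comes out as 0x00870600, with the five untouched tokens left at code 0 so the already-correct results pass through as dest.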