
I was looking through the instruction set in AVX-512 and noticed a set of fixup instructions. Some examples:

_mm512_fixupimm_pd, 
_mm512_mask_fixupimm_pd, 
_mm512_maskz_fixupimm_pd

_mm512_fixupimm_round_pd, 
_mm512_mask_fixupimm_round_pd, 
_mm512_maskz_fixupimm_round_pd

What is meant here by "fixing up"?

Simon Verbeke

2 Answers


That's a great question. Intel's answer (my bold) is here:

This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source so that they match the spec, although it is generally useful for fixing up the results of multiple-instruction sequences to reflect special-number inputs. For example, consider rcp(0). Input 0 to rcp, and you should get INF according to the DX10 spec. However, evaluating rcp via Newton-Raphson, where x=approx(1/0), yields an incorrect result. To deal with this, VFIXUPIMMPS can be used after the N-R reciprocal sequence to set the result to the correct value (i.e. INF when the input is 0).

Look for VFIXUPIMMPD in:

https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf

JCx
  • That's a lot more detailed than their online documentation, thanks for the reference! – Simon Verbeke May 13 '15 at 11:45
  • It's still a pretty rubbish manual if you ask me. Not up to the normal quality of writing. If I had a CPU supporting AVX-512 I'd give it a go and see what actually happens ;) – JCx May 13 '15 at 11:48
  • @JCx: I think the pseudo-code description of what it does is detailed enough (the `Operation` section). The paragraph you quote just gives you the use-case, not the operational details. Summary: for each src element, categorize it as one of eight "token" types. Use that token to look up an action in the corresponding element of the 3rd operand (which is a table of eight 4-bit codes). The action can set dest=dest, dest=src, dest=NaN, dest=+/-Inf, dest=+/-0, dest=pi/2, or a few other things. Note that dest is also an input operand, even with no writemask. – Peter Cordes Jan 29 '16 at 02:42

Intel's description in their "future extensions" instruction set reference manual has the usual Operation section which fully specifies which bits go where.

The Intrinsics Guide also reproduces the Operation section, which is a nice change from some other poorly-documented entries in the intrinsics guide. Or maybe it's a recent addition. It does still leave out the tables and diagrams. I normally find the insn ref manual more useful, except sometimes when searching for instructions I might not have thought of or know about.

The Operation section for this instruction is long and hard to grok, and the English text description is only a rough summary:

Perform fix-up of quad-word elements encoded in double-precision floating-point format in the first source operand (the second operand) using a 32-bit, two-level look-up table specified in the corresponding quadword element of the second source operand (the third operand) with exception reporting specifier imm8

...

The two-level look-up table perform a fix-up of each DP FP input data in the first source operand by decoding the input data encoding into 8 token types. A response table is defined for each token type that converts the input encoding in the first source operand with one of 16 response actions.

The intended use-case is:

  • dest=result of vrcp14pd (or similar) + Newton-Raphson iteration
  • src=input to the approximation + refinement
  • table=fixup table. This can be a broadcast memory operand, so only 32 or 64 bits of memory are needed in the common case where you want the same table for every element of a vector. (The table is only 32 bits for both single and double precision, but the DP version's broadcast option is m64bcst. It's fine for the upper 32 bits of that to be garbage, but the 64-bit load must not cross a page boundary into an unmapped page: that will probably fault.)
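To see why the refined result needs fixing up at all, here is a minimal scalar sketch of one Newton-Raphson step for a reciprocal. The function name `rcp_newton_step` is made up for illustration, and the initial estimate is passed in by hand rather than coming from a real approximation instruction.

```c
#include <math.h>

/* One Newton-Raphson refinement step for 1/a, starting from estimate x0:
 *   x1 = x0 * (2 - a * x0)
 * rcp_newton_step is a hypothetical helper, not a library function. */
static double rcp_newton_step(double a, double x0)
{
    return x0 * (2.0 - a * x0);
}
```

With a = 0 and an initial estimate of +Inf, `a * x0` is 0 * Inf = NaN, so the "refined" result is NaN instead of the +Inf you wanted. That is the case the fix-up instruction is meant to patch, using the original input (src) to decide what to overwrite.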

Perhaps a more detailed English description would be useful to bridge the gap between that very rough summary and the full pseudocode:

For each src element:

  • tsrc = src, with denormals flushed to zero if MXCSR.DAZ is set. The original src is not used at all after this: there's no dest=src action, only dest=tsrc.

  • Categorize tsrc as one of eight "token" types (QNAN, SNAN, zero, +1, -Inf, +Inf, negative value, positive value). If the imm8 is non-zero, exceptions will be triggered when a token of the matching type is found.

  • Use that category token to look up an action in the corresponding element of the 3rd operand (which is a table of eight 4-bit codes, one for each token).

  • The action can be one of dest=dest, dest=tsrc, dest=NaN, dest=+/-Inf, dest=Inf with the sign of tsrc, dest=+/-0, dest=+/-1, dest=1/2, dest=90.0, dest=pi/2, or dest=MAX/MIN_FLOAT. See Intel's docs for which code maps to which action.

This process is done separately for each vector element.

A typical use would put the dest=dest code for all the cases where the results we're fixing up will already be correct. Note that dest is also an input operand, even with no writemask, because of the dest=dest action.
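The steps above can be sketched as a scalar emulation in plain C, which runs without AVX-512 hardware. This is only an illustration under several assumptions: it ignores MXCSR.DAZ, the imm8 exception reporting, and write-masking; both NaN actions just return a generic quiet NaN; and the token and action numbering follow my reading of Intel's Operation section, so check them against the manual. All names here (`classify`, `fixup_scalar`, `RCP_FIXUP_TABLE`) are made up for the example.

```c
#include <float.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Token numbering, in the order listed above. */
enum token { TOK_QNAN, TOK_SNAN, TOK_ZERO, TOK_POS_ONE,
             TOK_NEG_INF, TOK_POS_INF, TOK_NEG, TOK_POS };

static enum token classify(double x)
{
    if (isnan(x)) {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);
        /* Bit 51 is the quiet bit of a binary64 NaN. */
        return ((bits >> 51) & 1) ? TOK_QNAN : TOK_SNAN;
    }
    if (x == 0.0) return TOK_ZERO;          /* matches +0 and -0 */
    if (x == 1.0) return TOK_POS_ONE;
    if (isinf(x)) return x < 0 ? TOK_NEG_INF : TOK_POS_INF;
    return x < 0 ? TOK_NEG : TOK_POS;
}

/* One element of the fix-up: dest is an input as well as the output. */
static double fixup_scalar(double dest, double src, uint32_t table)
{
    unsigned code = (table >> (4 * classify(src))) & 0xF;
    switch (code) {
    case 0x0: return dest;                  /* keep the existing result */
    case 0x1: return src;                   /* tsrc (DAZ not modelled) */
    case 0x2:                               /* QNaN(tsrc), simplified */
    case 0x3: return NAN;                   /* QNaN indefinite, simplified */
    case 0x4: return -INFINITY;
    case 0x5: return INFINITY;
    case 0x6: return signbit(src) ? -INFINITY : INFINITY;
    case 0x7: return -0.0;
    case 0x8: return 0.0;
    case 0x9: return -1.0;
    case 0xA: return 1.0;
    case 0xB: return 0.5;
    case 0xC: return 90.0;
    case 0xD: return 1.5707963267948966;    /* pi/2 */
    case 0xE: return DBL_MAX;
    case 0xF: return -DBL_MAX;
    }
    return dest;                            /* not reached: code is 4 bits */
}

/* Hypothetical table for a reciprocal: zero -> Inf with the sign of src,
 * -Inf -> -0, +Inf -> +0, every other token keeps dest. */
static const uint32_t RCP_FIXUP_TABLE =
    (0x6u << (4 * TOK_ZERO)) |
    (0x7u << (4 * TOK_NEG_INF)) |
    (0x8u << (4 * TOK_POS_INF));
```

For example, `fixup_scalar(NAN, 0.0, RCP_FIXUP_TABLE)` replaces the NaN left by the refinement step with +Inf, while `fixup_scalar(0.25, 4.0, RCP_FIXUP_TABLE)` leaves an already-correct result alone because the positive-value token maps to the dest=dest code.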

Peter Cordes