
This WikiChip article states that Neoverse V1 has int8 instructions that allow 256 operations per CPU clock (per core, presumably):

[Image: WikiChip throughput table for Neoverse V1 — the int8 row shows 256 operations per cycle, with the notes "Multiply and accumulate is counted as two operations" and "SVE"]

I'm trying to understand what these instructions are. Do they take int8 input and accumulate the results in int8's or int16s (risking overflow or requiring saturation), or do they accumulate into int32?

What are these instructions? Are they listed in https://developer.arm.com/documentation/dui0801/k/A64-SIMD-Vector-Instructions/ ?

MWB

1 Answer


What are these instructions?

smopa for the int8 and int16 types, bfmopa for the BF16 type. They are documented there.

Do they take int8 input and accumulate the results in int8's or int16s (risking overflow or requiring saturation), or do they accumulate into int32?

The int8 version accumulates into int32.

Unfortunately, the documentation quality is mediocre. ARM could use a good technical writer to document their hardware.

Still, I think that instruction does something like the following C++. It's untested, because I don’t have hardware that supports that ISA.

#include <array>
#include <cstdint>

using std::array;

// int8 flavor of SMOPA: a and b each hold 32 int8 lanes (a 256-bit SVE
// vector), acc is an 8x8 tile of int32 accumulators, mask1/mask2 are the
// per-lane predicates, and `subtract` selects the subtracting variant.
void smopa( array<int8_t, 32> a, array<int8_t, 32> b, array<array<int32_t, 8>, 8>& acc,
    array<bool, 32> mask1, array<bool, 32> mask2, bool subtract )
{
    for( int r = 0; r < 8; r++ )
        for( int c = 0; c < 8; c++ )
        {
            int32_t sum = acc[ r ][ c ];

            // Each accumulator lane consumes 4 consecutive int8 lanes
            // from each input vector
            for( int i = 0; i < 4; i++ )
            {
                int ir = r * 4 + i;
                int ic = c * 4 + i;
                // Both predicate lanes must be active
                if( !( mask1[ ir ] && mask2[ ic ] ) )
                    continue;

                // Widen to int32 before multiplying, then accumulate
                int32_t p = (int32_t)a[ ir ] * (int32_t)b[ ic ];
                sum = subtract ? sum - p : sum + p;
            }
            acc[ r ][ c ] = sum;
        }
}
Soonts
  • Thanks! So this must be a 2-cycle instruction? (Since it does `8*8*4*2` operations, while the image in the question says the chip can do `256` operations per cycle, where *"Multiply and accumulate is counted as two operations"*) – MWB Jul 13 '22 at 23:33
  • @MWB I don’t know. I have never used SME, only armv7 and arm64 editions of NEON. That’s how I can sometimes understand what’s written in that documentation. – Soonts Jul 13 '22 at 23:46
  • 1
    @MWB: Probably a throughput of 2c, yes. If there was a way for marketing to justify counting it as 512 int8 ops per clock, they probably would have done so. :P Good idea to work backward from the promotional material, in absence of an instruction-timing table. – Peter Cordes Jul 14 '22 at 00:11
  • 1
    The int8 matrix multiple instructions in Neoverse V1 are from the `FEAT_I8MM` feature of the Arm Architecture. An example is the SMMLA instruction https://developer.arm.com/documentation/ddi0602/2022-06/SIMD-FP-Instructions/SMMLA--vector---Signed-8-bit-integer-matrix-multiply-accumulate--vector-- The `smopa` and `bfmopa` instructions (https://developer.arm.com/documentation/ddi0602/2022-06/SME-Instructions/SMOPA--Signed-integer-sum-of-outer-products-and-accumulate-) are from the SME extension, which is not present in Neoverse V1 – Kyrill Jul 19 '22 at 14:35
  • @Kyrill `SMMLA` is defined as `MatMulAdd`, which does `2*2*8*2` ops. Do I understand it correctly that the throughput of this op is `64/256=1/4` cycles in Neoverse V1 then? – MWB Jul 25 '22 at 08:18
  • @Kyrill *"the SME extension, which is not present in Neoverse V1"* -- How come the image (in the question) says "SVE" in the bottom left corner? – MWB Jul 25 '22 at 08:36
  • I wonder how you arrived at the value of `dim = 8`? The `SMOPA` page defines `dim = VL DIV esize`, and `esize` can be either 32 or 64, but `VL` doesn't seem to be defined at all... – MWB Jul 28 '22 at 01:17
  • 1
    @MWB The OP’s link to wikichip.org says “There are now two 256b SVE vector units that can support up to two 256b operations/cycle.” It then says “When executing legacy NEON or FP operations, the vector units can support up to 4x128b operations/cycle”, A64 NEON vectors are exactly 128 bits, we can conclude the lowercase ‘b’ stands for ‘bits’. – Soonts Jul 28 '22 at 11:14
  • Thanks. So if I understand this correctly, in your version, `VL=256` (vector length?), `esize=32` and therefore `dim = VL/esize = 8`. But why not `esize = 64` and `dim = VL/esize = 4` ? – MWB Jul 28 '22 at 22:54
  • @MWB That instruction comes in 2 versions. The one in my example multiplies int8 numbers and updates int32 accumulators. Another version of that instruction multiplies int16 numbers, and updates int64 accumulators. – Soonts Jul 28 '22 at 23:05
  • @MWB You can find the latency and throughputs of the instructions on Neoverse V1 in the software optimization guide for the core at https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ Looks like the throughput is 4 per cycle for the Neon form and 2 per cycle for the SVE form (the core can process multiple such instructions per cycle) – Kyrill Aug 08 '22 at 10:19
  • "@Kyrill the SME extension, which is not present in Neoverse V1 -- How come the image (in the question) says "SVE" in the bottom left corner?" The Neoverse V1 supports the SVE extension, not the SME one. Both extensions contain instructions that help with matrix multiplication, but target different problem sizes. The SME extension, for example, targets larger matrix sizes and includes an architectural "tile" state that can be used to accumulate outer-product operations in. The SVE instructions target matmul operations that fit into an SVE register – Kyrill Aug 08 '22 at 10:22