What are these instructions?
smopa for int8 and int16 types, bfmopa for the BF16 type (FP16 goes through fmopa). They are documented there.
Do they take int8 input and accumulate the results in int8 or int16 (risking overflow or requiring saturation), or do they accumulate into int32?
The int8 version accumulates into int32.
Unfortunately, the documentation quality is mediocre. I would recommend that ARM hire a good technical writer to document their hardware.
Still, I think that instruction does something like the following C++.
Untested, because I don't have hardware that supports that ISA.
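    #include <array>
    #include <cstdint>

    using std::array;

    // Emulates the int8 smopa on 32-lane vectors:
    // a and b hold 32 int8 values each, acc is an 8x8 tile of 32-bit integers.
    void smopa( array<int8_t, 32> a, array<int8_t, 32> b, array<array<int, 8>, 8>& acc,
        array<bool, 32> mask1, array<bool, 32> mask2, bool subtract )
    {
        for( int r = 0; r < 8; r++ )
            for( int c = 0; c < 8; c++ )
            {
                int sum = acc[ r ][ c ];
                // Each accumulator receives a 4-element dot product
                for( int i = 0; i < 4; i++ )
                {
                    int ir = r * 4 + i;
                    int ic = c * 4 + i;
                    // Skip the product when either lane is masked off
                    if( !( mask1[ ir ] && mask2[ ic ] ) )
                        continue;
                    int p = (int)a[ ir ] * (int)b[ ic ];
                    sum = subtract ? sum - p : sum + p;
                }
                acc[ r ][ c ] = sum;
            }
    }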
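
To make the int32 point concrete, here is a small sketch that feeds worst-case positive inputs through the same logic. The names smopa_ref, demo_all_max and demo_masked are made up for this illustration, and it is a software model only, with the same "untested on real hardware" caveat as above.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec  = std::array<int8_t, 32>;
using Mask = std::array<bool, 32>;
using Tile = std::array<std::array<int32_t, 8>, 8>;

// Hypothetical reference model: same logic as the smopa() sketch above.
void smopa_ref(const Vec& a, const Vec& b, Tile& acc,
               const Mask& m1, const Mask& m2, bool subtract)
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            for (int i = 0; i < 4; i++) {
                const int ir = r * 4 + i;
                const int ic = c * 4 + i;
                if (!(m1[ir] && m2[ic]))
                    continue;  // masked-off lanes contribute nothing
                const int32_t p = int32_t(a[ir]) * int32_t(b[ic]);
                acc[r][c] += subtract ? -p : p;
            }
}

// All lanes active, every input byte set to 127.
Tile demo_all_max()
{
    Vec a, b;
    a.fill(127);
    b.fill(127);
    Mask all;
    all.fill(true);
    Tile acc{};  // zero-initialized accumulator tile
    smopa_ref(a, b, acc, all, all, false);
    return acc;
}

// Same inputs, but lane 0 of the first operand is masked off,
// so row 0 of the tile only receives three of its four products.
Tile demo_masked()
{
    Vec a, b;
    a.fill(127);
    b.fill(127);
    Mask m1, m2;
    m1.fill(true);
    m2.fill(true);
    m1[0] = false;
    Tile acc{};
    smopa_ref(a, b, acc, m1, m2, false);
    return acc;
}
```

With every lane at 127, each accumulator ends up at 4 * 127 * 127 = 64516 after a single step, which already exceeds the int16 maximum of 32767. In this model, that is why a narrower accumulator would overflow or need saturation almost immediately.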