I tried to vectorize the premultiplication of 64-bit colors with 16-bit integer ARGB channels.
I quickly realized that, due to the lack of accelerated integer division, I need to convert my values to float
and use some SSE2/SSE4.1 intrinsics explicitly for the best performance. Still, I wanted to keep the non-hardware-specific generic version as a fallback (I know it's currently slower than some vanilla operations, but it would provide forward compatibility with possible future improvements).
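For context, this is roughly the shape of the accelerated path I mean; it's only an illustrative sketch (the method name and exact instruction choices are mine, and real code would of course check Sse41.IsSupported first):
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Sketch: premultiply 16-bit RGB channels by A/65535 via the float path.
static Vector128<ushort> PremultiplySse41(ushort a, ushort r, ushort g, ushort b)
{
    // Widen the channels to 32 bits and convert them to float (CVTDQ2PS).
    Vector128<int> rgb = Vector128.Create((int)r, (int)g, (int)b, 0);
    Vector128<float> rgbF = Sse2.ConvertToVector128Single(rgb);

    // Scale by alpha / 65535 instead of performing an integer division per channel.
    Vector128<float> scaled = Sse.Multiply(rgbF, Vector128.Create(a / 65535f));

    // Convert back with rounding (CVTPS2DQ) and narrow to 16-bit channels (PACKUSDW).
    Vector128<int> pre = Sse2.ConvertToVector128Int32(scaled);
    return Sse41.PackUnsignedSaturate(pre, pre);
}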
However, the results of the generic fallback are incorrect on my machine.
A very minimal repro:
using System;
using System.Runtime.Intrinsics;

// Test color with 50% alpha
(ushort A, ushort R, ushort G, ushort B) c = (0x8000, 0xFFFF, 0xFFFF, 0xFFFF);
// Minimal version of the fallback logic if HW intrinsics cannot be used:
Vector128<uint> v = Vector128.Create(c.R, c.G, c.B, 0u);
v = v * c.A / Vector128.Create(0xFFFFu);
var cPre = (c.A, (ushort)v[0], (ushort)v[1], (ushort)v[2]);
// Original color:
Console.WriteLine(c); // prints (32768, 65535, 65535, 65535)
// Expected premultiplied color: (32768, 32768, 32768, 32768)
Console.WriteLine(cPre); // prints (32768, 32769, 32769, 32769)
I tried to determine which instructions are emitted that cause the inaccuracy, but I was really surprised to see that on SharpLab the results are correct. On the other hand, the issue is reproducible in .NET Fiddle.
Is this something that's expected on some platforms, or should I report it in the runtime repo as a bug?
Update
Never mind, this is clearly a bug. Using other values causes totally wrong results:
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
(ushort A, ushort R, ushort G, ushort B) c = (32768, 65535, 32768, 16384);
Vector128<uint> v1 = Vector128.Create(c.R, c.G, c.B, 0u);
v1 = v1 * c.A / Vector128.Create(0xFFFFu);
// prints <32769, 49152, 57344, 0> instead of <32768, 16384, 8192, 0>
Console.WriteLine(v1);
// Also for the older Vector<T>
Span<uint> span = stackalloc uint[Vector<uint>.Count];
span[0] = c.R;
span[1] = c.G;
span[2] = c.B;
Vector<uint> v2 = new Vector<uint>(span) * c.A / new Vector<uint>(0xFFFF);
// prints <32769, 49152, 57344, 0, 0, 0, 0, 0> on my machine
Console.WriteLine(v2);
In the end I realized that the issue is in the multiplication: if I replace * c.A with the constant expression * 32768, then the result is correct. For some reason the ushort value is not correctly extracted/masked(?) out of the packed field. Even Vector128.Create is affected:
(ushort A, ushort R, ushort G, ushort B) c = (32768, 65535, 32768, 16384);
Console.WriteLine(Vector128.Create((int)c.A)); // -32768
Console.WriteLine(Vector128.Create((int)32768)); // 32768
Console.WriteLine(Vector128.Create((int)c.A, (int)c.A, (int)c.A, (int)c.A)); // 32768
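Until this is fixed, a possible workaround could be to widen the field into a local (or cast it explicitly) before it enters the vector expression, so the operator no longer consumes the packed ushort field directly. Treat it as a sketch only; I haven't verified it on every affected runtime:
(ushort A, ushort R, ushort G, ushort B) c = (32768, 65535, 32768, 16384);

// Unverified workaround idea: read the packed field into a widened local first.
uint a = c.A;
Vector128<uint> v = Vector128.Create(c.R, c.G, c.B, 0u) * a / Vector128.Create(0xFFFFu);
Console.WriteLine(v); // should print <32768, 16384, 8192, 0> if the workaround helps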
Update 2
In the end I filed an issue in the runtime repo.