
I tried to vectorize the premultiplication of 64-bit colors with 16-bit integer ARGB channels.

I quickly realized that, due to the lack of accelerated integer division support, I need to convert my values to float and use some SSE2/SSE4.1 intrinsics explicitly for the best performance. Still, I wanted to keep the non-hardware-specific generic version as a fallback solution (I know it's currently slower than some vanilla operations, but it would provide forward compatibility for possible future improvements).
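For context, the accelerated path I have in mind looks roughly like the sketch below. This is only an illustration of the approach (the helper name `PremultiplySse` and the exact intrinsic choices are mine for this post; rounding and packing back to `ushort` are glossed over), not the actual implementation:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Sketch only: widen the RGB channels to int, convert to float, scale by A / 65535,
// then truncate back to int. Assumes Sse2.IsSupported; packing back down to ushort
// (e.g. an SSE4.1 saturating pack) and proper rounding are omitted.
static Vector128<int> PremultiplySse(ushort a, ushort r, ushort g, ushort b)
{
    Vector128<float> channels = Sse2.ConvertToVector128Single(Vector128.Create((int)r, g, b, 0));
    Vector128<float> scale = Vector128.Create(a / 65535f);
    return Sse2.ConvertToVector128Int32WithTruncation(Sse.Multiply(channels, scale));
}

Going through int on the way to float is fine here because every channel value fits into 16 bits.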

However, the results of the generic fallback version are incorrect on my machine.

A very minimal repro:

using System;
using System.Runtime.Intrinsics;

// Test color with 50% alpha
(ushort A, ushort R, ushort G, ushort B) c = (0x8000, 0xFFFF, 0xFFFF, 0xFFFF);

// Minimal version of the fallback logic if HW intrinsics cannot be used:
Vector128<uint> v = Vector128.Create(c.R, c.G, c.B, 0u);
v = v * c.A / Vector128.Create(0xFFFFu);
var cPre = (c.A, (ushort)v[0], (ushort)v[1], (ushort)v[2]);

// Original color:
Console.WriteLine(c); // prints (32768, 65535, 65535, 65535)

// Expected premultiplied color:   (32768, 32768, 32768, 32768)
Console.WriteLine(cPre); // prints (32768, 32769, 32769, 32769)
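For clarity, this is the per-channel scalar formula that the vector code above is meant to reproduce:

// Scalar reference for one channel: 0xFFFF * 0x8000 / 0xFFFF == 0x8000,
// which is where the expected 32768 comes from.
uint expectedR = (uint)c.R * c.A / 0xFFFFu;
Console.WriteLine(expectedR); // 32768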

I tried to determine which instructions are emitted that cause the inaccuracy, but I was really surprised to see that on SharpLab the results are correct. On the other hand, the issue is reproducible on .NET Fiddle.

Is this something that's expected on some platforms, or should I report it as a bug in the runtime repo?


Update

Never mind, this is clearly a bug. Using other values causes totally wrong results:

using System;
using System.Numerics;
using System.Runtime.Intrinsics;

(ushort A, ushort R, ushort G, ushort B) c = (32768, 65535, 32768, 16384);

Vector128<uint> v1 = Vector128.Create(c.R, c.G, c.B, 0u);
v1 = v1 * c.A / Vector128.Create(0xFFFFu);

// prints <32769, 49152, 57344, 0> instead of <32768, 16384, 8192, 0>
Console.WriteLine(v1);

// Also for the older Vector<T>
Span<uint> span = stackalloc uint[Vector<uint>.Count];
span[0] = c.R;
span[1] = c.G;
span[2] = c.B;
Vector<uint> v2 = new Vector<uint>(span) * c.A / new Vector<uint>(0xFFFF);

// prints <32769, 49152, 57344, 0, 0, 0, 0, 0> on my machine
Console.WriteLine(v2);
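(The expected values quoted in the comments above come from plain scalar arithmetic:)

// Scalar check of the expected results:
// 65535 * 32768 / 65535 = 32768, 32768 * 32768 / 65535 = 16384, 16384 * 32768 / 65535 = 8192
Console.WriteLine((uint)c.R * c.A / 0xFFFFu); // 32768
Console.WriteLine((uint)c.G * c.A / 0xFFFFu); // 16384
Console.WriteLine((uint)c.B * c.A / 0xFFFFu); // 8192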

In the end I realized that the issue is with the multiplication: if I replace `* c.A` with the constant expression `* 32768`, then the result is correct. For some reason the ushort value is not correctly extracted/masked(?) from the packed field. Even Vector128.Create is affected:

(ushort A, ushort R, ushort G, ushort B) c = (32768, 65535, 32768, 16384);

Console.WriteLine(Vector128.Create((int)c.A)); // -32768
Console.WriteLine(Vector128.Create((int)32768)); // 32768
Console.WriteLine(Vector128.Create((int)c.A, (int)c.A, (int)c.A, (int)c.A)); // 32768
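To make that observation concrete, here is the same computation with the alpha written as a literal instead of being read from the packed field; as described above, this produces the expected result (the variable name `v3` is just for this example):

// Same computation as v1, but with the alpha spelled out as a constant:
Vector128<uint> v3 = Vector128.Create((uint)c.R, c.G, c.B, 0u);
v3 = v3 * 32768u / Vector128.Create(0xFFFFu);
Console.WriteLine(v3); // <32768, 16384, 8192, 0>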

Update 2

In the end, I filed an issue in the runtime repo.

  • In the meantime I realized that SharpLab also produces the wrong result in Debug mode. Strange, because on my computer both Debug and Release builds are incorrect. So I'm starting to believe this is a bug after all. – György Kőszeg Mar 14 '23 at 11:51
  • This can be done with integer multiply and shifts, although it's not super efficient since it needs the high half of a 32x32 => 64-bit multiply, and SSE2 / AVX only gives you that with `pmuludq`, which gives you the full 64-bit results. So only half the input elements per vector. https://godbolt.org/z/7E9P9aWMh shows GCC and clang using a multiplicative inverse for dividing a vector of 4x `uint32_t` by `0xffffu`. This is exact integer division, no FP involved at any point. – Peter Cordes Mar 14 '23 at 12:04
  • I don't see anything in the software fallback path which would cause this... I'd be tempted to either raise this in the dotnet/runtime repo, or ask in `#allow-unsafe-blocks` on [discord](https://discord.com/invite/ccyrDKv) (lots of JIT people hang out there, and they'll tell you if you need to open an issue) – canton7 Mar 14 '23 at 12:10
  • If you were using FP, it would probably perform better to multiply by `1.0f/0xffff` (or however you write a `float` constant in C#), although that can't be represented exactly. But with about 23 bits of precision, it might still give the correct 16-bit integer after rounding or truncating to 16-bit. – Peter Cordes Mar 14 '23 at 12:11
  • @PeterCordes: yes, bit shifting is faster with vanilla code than division, but the results are not exactly the same. And the floating point version now works well; this question is just about the issue I noticed. – György Kőszeg Mar 14 '23 at 12:22
  • Alright, this must be a bug; using some other values ends up in totally wrong results. I will update the post soon. – György Kőszeg Mar 14 '23 at 12:24
  • @GyörgyKőszeg: The compiler output I linked is for `vec / 0xffffu`. The results *are* exact, using a multiplicative inverse like I said, same as compilers do for scalar `uint / constant`. [Why does GCC use multiplication by a strange number in implementing integer division?](https://stackoverflow.com/q/41183935) Did you think I meant `v >> 16`? Oh, you probably thought I meant doing the `*` part of your expression with an integer multiply; I was talking about using the high half of an integer multiply as part of the division. – Peter Cordes Mar 14 '23 at 12:24

0 Answers