
Introduction to the problem

I am trying to speed up the intersection code of a (2D) ray tracer that I am writing. I am using C# and the System.Numerics library to get the speed of SIMD instructions.

The problem is that I am getting strange results: one kernel shows an over-the-roof speedup, the other a rather low one. My question is: why is one speedup over-the-roof while the other is rather low?

Context:

  • The RayPack struct is a series of (different) rays, packed into Vectors of System.Numerics.
  • The BoundingBoxPack and CirclePack structs each hold a single bounding box / circle, packed into vectors of System.Numerics.
  • The CPU used is an i7-4710HQ (Haswell), with SSE 4.2, AVX(2), and FMA(3) instructions.
  • Running in release mode (64-bit). The project targets .NET Framework 4.7.2. No additional options set.

Attempts

I've already tried looking up whether some operations might not be properly supported (note: these references are for C++: https://fgiesen.wordpress.com/2016/04/03/sse-mind-the-gap/ or http://sci.tuomastonteri.fi/programming/sse), but that does not seem to be the case, because the laptop I work on supports SSE 4.2.

In the current code, the following changes are applied:

  • Using more appropriate instructions (packed min, for example).
  • Not using the float * vector operator (it causes a lot of additional operations; see the assembly of the original).

Code ... snippets?

Apologies for the large amount of code, but I am not sure how we can discuss this concretely without it.

Code of Ray -> BoundingBox

public bool CollidesWith(Ray ray, out float t)
{
    // https://gamedev.stackexchange.com/questions/18436/most-efficient-aabb-vs-ray-collision-algorithms
    // r.dir is unit direction vector of ray
    float dirfracx = 1.0f / ray.direction.X;
    float dirfracy = 1.0f / ray.direction.Y;
    // lb is the corner of AABB with minimal coordinates - left bottom, rt is maximal corner
    // r.org is origin of ray
    float t1 = (this.rx.min - ray.origin.X) * dirfracx;
    float t2 = (this.rx.max - ray.origin.X) * dirfracx;
    float t3 = (this.ry.min - ray.origin.Y) * dirfracy;
    float t4 = (this.ry.max - ray.origin.Y) * dirfracy;

    float tmin = Math.Max(Math.Min(t1, t2), Math.Min(t3, t4));
    float tmax = Math.Min(Math.Max(t1, t2), Math.Max(t3, t4));

    // if tmax < 0, ray (line) is intersecting AABB, but the whole AABB is behind us
    if (tmax < 0)
    {
        t = tmax;
        return false;
    }

    // if tmin > tmax, ray doesn't intersect AABB
    if (tmin > tmax)
    {
        t = tmax;
        return false;
    }

    t = tmin;
    return true;
}

Code of RayPack -> BoundingBoxPack

public Vector<int> CollidesWith(ref RayPack ray, out Vector<float> t)
{
    // ------------------------------------------------------- \\
    // compute the collision.                                  \\

    Vector<float> dirfracx = Constants.ones / ray.direction.x;
    Vector<float> dirfracy = Constants.ones / ray.direction.y;

    Vector<float> t1 = (this.rx.min - ray.origin.x) * dirfracx;
    Vector<float> t2 = (this.rx.max - ray.origin.x) * dirfracx;
    Vector<float> t3 = (this.ry.min - ray.origin.y) * dirfracy;
    Vector<float> t4 = (this.ry.max - ray.origin.y) * dirfracy;

    Vector<float> tmin = Vector.Max(Vector.Min(t1, t2), Vector.Min(t3, t4));
    Vector<float> tmax = Vector.Min(Vector.Max(t1, t2), Vector.Max(t3, t4));

    // A lane intersects iff tmax >= 0 (the box is not behind the ray) AND tmin <= tmax,
    // matching the two early-outs of the scalar version.
    Vector<int> notBehindMask = Vector.GreaterThanOrEqual(tmax, Constants.zeros);
    Vector<int> orderedMask = Vector.LessThanOrEqual(tmin, tmax);
    Vector<int> combinedMask = Vector.BitwiseAnd(notBehindMask, orderedMask);

    // ------------------------------------------------------- \\
    // Keep track of the t's that collided.                    \\

    t = Vector.ConditionalSelect(combinedMask, tmin, Constants.floatMax);

    return combinedMask;
}

Code of Ray -> Circle

public bool Intersect(Circle other)
{
    // Step 0: Work everything out on paper!

    // Step 1: Gather all the relevant data.
    float ox = this.origin.X;
    float dx = this.direction.X;

    float oy = this.origin.Y;
    float dy = this.direction.Y;

    float x0 = other.origin.X;
    float y0 = other.origin.Y;
    float cr = other.radius;

    // Step 2: compute the substitutions.
    float p = ox - x0;
    float q = oy - y0;

    float r = 2 * p * dx;
    float s = 2 * q * dy;

    // Step 3: compute the substitutions, check if there is a collision.
    float a = dx * dx + dy * dy;
    float b = r + s;
    float c = p * p + q * q - cr * cr;

    float DSqrt = b * b - 4 * a * c;

    // no collision possible! Commented out to make the benchmark more fair
    //if (DSqrt < 0)
    //{ return false; }

    // Step 4: compute the substitutions.
    float D = (float)Math.Sqrt(DSqrt);

    float t0 = (-b + D) / (2 * a);
    float t1 = (-b - D) / (2 * a);

    float ti = Math.Min(t0, t1);
    if(ti > 0 && ti < t)
    {
        t = ti;
        return true;
    }

    return false;
}

Code of RayPack -> CirclePack

The original, unedited code can be found at: https://pastebin.com/87nYgZrv

public Vector<int> Intersect(CirclePack other)
{
    // ------------------------------------------------------- \\
    // Put all the data on the stack.                          \\

    Vector<float> zeros = Constants.zeros;
    Vector<float> twos = Constants.twos;
    Vector<float> fours = Constants.fours;

    // ------------------------------------------------------- \\
    // Compute whether the ray collides with the circle. This  \\
    // is stored in the 'mask' vector.                         \\

    Vector<float> p = this.origin.x - other.origin.x;
    Vector<float> q = this.origin.y - other.origin.y;

    Vector<float> r = twos * p * this.direction.x;
    Vector<float> s = twos * q * this.direction.y;

    Vector<float> a = this.direction.x * this.direction.x + this.direction.y * this.direction.y;
    Vector<float> b = r + s;
    Vector<float> c = p * p + q * q - other.radius * other.radius;

    Vector<float> DSqrt = b * b - fours * a * c;

    Vector<int> maskCollision = Vector.GreaterThan(DSqrt, zeros);

    // Commented out to make the benchmark more fair.
    //if (Vector.Dot(maskCollision, maskCollision) == 0)
    //{ return maskCollision; }

    // ------------------------------------------------------- \\
    // Update t if and only if there is a collision. Take      \\
    // note of the conditional where we compute t.             \\

    Vector<float> D = Vector.SquareRoot(DSqrt);

    Vector<float> bMinus = Vector.Negate(b);
    Vector<float> twoA = twos * a;
    Vector<float> t0 = (bMinus + D) / twoA;
    Vector<float> t1 = (bMinus - D) / twoA;
    Vector<float> tm = Vector.ConditionalSelect(Vector.LessThan(t1, t0), t1, t0);

    Vector<int> maskBiggerThanZero = Vector.GreaterThan(tm, zeros);
    Vector<int> maskSmallerThanT = Vector.LessThan(tm, this.t);

    Vector<int> mask = Vector.BitwiseAnd(
        maskCollision,
        Vector.BitwiseAnd(
            maskBiggerThanZero,
            maskSmallerThanT)
            );

    this.t = Vector.ConditionalSelect(
        mask,                                                           // the bit mask that allows us to choose.
        tm,                                                             // the smallest of the t's.
        t);                                                             // if the bit mask is false (0), then we get our original t.

    return mask;
}
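As an aside on the commented-out early-out in the method above: Vector.Dot(mask, mask) == 0 is a fairly expensive way to ask whether every lane missed. System.Numerics also has Vector.EqualsAll, which may be cheaper; a small sketch (the AllMissed helper name is mine, not from the original code):

```csharp
// Sketch: a cheaper "did every lane miss?" test than Vector.Dot(mask, mask) == 0.
// Vector.EqualsAll compares all lanes at once and returns a single bool.
using System.Numerics;

static class EarlyOut
{
    public static bool AllMissed(Vector<int> maskCollision)
    {
        // True iff every lane of the collision mask is zero (no hits at all).
        return Vector.EqualsAll(maskCollision, Vector<int>.Zero);
    }
}
```

If the all-miss case is common in real scenes, branching on this could recover some of the scalar version's early-out advantage.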

Assembly code

These can be found on Pastebin. Take note that there is some boilerplate assembly from the benchmark tool; you need to look at the function calls.

Benchmarking

I've been benchmarking the situation with BenchmarkDotNet.

Results for Circle / CirclePack (updated):

|             Method |     Mean |     Error |    StdDev |
|------------------- |---------:|----------:|----------:|
|       Intersection | 9.710 ms | 0.0540 ms | 0.0505 ms |
| IntersectionPacked | 3.296 ms | 0.0055 ms | 0.0051 ms |

Results for BoundingBox / BoundingBoxPacked:

|             Method |      Mean |     Error |    StdDev |
|------------------- |----------:|----------:|----------:|
|       Intersection | 24.269 ms | 0.2663 ms | 0.2491 ms |
| IntersectionPacked |  1.152 ms | 0.0229 ms | 0.0264 ms |

Due to AVX, a speedup of roughly 6x-8x is expected. The speedup of the bounding box is significant, whereas the speedup of the circle is rather low.

Revisiting the question at the top: why is one speedup over-the-roof and the other rather low? And how can the lower of the two (CirclePack) be made faster?

Edit(s) with regard to Peter Cordes (comments)

  • Made the benchmark more fair: the single-ray version no longer branches out early as soon as the ray can no longer collide. Now the speedup is roughly 2.5x.
  • Added the assembly code as a separate header.
  • With regard to the square root: this does have an impact, but not as much as it seems. Removing the vector square root reduces the total time by about 0.3 ms. The single-ray code now always performs the square root too.
  • Question about FMA (fused multiply-add) in C#: I think it exists for scalars (see Can C# make use of fused multiply-add?), but I haven't found a similar operation within the System.Numerics.Vector struct.
  • About a C# instruction for packed min: yes, it exists. Silly me. I even used it already.
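On the FMA point in the edit list above: System.Numerics.Vector exposes no fused multiply-add on .NET Framework 4.7.2. Hardware FMA is exposed through System.Runtime.Intrinsics.X86.Fma, but only from .NET Core 3.0 onward. A sketch of how the discriminant b*b - 4*a*c could use it on such a runtime (the Discriminant helper name is mine; this is not available on the Framework version used in the question):

```csharp
// Sketch only: requires .NET Core 3.0+ / .NET 5+, NOT .NET Framework 4.7.2.
// Shows how b*b - (4*a)*c could use hardware FMA via System.Runtime.Intrinsics.
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class FmaSketch
{
    // Computes b*b - (4*a)*c per lane, using FMA when the CPU supports it.
    public static Vector256<float> Discriminant(Vector256<float> a,
                                                Vector256<float> b,
                                                Vector256<float> c)
    {
        Vector256<float> fourAC = Avx.Multiply(Avx.Multiply(Vector256.Create(4.0f), a), c);
        if (Fma.IsSupported)
        {
            // vfmsub: (b * b) - fourAC, rounded once.
            return Fma.MultiplySubtract(b, b, fourAC);
        }
        // Fallback without FMA: separate multiply and subtract.
        return Avx.Subtract(Avx.Multiply(b, b), fourAC);
    }
}
```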
Willem124
    What hardware? (CPU microarchitecture = Haswell? Skylake? Ryzen? And don't just say "i7" or something, that would be useless.) What compiler+runtime / version / options? Presumably you're using Release mode if you were able to get good speedups with some code. – Peter Cordes Jul 09 '19 at 11:56
  • My apologies! I've been over this for a few days now, the question appeared narrow to me but I think a form of tunnel vision got to me. It is indeed rather broad. With regard to the major rules of SO, is this the topic I should be referring to: https://stackoverflow.com/help/how-to-ask for in the future? – Willem124 Jul 09 '19 at 11:58
  • 1
    @Hille: looks pretty much fine to me (the problem is missing key details, not generally too broad); it's a specific performance question about exactly why one thing only gets a small speedup over scalar, and how to optimize Circle intersection better with SIMD. That's definitely answerable if we have the JIT-compiled asm and the details of what CPU it runs on. (This is a fairly good performance question: including a scalar baseline for each is very good, and using a good microbenchmark framework) – Peter Cordes Jul 09 '19 at 12:04
  • @PeterCordes: Added in the additional information (system / projects). I'm not fully aware on how to gain the JIT-compiled assembly code, I'll search around for that. I'll update the post once I have it. Edit: I added the information under 'context'. – Willem124 Jul 09 '19 at 12:09
  • 1
    @Willem: The scalar version of Circle intersect has an early-out branch. Is that mostly taken with your randomized test data, avoiding running a `sqrt` instruction at all? `vsqrtps ymm` has *much* worse throughput than FMA, depending on the hardware. The worst case would be Intel Haswell (2 per clock 256-bit FMA throughput, but without Broadwell and Skylake's improvements to the divider.) Does C# have a "fast math" option that lets it contract mul + add operations in the source into FMA in the asm? And/or let it use an `rsqrtps` approximation? (I know C++ and x86 asm / performance.) – Peter Cordes Jul 09 '19 at 12:09
  • 1
    Heh, you do have a Haswell. Might well be a bottleneck on `vsqrtps ymm` throughput of 1 per 14 cycles; See [Floating point division vs floating point multiplication](//stackoverflow.com/a/45899202) - FP sqrt throughput and latency is pretty similar to FP division on most CPUs, and identical on Haswell. Your SIMD version does it all branchlessly instead of attempting an early-out (which would only work if all 8 vector elements had a negative value there.) – Peter Cordes Jul 09 '19 at 12:15
  • 1
    BTW, AVX implies SSE4.2 and earlier Intel SSE extensions. (Not AMD SSE4a or AMD XOP of course.) The 256-bit version of `Vector.ConditionalSelect` ([`vblendvps ymm`](https://www.felixcloutier.com/x86/blendvps)) requires AVX, not SSE4.1 anyway. – Peter Cordes Jul 09 '19 at 12:21
  • 1
    `Vector.ConditionalSelect( Vector.LessThan(...), ...)` Does C# not have an intrinsic for packed-min? SSE/AVX has `minps` in hardware, but I'd worry that this source would trick the compiler into doing something horrible like `cmpps` -> `blendvps` – Peter Cordes Jul 09 '19 at 12:24
  • 2
    Your speedup of > 8x for the BoundingBox function indicates that the scalar version wasn't doing well. Probably you were getting branch mispredicts, so branchless was a huge win there. Writing the scalar source differently (maybe with ternary or boolean operations on compare results) instead of `if()` statements might avoid that, depending on how well the JIT does at using instructions like `cmpss` instead of FP compare into integer FLAGS with `ucomiss` – Peter Cordes Jul 09 '19 at 12:31
  • 3
    @PeterCordes: Thank you for your elaborate response! I've read the stackoverflow article that you linked too. Quite informative! The speed up is still not x8, but with the more realistic benchmark (no early-branch-out) it is already quite closer. Also learning about the assembly code was quite informative too. Do you have any further suggestions? – Willem124 Jul 09 '19 at 16:44
  • https://agner.org/optimize/ is *excellent*. See https://stackoverflow.com/tags/x86/info for other links to performance info. But keep in mind, if your real-life use cases *do* allow an early-out that predicts decently at all for scalar, that's something a scalar implementation *should* take advantage of but a SIMD implementation can't. It's not unfair unless your benchmark was letting the scalar version take the early out *way* too often. – Peter Cordes Jul 10 '19 at 00:45
  • 2
    Hi, do you still have this problem 3 years later, since there is nothing marked as an answer, although nobody put anything there, but you had help. – majixin Jul 11 '22 at 13:26
  • @majixin - Indeed a good question. There were 13 attempts to answer it, all of them got deleted by moderators. – Matt Jan 20 '23 at 06:55
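One of the comments above suggests writing the scalar AABB test branchlessly, with boolean operations on compare results instead of if() statements. A minimal self-contained sketch of what that could look like (same slab math as the question's scalar version, with the struct fields passed as plain floats; whether the JIT actually emits cmov/blend instead of branches is not guaranteed):

```csharp
using System;

static class AabbBranchless
{
    // Branchless slab test: returns true and the entry distance tmin on a hit.
    // Inputs are the ray origin/direction and the box min/max corners.
    public static bool Collides(float ox, float oy, float dx, float dy,
                                float minX, float minY, float maxX, float maxY,
                                out float t)
    {
        float dirfracx = 1.0f / dx;
        float dirfracy = 1.0f / dy;

        float t1 = (minX - ox) * dirfracx;
        float t2 = (maxX - ox) * dirfracx;
        float t3 = (minY - oy) * dirfracy;
        float t4 = (maxY - oy) * dirfracy;

        float tmin = Math.Max(Math.Min(t1, t2), Math.Min(t3, t4));
        float tmax = Math.Min(Math.Max(t1, t2), Math.Max(t3, t4));

        // Non-short-circuiting '&' combines both tests without an extra branch.
        bool hit = (tmax >= 0f) & (tmin <= tmax);
        t = hit ? tmin : tmax;
        return hit;
    }
}
```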

1 Answer

I am not going to try to answer the question about SIMD speedup, but will instead give some detailed comments on poor coding in the scalar version that carried over to the vector version, in a way that doesn't fit in an SO comment.

This code in Intersect(Circle) is just absurd:

// Step 3: compute the substitutions, check if there is a collision.
float a = dx * dx + dy * dy;

sum of squares -> a is guaranteed non-negative

float b = r + s;
float c = p * p + q * q - cr * cr;

float DSqrt = b * b - 4 * a * c;

This isn't the square root of D, it's the square of D.

// no collision possible! Commented out to make the benchmark more fair
//if (DSqrt < 0)
//{ return false; }

// Step 4: compute the substitutions.
float D = (float)Math.Sqrt(DSqrt);

sqrt has a limited domain. Avoiding the call for negative input doesn't just save the average cost of the square root, it prevents you from having to handle NaNs which are very, very slow.

Also, D is non-negative, since Math.Sqrt returns either the positive branch or NaN.

float t0 = (-b + D) / (2 * a);
float t1 = (-b - D) / (2 * a);

The difference between these two is t0 - t1 = D / a. The ratio of two non-negative variables is also non-negative. Therefore t1 is never larger.

float ti = Math.Min(t0, t1);

This call always selects t1. Computing t0 and testing which is larger is a waste.

if(ti > 0 && ti < t)
{
    t = ti;
    return true;
}

Recalling that ti is always t1, and a is non-negative, the first test is equivalent to -b - D > 0 or b < -D.
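Folding the observations above back into the scalar routine gives something like the following sketch (a self-contained variant with the question's member fields passed as parameters; not the author's code): only the smaller root is computed, and sqrt is skipped entirely when the discriminant is negative.

```csharp
using System;

static class CircleHit
{
    // Scalar ray/circle test with the redundancy above removed.
    public static bool Intersect(float ox, float oy, float dx, float dy,
                                 float x0, float y0, float cr,
                                 ref float t)
    {
        float p = ox - x0;
        float q = oy - y0;

        float a = dx * dx + dy * dy;
        float b = 2f * (p * dx + q * dy);
        float c = p * p + q * q - cr * cr;

        float disc = b * b - 4f * a * c;    // the quantity called DSqrt above
        if (disc < 0f) return false;        // keeps Math.Sqrt in its domain, no NaNs

        float D = (float)Math.Sqrt(disc);
        float t1 = (-b - D) / (2f * a);     // always the smaller root, since D/a >= 0

        if (t1 > 0f && t1 < t)
        {
            t = t1;
            return true;
        }
        return false;
    }
}
```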


In the SIMD version, the Vector.SquareRoot documentation does not describe the behavior when inputs are negative, and Vector.LessThan does not describe the behavior when inputs are NaN.
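For what it's worth, both cases can be probed directly. On IEEE-754 hardware one would expect the square root of a negative lane to be NaN and any comparison against NaN to be false; this is observed behavior, not a documented guarantee (the NanProbe helper is mine):

```csharp
using System;
using System.Numerics;

static class NanProbe
{
    // Probes the undocumented cases: SquareRoot of a negative lane, and
    // LessThan when a lane is NaN. IEEE-754 predicts NaN and all-false.
    public static (bool sqrtIsNaN, bool lessThanIsFalse) Probe()
    {
        Vector<float> neg = new Vector<float>(-1.0f);
        Vector<float> root = Vector.SquareRoot(neg);
        bool sqrtIsNaN = float.IsNaN(root[0]);

        Vector<int> cmp = Vector.LessThan(root, Vector<float>.Zero);
        bool lessThanIsFalse = cmp[0] == 0;   // NaN compares are always false

        return (sqrtIsNaN, lessThanIsFalse);
    }
}
```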

Ben Voigt
  • *NaNs which are very, very slow.* - That's actually not the case for SSE/AVX. Subnormals have a penalty, but unlike x87, NaN input produces NaN output with the same performance as finite inputs. (This is true on all Intel CPUs at least.) e.g. [Huge performance difference (26x faster) when compiling for 32 and 64 bits](https://stackoverflow.com/q/31875464) shows this effect, where 32-bit code used x87. (And all values in the benchmark are NaN, due to careless initialization.) Heh, my answer from 2015 was not very confident. – Peter Cordes Jan 25 '23 at 16:28
  • I wonder if some of the scalar version optimized away some of that redundancy, leading to a smaller SIMD speedup. Or maybe branchy scalar just avoided some work (with no mispredictions) vs. branchless SIMD. Branch mispredicts are probably what made the scalar rectangle boundingbox extra slow, leaving room for SIMD to give bigger speedups than 8x. Oh, I [commented the same thing](https://stackoverflow.com/questions/56951793/c-sharp-and-simd-high-and-low-speedups-what-is-happening/75236750#comment100443703_56951793) when the question was new. – Peter Cordes Jan 25 '23 at 16:33
  • SIMD 256-bit sqrt throughput [isn't 8x scalar](https://stackoverflow.com/a/45899202), so use of sqrt might well be a big part of why they didn't get 8x. – Peter Cordes Jan 25 '23 at 16:35