
I would like to speed up the following calculation using SIMD:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
static double Dot(double x1, double x2, double y1, double y2)
{
    return x1 * y1 + x2 * y2;
}

I saw that there is `Vector2.Dot`, but it only works with `float`s, not with `double`s.

I cannot switch to .NET Core, and therefore I cannot use `Vector128.Create(double e0, double e1)`.
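(For what it's worth, if I'm not mistaken the `System.Numerics.Vectors` NuGet package brings `Vector<T>` to .NET Framework, so something like the following sketch would compile there. `DotVector` is just an illustrative name, and for a single two-element dot product the setup cost almost certainly outweighs any SIMD gain:)

```csharp
using System.Numerics; // via the System.Numerics.Vectors NuGet package on .NET Framework

static double DotVector(double x1, double x2, double y1, double y2)
{
    // Vector<double>.Count is hardware-dependent (2 with SSE2, 4 with AVX),
    // so zero-pad buffers to the full hardware width.
    var a = new double[Vector<double>.Count];
    var b = new double[Vector<double>.Count];
    a[0] = x1; a[1] = x2;
    b[0] = y1; b[1] = y2;
    // Vector.Dot multiplies element-wise and sums horizontally;
    // the zero-padded lanes contribute nothing.
    return Vector.Dot(new Vector<double>(a), new Vector<double>(b));
}
```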

The above `Dot` method is used in the following code, which computes a union of two sorted lists (id arrays):

static void ChainRule(int x_n, int[] x_id, double[] x_jacobi, double x_diff,
                      int y_n, int[] y_id, double[] y_jacobi, double y_diff,
                      out int z_n, out int[] z_id, out double[] z_jacobi)
{
    int n = x_n + y_n;
    z_id = new int[n];
    z_jacobi = new double[n];
    int i = 0, ix = 0, iy = 0;
    while (ix < x_n && iy < y_n)
    {
        if (x_id[ix] < y_id[iy])
        {
            z_id[i] = x_id[ix];
            z_jacobi[i++] = x_diff * x_jacobi[ix++];
        }
        else if (y_id[iy] < x_id[ix])
        {
            z_id[i] = y_id[iy];
            z_jacobi[i++] = y_diff * y_jacobi[iy++];
        }
        else
        {
            z_id[i] = x_id[ix];
            z_jacobi[i++] = Dot(x_diff, y_diff, x_jacobi[ix++], y_jacobi[iy++]);
        }
    }
    while (ix < x_n)
    {
        z_id[i] = x_id[ix];
        z_jacobi[i++] = x_diff * x_jacobi[ix++];
    }
    while (iy < y_n)
    {
        z_id[i] = y_id[iy];
        z_jacobi[i++] = y_diff * y_jacobi[iy++];
    }
    z_n = i;
}
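(One part of this routine that does look SIMD-friendly to me is the two tail loops, which are plain scale-and-copy over contiguous elements. A hedged sketch for the `y` tail, assuming `Vector<double>` from the `System.Numerics.Vectors` package and leaving the merge loop itself scalar; the `x` tail would be analogous:)

```csharp
// Sketch: replaces the final  while (iy < y_n)  loop.
// The id copy is a plain block copy, so Array.Copy handles it.
Array.Copy(y_id, iy, z_id, i, y_n - iy);

int w = Vector<double>.Count;               // hardware-dependent lane count
var diffVec = new Vector<double>(y_diff);   // broadcast y_diff into every lane
for (; iy + w <= y_n; iy += w, i += w)
    (diffVec * new Vector<double>(y_jacobi, iy)).CopyTo(z_jacobi, i);

for (; iy < y_n; iy++, i++)                 // scalar remainder
    z_jacobi[i] = y_diff * y_jacobi[iy];
```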

I also tried precomputing the product of `x_diff` and `x_jacobi` and the product of `y_diff` and `y_jacobi` with the following code:

    double[] x_diff_jacobi = new double[x_n];
    for (int i0 = 0; i0 < x_n; i0++)
        x_diff_jacobi[i0] = x_diff * x_jacobi[i0];
    double[] y_diff_jacobi = new double[y_n];
    for (int i0 = 0; i0 < y_n; i0++)
        y_diff_jacobi[i0] = y_diff * y_jacobi[i0];

This simplifies the calculation of `z_jacobi` to e.g. `z_jacobi[i++] = x_diff_jacobi[ix++] + y_diff_jacobi[iy++]`. However, this version runs slower than the one above. I think the problem is the extra pass needed to initialize the additional arrays `x_diff_jacobi` and `y_diff_jacobi`.
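(The precompute passes themselves are at least trivially vectorizable with `Vector<double>`, roughly as below; `Scale` is an illustrative name of mine. This only helps if the scaled arrays are reused enough times to amortize the extra pass, which matches the observation that a single extra pass makes things slower overall:)

```csharp
using System.Numerics; // System.Numerics.Vectors NuGet package

static double[] Scale(double diff, double[] jacobi)
{
    var result = new double[jacobi.Length];
    int w = Vector<double>.Count, i = 0;
    var diffVec = new Vector<double>(diff);  // broadcast the scalar factor
    for (; i + w <= jacobi.Length; i += w)
        (diffVec * new Vector<double>(jacobi, i)).CopyTo(result, i);
    for (; i < jacobi.Length; i++)           // scalar remainder
        result[i] = diff * jacobi[i];
    return result;
}
```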

Any other ideas to speed up this code?

Wollmich
    You could compile some C to run the intrinsic, but you will then suffer from P/Invoke overheads. – Aron Nov 22 '22 at 09:36
    Looking at the [documentation](https://learn.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.sse2.multiplyscalar?view=net-7.0#system-runtime-intrinsics-x86-sse2-multiplyscalar(system-runtime-intrinsics-vector128((system-double))-system-runtime-intrinsics-vector128((system-double)))) the operation you need requires Core 3.0 or higher. – Aron Nov 22 '22 at 09:39
  • Why not use `Parallel.For` as a stopgap? – Dai Nov 22 '22 at 09:52
  • @Dai, because the `Dot` method is called in a `While` loop. I'll edit my question and I'll post the calling code. – Wollmich Nov 22 '22 at 10:07
    That is not possible. Nice to have a website to ask for tricky code, but one still has to understand what can be done. SIMD = single instruction multiple data, this method signature does not have multiple data. Turn the parameters into arrays and you'll have a shot. [Example](https://stackoverflow.com/questions/51225026/vectordouble-weak-simd-performance). – Hans Passant Nov 22 '22 at 10:20
    If there are many runs where you're taking 2 or 4 elements from one array or the other, there could be some SIMD speedup. (and for the cleanup after you exhaust one array). And there are SIMD merge-sorts (using min/max as a comparator in a sorting network, also requiring shuffles like `shufps` so that may rule out C# generic SIMD without Avx.shufpd) but they don't go one element at a time and it might be hard to add duplicate detection. https://github.com/WojciechMula/simd-sort . You might need a stable sort (https://www.reddit.com/r/simd/comments/r3yfmm/faster_sorting_with_sorting_networks/). – Peter Cordes Nov 22 '22 at 14:36
  • Even detecting that `x_id[ix] < y_id[iy]` is true for four consecutive elements is a pain, I guess you'd have to broadcast `y_id[iy]` and do a SIMD compare for against `x_id[ix + 0..3]` (reversing the operands to `pcmpgtd` to check less-than). Then `pmovmskb` and check that the result was `0xffff`. And if it wasn't, you'd have to broadcast the low x element and check contiguous elements from the other. So it's a lot of work that only pays off if it's often true and you can SIMD load+scale+store multiple elements. – Peter Cordes Nov 22 '22 at 14:40
    Do you have any information about `x_id` and `y_id`? Are they sorted/(strictly) monotonically increasing? How likely do you have `x_id[ix]==y_id[iy]`? – chtz Nov 22 '22 at 16:54
    It looks like you compute `x_diff * x_jacobi[ix++]` for every `ix` at some point (and the same for `y_jacobi`). Can you do that computation at the point where you compute `x_jacobi` and lose the `x_diff` and `y_diff` parameters? Your `Dot` function would then reduce to a simple `x_jacobi[ix++] + y_jacobi[iy++]`. – chtz Nov 22 '22 at 16:59
  • @chtz The `x_id` and `y_id` are sorted. And I would say half of the values are in `x_id` and `y_id`. – Wollmich Nov 23 '22 at 09:45
  • @HansPassant For my second try using the intermediate arrays `x_diff_jacobi` and `y_diff_jacobi` I could use SIMD, right? But the problem then is the initialization of the additional arrays. – Wollmich Nov 23 '22 at 10:27
  • I'm not surprised that it slows down doing a separate pass to scale the inputs. Those multiplies can happen in the shadow of branch mispredicts on the integer compares, with out-of-order exec able to hide the scale+copy or dot() work. Since later loop iterations don't depend on it, an extra uop with like 4 cycles of latency doesn't create any extra stall. @chtz was suggesting doing that on the fly in the code that writes those arrays in the first place. If that's not possible, not a big deal, although it does use more resources for another thread sharing the same physical core. – Peter Cordes Nov 23 '22 at 10:29
  • TL:DR: out-of-order exec means those extra multiplies are probably (almost) free in terms of throughput cost when you do them as part of this merge loop that probably has frequent branch mispredicts. They're not on the critical path, and "fast recovery" means branch mispredict recovery can start before waiting for all older instructions to retire. – Peter Cordes Nov 23 '22 at 10:31

0 Answers