3

I have some integer value representing a bitmask, for example 154 = 0b10011010, and I want to construct a corresponding signal Vector<T> instance <0, -1, 0, -1, -1, 0, 0, -1> (note the least/most significant bit order is reversed).

Is there any more performant way than this (beyond unrolling the loop)?

int mask = 0b10011010;// 154
// -1 = (unsigned) 0xFFFFFFFF is the "all true" value
Vector<int> maskVector = new Vector<int>(
    Enumerable.Range(0, Vector<int>.Count)
        .Select(i => (mask & (1 << i)) > 0 ? -1 : 0)
        .ToArray());
// <0, -1, 0, -1, -1, 0, 0, -1>
string maskVectorStr = string.Join("", maskVector);

(Note the debugger is bugged displaying Vector<T> values, showing only half of the components and the rest as zeros, hence my use of string.Join.)

Namely, is there a way with only a single instruction? It is the Single Instruction, Multiple Data (SIMD) datatype after all!

Furthermore, how I can do this when working with the generic Vector<T> version?


A signal vector, or integral mask vector, is used with the ConditionalSelect method to choose between values of two other masks:

//powers of two <1, 2, 4, 8, 16, 32, 64, 128>
Vector<int> ifTrueVector = new Vector<int>(Enumerable.Range(0, Vector<int>.Count).Select(i => 1 << i).ToArray());
Vector<int> ifFalseVector = Vector<int>.Zero;// or some other vector
// <0, 2, 0, 8, 16, 0, 0, 128>
Vector<int> resultVector = Vector.ConditionalSelect(maskVector, ifTrueVector, ifFalseVector);
string resultStr = string.Join("", resultVector);
// our original mask value back
int sum = Vector.Dot(resultVector, Vector<int>.One);

enter image description here


The documentation of ConditionalSelect explicitly says the mask vector has integral values for every overload, but spamming Vector<T>.Zero[0] and Vector<T>.One[0] to get them is surely improper? (And you can get the T version of -1 with (-Vector<T>.One)[0])

P.S. would there also be a corresponding solution to populating with powers of 2?

Elaskanator
  • 1,135
  • 10
  • 28
  • How did you get from `0b10011010` to `<0, -1, 0, -1, -1, 0, 0, -1>`? – canton7 Dec 21 '21 at 13:17
  • 1
    Reverse the order. The least significant bit is the first vector component. – Elaskanator Dec 21 '21 at 13:17
  • (Note that your `ConditionalSelect` could use be an `&`, I think? That means no need for `ifFalseVector`. [See the implementation](https://source.dot.net/#System.Private.CoreLib/Vector.cs,38584787341a65da)) – canton7 Dec 21 '21 at 13:22
  • I do not see anything wrong with your code. The operation requires looping 16 times. – jdweng Dec 21 '21 at 13:25
  • This is occurring in a performance-critical part of a rendering engine, which otherwise only performs two `ConditionalSelect` statements from the bitmask (and sometimes two subtractions), and intrinsics only take a couple CPU cycles. – Elaskanator Dec 21 '21 at 13:28
  • @canton7 this is simplified for my question. I'm using two nontrivial vectors with `ConditionalSelect`. – Elaskanator Dec 21 '21 at 13:29
  • 1
    Can you use `System.Runtime.Intrinsics.X86`? If so, there's x86 specific efficient solution – Alex Guteniev Dec 21 '21 at 18:03
  • As @Alex said, if you can use x86 intrinsics, see [is there an inverse instruction to the movemask instruction in intel avx2?](https://stackoverflow.com/q/36488675) for a list of strategies for different mask width / vec element sizes, with C++ intrinsics that you can port. – Peter Cordes Dec 21 '21 at 20:11
  • Just curious, why did you accept no answer? – aepot Jan 26 '22 at 09:21
  • I need to spend some quality time testing the provided answers first. They both look quite good. – Elaskanator Feb 08 '22 at 17:20

2 Answers2

4

It is possible to do the complement of it with the vector operations offered by the Vector API, and equivalent variants could be made for any T, for example like this for T = int:

Vector<int> powersOfTwo;

Vector<int> MaskToElements(int mask)
{
    Vector<int> broadcasted = new Vector<int>(mask);
    Vector<int> singlebit = Vector.BitwiseAnd(broadcasted, powersOfTwo);
    return Vector.Equals(singlebit, powersOfTwo);
}

Where powersOfTwo is created like this:

int[] powersOfTwoArray = new int[Vector<int>.Count];
for (int i = 0; i < powersOfTwoArray.Length; i++)
{
    powersOfTwoArray[i] = 1 << i;
}
powersOfTwo = new Vector<int>(powersOfTwoArray);

It's not really generic, even though it would work for different types, largely because powersOfTwo must be precomputed otherwise this function won't make any sense: if there is a scalar loop in it anyway, what's the point.


If the mask is a constant, then with the System.Runtime.Intrinsics.X86 API it can go directly into a Blend, it doesn't need to be turned into a vector mask first, for example:

Vector128.AsInt32(Sse41.Blend(Vector128.AsSingle(a), Vector128.AsSingle(b), (byte)mask));

If the mask is not a constant, that API will still accept it, but it ends up calling a slow fallback. In that case, it's better to make a vector mask and use BlendVariable.

harold
  • 61,398
  • 6
  • 86
  • 164
  • 1
    Other strategies for other element widths shown in: [is there an inverse instruction to the movemask instruction in intel avx2?](https://stackoverflow.com/q/36488675). A different strategy is required for small elements, when the bitmask is wider than an element. e.g. 16 bits -> vector of 16x bytes. – Peter Cordes Dec 21 '21 at 20:14
3

There might be a fancy vector-based way to generate your mask vector, but simply optimizing your current code can speed things up by over an order of magnitude.

Firstly, don't use Linq on hot paths. The number of intermediate object allocations, virtual method calls and delegate invocations going on there is simply unnecessary if you're looking for speed. You can rewrite this as a for loop with no loss of clarity.

Secondly, get rid of that array allocation. Vector<T> has constructors which take a Span<T>, and you can stackalloc one of those.

That gives you some code which looks a bit like this:

int mask = 0b10011010;

Span<int> values = stackalloc int[Vector<int>.Count];
for (int i = 0; i < Vector<int>.Count; i++)
{
    values[i] = (mask & (1 << i)) > 0 ? -1 : 0;
}

var maskVector = new Vector<int>(values);

Interestingly, manually unrolling that loop gives you another significant speed-up:

Span<int> values = stackalloc int[Vector<int>.Count];
values[0] = (mask & 0x1) > 0 ? -1 : 0;
values[1] = (mask & 0x2) > 0 ? -1 : 0;
values[2] = (mask & 0x4) > 0 ? -1 : 0;
values[3] = (mask & 0x8) > 0 ? -1 : 0;
values[4] = (mask & 0x10) > 0 ? -1 : 0;
values[5] = (mask & 0x20) > 0 ? -1 : 0;
values[6] = (mask & 0x40) > 0 ? -1 : 0;
values[7] = (mask & 0x80) > 0 ? -1 : 0;

var maskVector = new Vector<int>(values);

How does this perform? Let's use BenchmarkDotNet:

[MemoryDiagnoser]
public class MyBenchmark
{
    [Benchmark, Arguments(0b10011010)]
    public Vector<int> Naive(int mask)
    {
        Vector<int> maskVector = new Vector<int>(
            Enumerable.Range(0, Vector<int>.Count)
                .Select(i => (mask & (1 << i)) > 0 ? -1 : 0)
                .ToArray());

        return maskVector;
    }

    [Benchmark, Arguments(0b10011010)]
    public Vector<int> Optimised(int mask)
    {
        Span<int> values = stackalloc int[Vector<int>.Count];
        for (int i = 0; i < Vector<int>.Count; i++)
        {
            values[i] = (mask & (1 << i)) > 0 ? -1 : 0;
        }

        var output = new Vector<int>(values);
        return output;
    }

    [Benchmark, Arguments(0b10011010)]
    public Vector<int> Optimised2(int mask)
    {
        Span<int> values = stackalloc int[Vector<int>.Count];
        values[0] = (mask & 0x1) > 0 ? -1 : 0;
        values[1] = (mask & 0x2) > 0 ? -1 : 0;
        values[2] = (mask & 0x4) > 0 ? -1 : 0;
        values[3] = (mask & 0x8) > 0 ? -1 : 0;
        values[4] = (mask & 0x10) > 0 ? -1 : 0;
        values[5] = (mask & 0x20) > 0 ? -1 : 0;
        values[6] = (mask & 0x40) > 0 ? -1 : 0;
        values[7] = (mask & 0x80) > 0 ? -1 : 0;

        var output = new Vector<int>(values);
        return output;
    }
}

public class Program
{
    public static void Main()
    {
        var summary = BenchmarkRunner.Run<MyBenchmark>();
    }
}

This gives the results:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1415 (21H2)
Intel Core i7-8565U CPU 1.80GHz (Whiskey Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=5.0.101
  [Host]     : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
  DefaultJob : .NET 5.0.1 (5.0.120.57516), X64 RyuJIT
Method mask Mean Error StdDev Gen 0 Allocated
Naive 154 103.018 ns 2.0509 ns 4.0001 ns 0.0554 232 B
Optimised 154 13.405 ns 0.3004 ns 0.4497 ns - -
Optimised2 154 9.668 ns 0.2827 ns 0.8245 ns - -
canton7
  • 37,633
  • 3
  • 64
  • 77
  • Dang I was hoping for some obvious, or special-purpose function to do it directly, but also dang good speedup! – Elaskanator Dec 21 '21 at 14:05
  • 1
    @Elaskanator Turns out unrolling that loop gets you another decent speed-up – canton7 Dec 21 '21 at 14:08
  • Is there any hope of generalizing this to the generic `Vector` and not just `int`? And what if the value of `Vector.Count` isn't 8 one day? – Elaskanator Dec 21 '21 at 14:14
  • 2
    @Elaskanator Yeah, that's where the unrolled loop falls down. But then, if it decreases one day your code's already in trouble. I assume you're thinking of letting `T` be another integer type? Something like this? https://dotnetfiddle.net/6zwjXW – canton7 Dec 21 '21 at 14:20