
I'm looking into Vector<T> in the System.Numerics.Vectors namespace from version 4.5.0-preview1-26216-02. MSDN documentation says:

Vector<T> is an immutable structure that represents a single vector of a specified numeric type. The count of a Vector<T> instance is *fixed*, but its upper limit is CPU-register dependent.
https://learn.microsoft.com/en-us/dotnet/api/system.numerics.vector-1 (emphasis added)

Even overlooking the misguided wording "count [sic] of a Vector", this sentence seems quite unclear, since it implies that different Vector<T> instances might have different "counts", albeit each "fixed" up to some CPU limit (and again, a so-called 'count' of what, exactly?). There's no mention of an actual Count property here (or, in fact, anywhere on the intro page).

Now normally, I think "read-only" or "immutable" are more conventionally used than "fixed" for describing an instance's properties or fields, but in this case it turns out that the Vector<T>.Count property, while indeed read-only, is also static, and thus in no way associated with any Vector<T> instance. Instead, its value varies only according to the generic type argument T (and then, presumably, from machine to machine, as indicated):

using System.Diagnostics;
using System.Numerics;

bool hw = Vector.IsHardwareAccelerated;    // --> true

var c = (Vector<sbyte>.Count, Vector<short>.Count, Vector<int>.Count, Vector<long>.Count);
Debug.WriteLine(c);    // -->  (16, 8, 4, 2)

Oh.

So is it basically System.Int128 in disguise? And is that it? My questions are:

  • Am I missing something? It's true that I knew next to nothing about SIMD, but I thought this library would allow the use of much wider hardware-accelerated datatypes than just 128 bits. My HPSG parsing engine routinely performs intensive bitwise computations on vectors of 5,000+ bits.
  • Again assuming I'm not missing the point, why not just call it System.Int128 / System.UInt128 instead of Vector<T>? Parameterizing it with the generic primitive types does offer certain benefits, but it gave me the wrong idea that this was a usefully expansive array (namely, of blittable elements T), as opposed to just a double-width CPU register, which to my mind is about as "scalar" as you can get.

    Don't get me wrong, a 128-bit register is interesting, useful, and exciting stuff--if perhaps just a bit oversold here? For example, Vector<byte> is going to have 16 elements no matter what, regardless of whether you need or use them all, so the spirit of a Count that is expected to vary by instance at runtime seems misapplied here.

  • Even though a single Vector<T> won't directly handle the use case I described as I was hoping, would it be worth updating my current implementation (which uses a ulong[N >> 6] array for each N-bit vector) to instead use a Vector<ulong>[N >> 7] array?

    ...yes, that's an "array of Vector<ulong>", which again seems strange to me; shouldn't a type with "Vector" in its name be sufficiently or usefully extensible without having to explicitly create an array to wrap multiple instances?

  • Beyond the fact that each 128-bit SIMD bitwise operation is processing twice as much data, are SIMD bitwise operations additionally faster (or slower) in cycles per opcode as well?
  • Are there other hardware platforms in common use or availability today where System.Numerics.Vectors actually reports a different SIMD bit-width?
Glenn Slayden
  • I know little about SIMD as well, but the byte count is limited by hardware. I assume the library could handle much larger vectors with available hardware. – glenebob Apr 05 '18 at 01:34
  • you might want to have a look at what's possible with SIMD registers (SSE, AVX, AVX2, AVX512) at https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=fma&techs=AVX2,FMA Keep in mind that Vector is CPU independent, so it will work on AARCH64 as well with NEON instructions. – Regis Portalez Apr 06 '18 at 08:06
  • @RegisPortalez Oh interesting, I wasn't aware of additional hardware targets with `System.Numerics.Vectors`. Is that support integrated in current .NET versions and/or the same code branch of the library? What are the reported `Vector` sizes on those platforms? – Glenn Slayden Apr 10 '18 at 22:43
  • That’s the claimed goal, at least. The size will be the vector register size. – Regis Portalez Apr 11 '18 at 06:19

1 Answer


The vector size is not always 16 bytes, though that is very common. For example, on a platform with AVX2, programs running in 64-bit mode get 32-byte vectors. In this way the Count property can also vary on the same machine (for the same T), by running the program in different modes. In principle it wouldn't have to be like that; a 32-bit program could still use 256-bit operations even with just AVX1 support, but that's not how System.Numerics.Vectors works. The varying size per feature level of the CPU is a fairly fundamental part of the design of the API, presumably to enable some form of future-proofing, though it perhaps contributed to the lack of shuffles (which would be hard to specify for a vector of not-statically-known size).
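
You can see this directly with a minimal probe (a sketch; the 256-bit result assumes a 64-bit process on AVX2 hardware, while SSE2-only targets report 128):

using System;
using System.Numerics;

// Count is static: fixed for the lifetime of the process, but chosen by the
// JIT per machine and per process mode (32-bit vs. 64-bit).
Console.WriteLine(Vector.IsHardwareAccelerated);
Console.WriteLine(Vector<byte>.Count * 8);    // vector width in bits: 128 or 256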

I thought this library would allow the use of much wider hardware-accelerated datatypes than just 128 bits

That doesn't exist in the hardware, so it would be hard to offer. AVX-512 goes up to 512 bits, as the name implies, but that's as far as SIMD on mainstream CPUs goes for now.

why not just call it System.Int128 / System.UInt128

I would expect those types to map to actual integer types, not vector types. Many operations that would make sense on a 128-bit integer do not actually exist as CPU instructions, and almost all operations that do exist operate on 2 × 64 (Vector<long>, long[2]), 4 × 32 (Vector<int>, int[4]), 8 × 16 (Vector<short>, short[8]) or 16 × 8 (Vector<byte>, byte[16]) bit vectors (or double those widths on platforms that support it). Offering a "byte-wise add" operation on an Int128 would be strange, and not offering true 128-bit addition would make it even stranger. Besides, as mentioned earlier, the size is not by definition 128 bits; that's just common.
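
To illustrate the difference (a minimal sketch): vector addition never carries across lanes, which a true Int128 addition would have to do:

using System;
using System.Numerics;

var a = new Vector<ulong>(ulong.MaxValue);    // broadcast 2^64 - 1 into every lane
var b = Vector<ulong>.One;

// Each 64-bit lane overflows and wraps to 0 independently; no carry
// propagates into the neighboring lane.
Console.WriteLine(a + b);    // --> <0, 0> (or <0, 0, 0, 0> at 256 bits)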

Many SIMD operations are quite fast, though there are some exceptions; 32-bit multiplication, for example, typically has rather extreme latency. The System.Numerics.Vectors API also allows some non-existent operations (which must be emulated slowly, such as integer division or byte multiplication) without hinting that there is a problem. Operations that map to instructions that actually exist are mostly fast.

While bitwise operations on ulong are also fast, their vector versions are even better when viewed in terms of total work done per unit time. For example, Skylake can execute (at best) four scalar bitwise operations per cycle (though extra operations, like the addition and compare/branch that form a loop, would compete for the same resources), whereas it executes three 256-bit bitwise operations per cycle with SIMD, which is 3 times the amount of work in the same time, and it leaves an execution port open for a scalar operation or branch.

So yes, it's probably worth using. You could keep the array of ulong and use the construct-from-array constructor of Vector<T>; that way you don't have to deal with vectors everywhere. For example, indexing into a vector with a variable index is not a nice operation at all, causing a branch, a vector store, and a scalar reload. The variable-size nature of the vectors obviously also significantly complicates using arrays of them directly, rather than using arrays of primitive types and then vector-loading from them. You could easily round up the length of the array to a multiple of the vector count, though, to remove the need for a small scalar loop to handle the remaining items at the end of an array that don't quite fit in a vector.
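
As a minimal sketch of that approach (the helper name is mine, and it assumes both arrays are already padded to a multiple of Vector<ulong>.Count, so no scalar tail loop is needed):

using System.Numerics;

var x = new ulong[256];    // 16,384 bits; 256 is a multiple of any plausible Count
var y = new ulong[256];
AndInPlace(x, y);

// ANDs src into dst in place, one vector at a time.
static void AndInPlace(ulong[] dst, ulong[] src)
{
    int step = Vector<ulong>.Count;           // 2 at 128 bits, 4 at 256 bits
    for (int i = 0; i < dst.Length; i += step)
    {
        var d = new Vector<ulong>(dst, i);    // construct-from-array load
        var s = new Vector<ulong>(src, i);
        (d & s).CopyTo(dst, i);               // vector store back into the array
    }
}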

harold
  • Geez, I know "thank you"-type posts are somewhat discouraged, but what a *fantastic* answer! Detailed, informative, courteous, and just the specific helpful info I was looking for. Immediate +1 and accept. – Glenn Slayden Apr 05 '18 at 01:49
  • "Indexing into a **vector** is not a nice operation at all..." Since you said *vector*, are you referring to some kind of CPU/SIMD indexing to obtain a subsection of the SIMD register for some reason... or rather standard .NET array indexing into the putative array in order to perform the SIMD load in the first place? If the former, I hope you don't mean bit-testing, which could feel like 'indexing' a subsection. If the latter, then isn't there also CLR range checking on the array (or does your tally include that already) and then also wouldn't that overhead be same for the `ulong[]` case? – Glenn Slayden Apr 05 '18 at 02:12
  • @GlennSlayden I meant into the `Vector` even if it comes from an array (both indexes are subject to bounds check, though BCE may apply), the store/reload can be optimized out depending on context (roughly, if the scalar load can be done directly from memory, it is - it doesn't become a load/store/load triple-whammy) – harold Apr 05 '18 at 02:20
  • Ok sorry then, what use-case needs to fetch what sub-unit of the SIMD register and why? As I said, my use case is pretty much limited to ridiculous-width bitwise gang ops: AND, OR, NOT, XOR. Hamming weight would be a gigantic bonus too. – Glenn Slayden Apr 05 '18 at 02:25
  • For example, a hypothetical bad case (which is not actually necessary, but someone might write it) is a nested pair of loops, one over the array and then one over the vectors that are its elements. That is bad enough already; it suddenly and unexpectedly gets much worse if the element of the array is loaded into a local variable and the inner loop indexes into that. In most people's mental model that feels like fewer operations, but the codegen for it is significantly worse. – harold Apr 05 '18 at 02:28
  • Right. This just bolsters the point of the original question that *these aren't really 'vectors' of those (non-single-bit) blittable types* at all in the first place. The bad thinking you mention is only encouraged by calling this thing `Vector`. – Glenn Slayden Apr 05 '18 at 02:31
  • "You could easily round up the length of the array to a multiple of the vector count though..." Does this imply that a (i.e. 128-bit) SIMD load might require better than 64-bit alignment (which would surprise me on an x64)? Because otherwise you could manipulate the last quad via a SIMD load that overlaps with half of its *lower* neighbor... – Glenn Slayden Apr 05 '18 at 02:39
  • Overlap with the lower neighbor is OK but inconvenient – harold Apr 05 '18 at 02:43
  • @GlennSlayden btw for Hamming weights there is a [nice paper](https://arxiv.org/pdf/1611.07612.pdf), but I couldn't implement it with System.Numerics.Vectors; much of it is doable, but the lack of shifts makes horizontal addition more or less impossible. At least I couldn't figure it out. Still, that technique may be used to reduce the number of scalar popcnts. – harold Apr 05 '18 at 03:04
  • That paper was brilliant, btw; thanks for pointing it out. It was easy to port their near-best carry-save adder (CSA) code to (non-SIMD) C# and play with its competitive performance characteristics vs. other Hamming-weight methods. Their method requires considerably large blocks to start winning, probably even quite beyond the empirical cases (~6,000 bits) I described earlier (again, non-SIMD here). – Glenn Slayden Oct 05 '18 at 04:54
  • Did that "shuffles" link used to go somewhere other than the top-level "Intel® C++ Compiler 19.0 Developer Guide and Reference" it currents goes to? And BTW, the author of the popcnt paper you linked has C++ code on GitHub. https://github.com/WojciechMula/sse-popcount. So for large arrays you could just have C# call native code you created with a C++ compiler, especially if you have dispatch for Ryzen (scalar `popcnt` is best, I think), Haswell+ (AVX2 `vpshufb` LUT or maybe Harley-Seal if that's a win), or AVX512 (`vpternlogd` for Harley-Seal). – Peter Cordes Apr 21 '19 at 19:56
  • @PeterCordes Yes, that's possible, but calling from managed code (C#) for this has two problems. The less-serious is that managed-to-native calls are *very* expensive, so there better be either a gigantic efficiency win—or a lot to do—once there. Tends to foil tasks that are not by nature batchable. The far worse problem, however, which doesn't apply to all native call scenarios but I suspect probably would here, is that you'd be taking a x86 vs. x64 architecture dependency for which .NET has notoriously poor (i.e., absent) runtime resolution support. – Glenn Slayden Jun 26 '19 at 23:33
  • @GlennSlayden: Right, of course you can't call x64 code from 32-bit; managed JITed machine-code has to be for the same ISA as the ahead-of-time compiled code. That would rule out 64-bit scalar `popcnt` for a 32-bit .NET program, but otherwise there's no problem. Having only eight YMM or ZMM registers isn't a huge handicap for any of the SIMD algorithms, I think, so if you have to use 32-bit .NET code, you can make 32-bit unmanaged code to call from it. – Peter Cordes Jun 26 '19 at 23:48
  • But if you were able to get non-terrible SIMD machine code out of the JIT compiler then there's probably not a huge ratio between optimal machine code vs. `Vector`, not enough to pay back the up-front cost in most cases. With very huge arrays, or if the ratio is larger than we expect, it could still be worth it. (e.g. on a CPU with AVX512VPOPCNT if .NET can't use that directly.) If you have time, maybe do a microbenchmark of just a pure C++ program vs. a pure C# / .NET version that does the same work. (Make sure you init the data so lazy mem alloc doesn't give unrealistic cache hits, etc.) – Peter Cordes Jun 26 '19 at 23:52
  • @PeterCordes Yeah, no problem to create separate 32 vs. 64 DLLs with their respective code and calling conventions, etc. of course, but as you know the hacky part is having to ship/maintain both and somehow to bind the correct one at runtime (or app install-time?)... There are ways to do it, but they're all pretty ugly. – Glenn Slayden Jun 27 '19 at 00:25
  • @GlennSlayden: oh, I wasn't aware it was inconvenient. I don't really do Windows. I guess if you have a .NET program that ships as bytecode and can JIT to either 32 or 64-bit depending on the target system, you would need to ship both DLLs. I'd have expected that there would be a way to have 32 and 64-bit DLLs with different builds of the same code, like for shared libraries on Linux where you typically have `/lib` and `/lib64`, or equivalent `LD_LIBRARY_PATH` within the install directory for an application, and the same filename inside those. But even that's not as easy as 1 file, or esp. 0 – Peter Cordes Jun 27 '19 at 00:40
  • @PeterCordes Yeah, no; if you think about it, it's kinda the whole point of .NET to avoid all that in the first place, so the whole design inherently resists the workarounds. What .NET more elegantly "wants" to do here is allow for user-definable IL opcodes or JIT extension points. – Glenn Slayden Jun 27 '19 at 02:33
  • @harold You wrote: "much of [Harley-Seal CSA] is doable but the lack of shifts makes horizontal addition more or less impossible..." The H/S algorithm you pointed out only requires left-shifts, and it looks like the JIT recognizes multiplication by (statically-analyzable) powers of two as such, for the purposes of `Vector` and SIMD. So as far as I can tell, that shouldn't be a showstopper? (Note that I could be way off base here...) – Glenn Slayden Aug 22 '19 at 01:21
  • @GlennSlayden I was worried about the function referred to in the linked paper as `count(__m256i)` (aka popcnt, horizontal addition at the bit level); I still don't know how to implement that reasonably. The obvious ways to implement it include right-shifts, and the trick used in the paper involves `pshufb`, but `Vector` can't do either of those. On second thought, I suppose testing every bit and summing the comparison masks would work, though that's not very nice. I hope there's something better. – harold Aug 22 '19 at 01:49
  • Oh, I was talking about their method `harley_seal(...)` in Figure 8, which doesn't have those problematic dependencies. I generally don't see any impediments to a `Vector` implementation of the CSA bit counter in Fig. 8 (*i.e.*, using `CSA()` from Figure 6). – Glenn Slayden Aug 22 '19 at 03:58
  • @GlennSlayden it has `count` in the loop, the CSAs aren't the problem but without `count` the CSAs alone don't do the trick - or they *could* do but then the transposed accumulator needs enough bits to hold the result – harold Aug 22 '19 at 04:18
  • @harold You are entirely correct. I didn't fully appreciate that fact. Thanks. – Glenn Slayden Sep 03 '19 at 19:37
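
For reference, the CSA step from Figure 6 of the paper linked above transcribes directly to Vector<ulong> using only the operators Vector<T> exposes (a sketch; per the thread, it's the horizontal count() step, not the CSA itself, that Vector<T> can't express):

using System;
using System.Numerics;

var ones = Vector<ulong>.One;
CSA(out var carry, out var sum, ones, ones, Vector<ulong>.Zero);
Console.WriteLine((carry, sum));    // per lane: 1 + 1 + 0 = binary 10 --> carry 1, sum 0

// Carry-save adder: compresses three inputs into a (carry, sum) pair
// using only XOR, AND, and OR.
static void CSA(out Vector<ulong> h, out Vector<ulong> l,
                Vector<ulong> a, Vector<ulong> b, Vector<ulong> c)
{
    var u = a ^ b;
    h = (a & b) | (u & c);    // carries
    l = u ^ c;                // partial sums
}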