2

I wanted to use the AVX-512 instruction in C#, but what I understood is: there is no support for it (or I am extremely bad on searching on internet). So I decided to create my own binding for it. However I'm getting:

External component has thrown an exception.

And I can't figure out what I messed up here.

Here is my C code:

#include <immintrin.h>

__declspec(dllexport) 
__m512i
load_s32(const void *ptr) {
    return _mm512_load_epi32(ptr);
}

which is compiled using following commands:

gcc -c decl.c -mavx512f
gcc -shared -o libavx512.dll decl.o -Wl,--out-implib,libavx512.dll.a -mavx512f

In C# I created a library which contains following part:

using System.Runtime.InteropServices;

using S64 = System.Int64;

namespace AVX512Sharp
{
    [StructLayout(LayoutKind.Sequential, Size = 64)]
    public struct M512S32
    {
        public S64 M0;
        public S64 M1;
        public S64 M2;
        public S64 M3;
        public S64 M4;
        public S64 M5;
        public S64 M6;
        public S64 M7;
    }

    public static class AVX512
    {
        [DllImport("libavx512.dll", CallingConvention = CallingConvention.Cdecl, EntryPoint = "load_s32")]
        public extern unsafe static M512S32 LoadS32(void* ptr);
    }
}

And in my test program I'm using it like this:

int* mem = stackalloc int[16];
for (int i = 0; i < 16; ++i)
    mem[i] = i * 10;

M512S32 zmm0;
zmm0 = AVX512.LoadS32(mem);

I really don't know what I did wrong here.

Notes

  • To test if the binding work I removed the SIMD function:
__declspec(dllexport) 
void
load_s32(const void *ptr) {
    return;
}

and also updated the AVX512 class:

public static class AVX512
{
    [DllImport("libavx512.dll", EntryPoint = "load_s32")]
    public extern unsafe static void LoadS32(void* ptr);
}

this didn't throw an exception.

  • In the second step I tried to use the dll in a C application. which also worked out without any errors.
  • Also tried to use extra attributes such as: -Wl,--export-all-symbols, -Wl,--enable-auto-import. The related doc is here.
  • 2
    Your first step for debugging should be to remove the AVX-specific part from the unmanaged DLL to make sure you've set the interop stuff up correctly. If that works, then use your unmanaged DLL in an unmanaged context to make sure it is fundamentally correct. If after all that, it still doesn't work, _then_ you know you have an actual C# question. – Peter Duniho Apr 24 '21 at 18:59
  • @PeterDuniho But I have done that already (unless I missed something during the testing). –  Apr 24 '21 at 19:05
  • Sorry...I didn't see anything in the post above that described the debugging steps you'd done already. – Peter Duniho Apr 24 '21 at 19:22
  • You have to alloc the memory for the object before call c++. (M512S32 zmm0) zmm0 is null. So in c++ _mm512_load_epi32(ptr); is trying to access a null object. – jdweng Apr 24 '21 at 19:31
  • @jdweng can you explain it a bit more? Why should I allocate something for zmm0? zmm0 is a register, isn't it (If I'm not wrong)? –  Apr 24 '21 at 19:41
  • You can use "new M512S32" to allocate. I just meant is was null meaning the object didn't have memory assigned. – jdweng Apr 24 '21 at 19:54
  • 1
    @jdweng It isn't necessary, however I tried it and it didn't work (also a "default" `struct` can't be null). https://stackoverflow.com/questions/7767669/why-is-it-possible-to-instantiate-a-struct-without-the-new-keyword –  Apr 24 '21 at 19:59
  • Don't think so, because the [doc](https://learn.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.structlayoutattribute.size?view=net-5.0) says that the `size` should be in `bytes`. –  Apr 24 '21 at 20:29
  • _mm512_load_epi32 memory has to be 64 bit alligned. Your code is only 32 bit alligned (int* mem = stackalloc int[16];) – jdweng Apr 24 '21 at 20:30
  • @jdweng isn't 4*16=64? Or did I misunderstand the comment? –  Apr 24 '21 at 20:35
  • if your memory started at address zero "64 bit align" means starts at one of following : 0, 8,16,24,32,40,48.... An int* just means the data is 32 bits wide. So your 16 integers number could be at addresses 2, 6,10, 14, 18, 22, ... The array is not started at address zero. – jdweng Apr 24 '21 at 20:41
  • @jdweng I'm not sure I understand what you mean. How is the starting point not zero? Note that this array with 8 elements would work with AVX2 (which is 256bit or 32byte or 4*8). So I don't understand your suggestion. Maybe you can write a detailed answer to make it clear? –  Apr 24 '21 at 20:46
  • Do you know what 64 bit aligned means? What is means is the starting address in memory the 3 LSBs are all zero. Or (address % 8) == 0. What you have "stackalloc int[16]" is just means you have 4 * 16 bytes (64 bytes) starting at an address in memory. Not any address with 3 LSBs zero. – jdweng Apr 24 '21 at 20:57
  • @jdweng I think you don't know that it should be 64byte(512bit) aligned. Here is the [link](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#!=undefined&expand=3301,3298,3344,542,774,3865,2185,4002,2429,2438,94,5147,1386,1386,2941,2941,6024,94,632,2457,6024,2984,3865,3865,3301,3326&techs=AVX_512&text=_mm512). As you can see you can fit 16 `int`s. Again the exact code works in `C`. You can show me an example If I misunderstand your comment. I think with an example we can talk more precisely. –  Apr 24 '21 at 21:09
  • 1
    I suggest to build against x64 both library and the app (not Any CPU), and use Fastcall calling convention. Cdecl is x86 convention which is not applicable to x64. [Some read](https://stackoverflow.com/a/58567658/12888024). – aepot Apr 24 '21 at 21:37
  • What is this supposed to do? If C# doesn't know about AVX512, how would it know to get the result from `zmm0`? – stepan Apr 24 '21 at 22:07
  • Read Description : https://scc.ustc.edu.cn/zlsc/chinagrid/intel/compiler_c/main_cls/GUID-88D03298-7839-4B1B-BD45-32B3378759C2.htm Best way is to put array into a structure like the one you already have. See Pack : https://learn.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.structlayoutattribute?view=net-5.0 – jdweng Apr 24 '21 at 22:12
  • 1
    @jdweng: Correction, `_mm512_load_epi32` has to be 64 **byte** aligned not 64-bit; it's a silly alternate name for `_mm512_load_si512` that I recommend never using. ([What is the difference between \_mm512\_load\_epi32 and \_mm512\_load\_si512?](https://stackoverflow.com/q/53905757)). Only use the _epi32 version if you're doing a masked load, because then the element size has meaning. Use `loadu` instead of `load` for unaligned loads, but note that alignment is more important for performance with 512-bit vectors: *every* misaligned vector is a cache-line split, and its a bigger slowdown. – Peter Cordes Apr 25 '21 at 01:11
  • @Hrant: Did you confirm your CPU supports AVX-512, e.g. with a pure C test, maybe with optimization disabled so you can do something simple and have it not optimize away? Or just use `__m512i` at all in code you compile with `gcc -march=native` - that should only work if AVX-512 is supported on the build machine. `-mavx512f` will generate code that uses AVX-512 regardless of whether the current machine supports it or not. – Peter Cordes Apr 25 '21 at 01:13
  • 1
    @PeterCordes Yep, my CPU on my MACHINE didn't have AVX512 support. Working on multiple machines has it's disadvantages. –  Apr 25 '21 at 03:35
  • @PeterCordes : Alignment is also important to prevent memory exceptions. If the alignemt is not consistant with the compiler options you can get errors at the end of a memory block. If the compile is set to 32 byte alignment and in the code have a 64 bit structure that isn't declared properly, you could get an exception. It is not always a performance issue. – jdweng Apr 25 '21 at 07:43
  • @jdweng what do you mean by "consistent"? Can you give me an example what should've been instead of `stackalloc int[16]` in case of `AVX512`? –  Apr 25 '21 at 07:48
  • @jdweng: Yeah, if you write buggy code, it can crash if you're unlucky. :P If you rule out buggy code, then yeah, aligning your data can make it safe to over-read past the end of an array (if C# lets you get away with that), as long as you make sure to ignore those bytes, e.g. with masking. ([Is it safe to read past the end of a buffer within the same page on x86?](//stackoverflow.com/q/37800739)). So a saner way to phrase that is that alignment can let you optimize loop cleanup for the final partial vector. (But often you can load a final unaligned vector that ends at the end of the array). – Peter Cordes Apr 25 '21 at 07:50

1 Answers1

4

I decided to create my own binding for it.

You can't. Best thing you can do instead, write a DLL in C or C++ which uses AVX512, and consume the DLL from C#. If you try to export individual instructions from the DLL, the performance won't be good because memory access, and because pinvoke overhead. Instead, you should write larger pieces of functionality in C.

I really don't know what I did wrong here.

Your C function expects input pointer in rcx register, and returns result in zmm0 vector register.

Your C# function doesn't know about zmm0. The runtime allocates 64 bytes on stack for the return value, passes address of the return value buffer in rcx register, passes input pointer in rdx register, and expects the function to return the pointer passed in rcx in rax register.

The languages on two sides of the interop disagree about the calling convention, and your code crashes in runtime.

Soonts
  • 20,079
  • 9
  • 57
  • 130
  • (Also, the querent apparently got their machines mixed up(?) and [was trying this one one that didn't support AVX-512](https://stackoverflow.com/questions/67246318/how-to-get-avx512-in-c/67250468#comment118867962_67246318), so that didn't help. Illegal-instruction fault in the C function, so it crashes that way before even returning. But yeah, wrapping single instructions is a non-starter for performance, even if you do get the calling-convention right.) – Peter Cordes Apr 25 '21 at 06:45
  • @Soonts I was trying to copy what MS did for `AVX2`. I looked into their code using `ILSpy` and they used similar structure there. I do get the data from interop function call (after fixing the calling convetion), but yeah you're right: the performance will not be perfect (or even good) here. –  Apr 25 '21 at 07:29
  • Also `.NET5` introduced "native function pointers" ([doc](https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/proposals/csharp-9.0/function-pointers)). Can someone use them to improve the overhead there? –  Apr 25 '21 at 07:35
  • @Hrant What you see in ILSpy is only the high-level part of the story. Note the `Vector256` structure is marked with `[Intrinsic]` attribute in the CIL. The .NET runtime knows how to handle these types properly: keep them in these vector registers (in case of 32-byte AVX vectors they are named `ymm0` to `ymm15`), when they are in memory align by 32 bytes whenever possible, etc. The runtime doesn’t support AVX512 and treats your `M512S32` as a regular structure type, not as a native vector. – Soonts Apr 25 '21 at 07:53
  • @Soonts I see. So the only hope is on MS? –  Apr 25 '21 at 07:54
  • About these new function pointers, it may slightly reduce the overhead. The pinvoke overhead is very reasonable even today. Before this new function pointers feature, I have measured about 15-20 CPU cycles for simple function call. But when the C code only runs 1 instruction, that overhead gonna dominate the performance. Another thing, because the .NET runtime does not support AVX512, all input and output vectors must be passed in memory as opposed to registers, better pinvoke won’t help with that, you gonna have way too many loads and stored in the code. – Soonts Apr 25 '21 at 07:54
  • 1
    @Hrant If you don’t want to wait, implement your performance-critical functions in C++ and use dllimport. Don’t export individual instructions, export larger functions which stream data from/to memory, and the performance will be fine. – Soonts Apr 25 '21 at 07:57
  • Another option, drop AVX512. Modern AMD processors running AVX2 code can be faster than modern Intel processors running AVX512 code. AMD has more cores, and in Zen3 they even have better single-thread performance. Some algorithms do benefit from AVX512 but the win is never by a factor of 2, typically way smaller. https://blog.milvus.io/milvus-performance-on-avx-512-vs-on-avx2-5dfb08d63a4c https://www.phoronix.com/scan.php?page=article&item=rocket-lake-avx512&num=2 – Soonts Apr 25 '21 at 08:05