How can I access the math coprocessor from C# code? I would like to make some calculations on integers as fast as possible. I know C++ compilers let you use inline assembler code, but what about .NET?
-
Can you give us an example of what you're trying to do? In most situations the code emitted by the JIT compiler is fast enough. – Steven Mar 17 '11 at 21:56
-
BTW, math stopped being hardwired in a *co*processor with the 80486DX in 1989. IIRC the 80387 coprocessor was about FP arithmetic, not integer ;-) – Serge Wautier Mar 17 '11 at 22:05
5 Answers
The JIT compiler knows about the math coprocessor and will use it. What you really want is to use the SIMD engine, not the math coprocessor. This was part of the promise of JIT-compilation, that the runtime could pick the fastest hardware acceleration available on each computer, but I don't think .NET actually does that, at least in v4.
Or are you using the term "math coprocessor" to mean something other than the x87 FPU? There are some FPGA boards marketed as accelerator/coprocessor systems. If that's what you mean, you'll need to consult the programming manual that comes with the particular product. There are no special CPU instructions for accessing those, inline assembler wouldn't be helpful in this case.
For example, the GPU is even faster at math on large datasets than the CPU's SIMD engine, and you can access that from .NET using DirectX Compute Shaders (or p/invoking OpenCL), no assembler required.
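(Later .NET versions did gain this: `System.Numerics.Vector<T>` is JIT-compiled to SSE/AVX instructions when the hardware supports them. A minimal sketch, assuming a runtime with RyuJIT and the `System.Numerics.Vectors` package:)

```csharp
using System;
using System.Numerics;

static class SimdAdd
{
    // Adds two int arrays element-wise, vectorized where possible.
    public static void Add(int[] a, int[] b, int[] result)
    {
        int width = Vector<int>.Count;          // e.g. 8 ints per vector with AVX2
        int i = 0;
        for (; i <= a.Length - width; i += width)
        {
            // One hardware SIMD add per iteration instead of `width` scalar adds.
            (new Vector<int>(a, i) + new Vector<int>(b, i)).CopyTo(result, i);
        }
        for (; i < a.Length; i++)               // scalar tail
            result[i] = a[i] + b[i];
    }
}
```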

-
Ben is correct. The runtime hardly differentiates between processor architectures and doesn't benefit from the newest processor instructions. The reason is that it would make it much harder for Microsoft to test the framework on all those processors, and much harder for Microsoft support to reproduce problems, because the JITted code would differ from processor to processor. – Steven Mar 17 '11 at 22:07
-
Yes, I mean x87. Thank you for your very helpful answer. I'll try DirectX and compare it with C# – Piotr Salaciak Mar 17 '11 at 22:11
I don't think this is possible directly from managed code. You could still call unmanaged code that does those calculations, but whether the cost of interop marshaling is worth it is difficult to say. Minimize that cost as much as possible: do all the calculations in unmanaged code and make only a single call across the boundary.
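To illustrate the batching idea, here is a sketch. The library name `mathlib` and the function `SumSquares` are hypothetical stand-ins for whatever native code you write; the point is that one call covers the whole array, so the marshaling overhead is paid once rather than per element:

```csharp
using System;
using System.Runtime.InteropServices;

class NativeMath
{
    // Hypothetical native routine: sums the squares of `count` integers.
    [DllImport("mathlib")]   // assumed native library name
    static extern long SumSquares(int[] values, int count);

    static long Compute(int[] values)
    {
        // Calling into native code once per element would pay the interop
        // overhead n times; one batched call pays it once.
        return SumSquares(values, values.Length);
    }
}
```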

-
I've worked on a project in the past where we did loan calculations. One of the developers moved the calculations to C++, because he thought C# was too slow. The calculations in C++ were actually not much faster, but the marshaling killed us. The irony of it all is that the real performance problem was actually the hundreds of database queries executed during the calculations, which took up about 98% of the time :-) – Steven Mar 17 '11 at 21:59
-
@Steven: Sure, but I bet you found your bottleneck really easily. Math code in C# and C++ might not be recognizably different, but I bet people noticed the pain of porting db access apis. – Ben Voigt Mar 17 '11 at 22:04
No, you cannot directly use inline assembler in C# managed code.
Your best bet is to make sure your general approach/algorithm is clean and efficient, that your math operations are clean and efficient, and then to rely on the compiler to make efficient use of the available coprocessor.

-
I say: go parallel! Calculations are often very good candidates for parallelization. The cleaner the algorithm, the easier it is to parallelize. – Steven Mar 17 '11 at 22:02
-
@Steven: Sounds like a great recommendation, assuming it makes sense within the context of what the OP is doing. – Jonathan Wood Mar 17 '11 at 22:06
-
Of course. We currently have not enough information to see whether parallelization works in his situation. Just an educated guess :-) – Steven Mar 17 '11 at 22:08
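The parallel suggestion above can be sketched with `Parallel.For` from the Task Parallel Library (available since .NET 4). This assumes the per-element work is independent, which is what makes it a good candidate:

```csharp
using System;
using System.Threading.Tasks;

class ParallelMath
{
    static void Main()
    {
        int[] input = new int[1000];
        for (int i = 0; i < input.Length; i++) input[i] = i;

        long[] output = new long[input.Length];

        // Each iteration is independent, so the TPL can spread the
        // work across all available cores.
        Parallel.For(0, input.Length, i =>
        {
            output[i] = (long)input[i] * input[i];  // any pure per-element math
        });

        Console.WriteLine(output[999]);  // 998001
    }
}
```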
This is not natively supported by C# as a language, nor by .NET as a framework.
If you need that kind of speed or low-level control, use something else altogether.

I know this is an old post, but this is for those coming here for a similar reason: speeding up maths operations, for example a large number of vector operations. To get the greatest speed from C# in maths, convert your formulae to their logarithmic equivalents. This takes some practice, but once you have the idea you can do it with every formula. Then keep your values in log form, converting to human-readable form only for those values the user needs to see.
The reason logs work faster is that the operations all become addition and subtraction (subtraction just being addition of a complement number), which processors can do in large numbers with ease.
If you have not done this sort of maths before, there are lessons online that will lead you through it. It has a learning curve, but for maths/graphics programmers it is worth it.
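For concreteness, the idea is to replace repeated multiplication with a running sum of logarithms, converting back only when a human-readable value is needed. A minimal sketch (note the comments below dispute the performance claim for multiply vs. add on modern x86; the technique remains useful for division, powers, and avoiding overflow/underflow in long products):

```csharp
using System;

class LogDomain
{
    static void Main()
    {
        // Multiplying many positive values = adding their logs,
        // then exponentiating once at the end.
        double[] values = { 2.0, 3.0, 4.0 };

        double logProduct = 0.0;
        foreach (double v in values)
            logProduct += Math.Log(v);      // multiplication becomes addition

        double product = Math.Exp(logProduct);
        Console.WriteLine(product);          // ≈ 24 (the product 2 * 3 * 4)
    }
}
```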

-
FP multiply has identical performance to FP add/sub (in throughput, latency, and uops) on most modern x86 CPUs, such as Skylake or Ice Lake. Or are you talking about integer? I guess this would help with division, since div and sqrt are slow (higher latency and much worse throughput), if that's what you mean? – Peter Cordes Oct 24 '21 at 20:18
-
No not really, the values seem small but in real time processing they are significant. This link is to an old post, but the percentage differences still apply. http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/ In addition (poor place for a pun), addition/(subtraction) of logs allows for serializing the data being added. – Bob Oct 26 '21 at 06:07
-
Sorry, I should have linked https://uops.info/ to back up my point. Agner Fog's instruction tables (which your link cites) have the same numbers. Your link seems to just show throughput as the low end of the "cost" range, with latency as the high end, which is maybe useful for readers who don't understand pipelined CPUs and instruction-level parallelism I guess. Intel since Skylake has had literally identical performance for `vaddpd` and `vmulpd`, 4c latency, 0.5c throughput, single uop. (And a compiler can "contract" a combination of multiply and add into an FMA.) – Peter Cordes Oct 26 '21 at 06:25
-
So basically I'm saying that the table you linked is wrong about FP add vs. FP mul. On Haswell, FP add is 3 cycle latency (1c throughput), FP mul is 5 cycle latency (0.5c throughput), so interestingly FP add had worse throughput despite better latency. [Why does Intel's Haswell chip allow floating point multiplication to be twice as fast as addition?](https://electronics.stackexchange.com/a/452366) (Also, lumping integer SIMD mul in with FP mul pushes up the top of the range, e.g. it should be 10 cycle latency for `vpmulld` on Haswell and later.) – Peter Cordes Oct 26 '21 at 06:29
-
Basically, trying to reduce things to simple costs you can add up is a fools errand, [that's not how pipelined OoO exec CPUs work](https://stackoverflow.com/questions/692718/how-many-cpu-cycles-are-needed-for-each-assembly-instruction/44980899#44980899), although I've [attempted it myself in a Q&A answer](https://gamedev.stackexchange.com/questions/27196/which-opcodes-are-faster-at-the-cpu-level/104534#104534). – Peter Cordes Oct 26 '21 at 06:31