
I've been writing some Cython code to implement multi-precision array operations (mostly dot products and matrix inversion) that I want to use in Python. I used MPFR as the underlying C library, and by testing both in C and in Cython I find MPFR (at 200 bits of precision) to be 50-200 times slower (depending on the operation) than NumPy (at machine precision). I know MPFR is very fast, but I still find this overhead surprisingly large. Since my needs are very limited (fixed precision, only basic operations such as add, multiply, etc.), I was wondering if I could just hand-code some multi-precision operations (disregarding careful rounding, etc.). Unfortunately this involves quite a lot of work, so I was hoping to find some free code snippets in C or Intel assembly for doing basic multi-precision arithmetic. I would appreciate any references to the latter, or reasons why I should or should not take this approach.
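
To make concrete the kind of hand-rolled, fixed-precision routine I have in mind, here is only a sketch of the integer-mantissa part: the type and function names are made up, and exponents, normalization and rounding are ignored entirely.

    #include <stdint.h>

    #define NLIMBS 4                       /* 4 x 64 bits = 256-bit mantissa, fixed at compile time */

    typedef struct {
        uint64_t limb[NLIMBS];             /* little-endian limbs, stored inline (no heap pointer) */
    } mp_fixed;

    /* r = a + b; returns the carry out of the top limb.
       Signs, exponents and rounding are deliberately ignored here. */
    static uint64_t mp_fixed_add(mp_fixed *r, const mp_fixed *a, const mp_fixed *b)
    {
        uint64_t carry = 0;
        for (int i = 0; i < NLIMBS; i++) {
            uint64_t s  = a->limb[i] + carry;
            uint64_t c1 = (s < carry);     /* overflow from adding the incoming carry */
            uint64_t t  = s + b->limb[i];
            uint64_t c2 = (t < s);         /* overflow from adding b's limb */
            r->limb[i] = t;
            carry = c1 + c2;
        }
        return carry;
    }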

UPDATE: I should have mentioned that I've already tried the QD library, and it's actually (slightly) slower than MPFR at similar precision (212 bits). I guess this must be due to C++ overhead.

user2153813
  • Comparing 200 bits precision to machine precision is comparing apples to oranges. A speed penalty of 50-200 times doesn't seem unusual to me. I doubt you'll beat it even if you only want to support "simple" operations like multiply. – MatthewD May 08 '13 at 11:05
  • I do not understand why some users are voting to close this question. What is wrong with it? The OP is using MPFR but finds it's more than s/he needs. The OP is very clear that the reason s/he expects something faster is possible is that s/he is not using all the features MPFR offers. As it turns out, others have already felt the same need, and there even exists at least one ready-to-use alternative. – Pascal Cuoq May 08 '13 at 11:57
  • @MatthewD A double-double addition requires a few double additions to compute. If a FMA instruction is available (as found in PowerPC processors and very recent Intel/AMD desktop ones), a double-double multiplication only takes a few instructions too. – Pascal Cuoq May 08 '13 at 12:03
  • Thanks Pascal, that was exactly my reasoning. I've seen another SO thread where someone gave some CUDA assembly instructions for 128-bit operations, and they claimed it took ~16 instructions, so I expect that order of slowdown rather than 100-200 times. Even using 120 bits I get a significant slowdown (60-100x). – user2153813 May 08 '13 at 13:40
  • Try profiling your program to find out which MPFR functions are using the most time. – brian beuning Jul 09 '13 at 23:58

1 Answer


You could try a double-double or quad-double library. These libraries take advantage of existing double-precision hardware for speed (I wrote a summary as part of a question of my own). There seems to be code for the latter.
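
To sketch how these libraries get their speed out of plain doubles, here are the classic error-free building blocks (Knuth's two-sum and an FMA-based two-product) and double-double add/multiply routines composed from them. The names and exact composition are mine, not QD's actual API; real libraries add renormalization and special-case handling, and none of this survives compilation with -ffast-math or excess x87 precision.

    #include <math.h>                       /* for fma() */

    typedef struct { double hi, lo; } dd;   /* value = hi + lo, with |lo| much smaller than |hi| */

    /* Error-free sum of two doubles (Knuth's two-sum): s + *err == a + b exactly. */
    static double two_sum(double a, double b, double *err)
    {
        double s  = a + b;
        double bb = s - a;
        *err = (a - (s - bb)) + (b - bb);
        return s;
    }

    /* Faster variant, valid only when |a| >= |b|. */
    static double quick_two_sum(double a, double b, double *err)
    {
        double s = a + b;
        *err = b - (s - a);
        return s;
    }

    /* Error-free product using one fma(): p + *err == a * b exactly.
       Cheap only when the target has a hardware FMA instruction. */
    static double two_prod(double a, double b, double *err)
    {
        double p = a * b;
        *err = fma(a, b, -p);
        return p;
    }

    /* Double-double addition (the simple "sloppy" variant; libraries also
       offer a slightly more expensive version with a tighter error bound). */
    static dd dd_add(dd a, dd b)
    {
        double e;
        double s = two_sum(a.hi, b.hi, &e);
        e += a.lo + b.lo;
        dd r;
        r.hi = quick_two_sum(s, e, &r.lo);
        return r;
    }

    /* Double-double multiplication: one FMA-based product plus cheap corrections. */
    static dd dd_mul(dd a, dd b)
    {
        double e;
        double p = two_prod(a.hi, b.hi, &e);
        e += a.hi * b.lo + a.lo * b.hi;
        dd r;
        r.hi = quick_two_sum(p, e, &r.lo);
        return r;
    }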

These libraries require the underlying hardware to operate exactly as mandated by the IEEE 754 standard; they break down if intermediate computations are carried out with excess precision. If you target a modern desktop processor, make sure your compiler generates SSE2 instructions for the floating-point computations. If you are stuck with x87 (8087-style) instructions for some reason, you are better off using a double-double-extended library (numbers represented as the sum of two 80-bit extended-precision numbers). There is one inside CRlibm that could be extracted without too much work.
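
A quick way to check what the compiler is doing is shown below (only a sketch; the flags are for GCC/Clang on 32-bit x86, and 64-bit targets already default to SSE2 math):

    /* Compile with, e.g.:  gcc -O2 -msse2 -mfpmath=sse check.c
       On a 64-bit target SSE2 math is already the default. */
    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        /* FLT_EVAL_METHOD == 0 means doubles are evaluated as doubles (no
           excess x87 precision), which is what double-double code relies on. */
        printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
        return 0;
    }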


Alternatively, it may be worth trying GMP's MPF type. According to MPFR's own FAQ, MPF can be faster because it does not try to be as "nice" as MPFR.
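
The MPF interface mirrors MPFR fairly closely, so switching is mostly mechanical; a minimal sketch (precision and values chosen arbitrarily, link with -lgmp):

    #include <gmp.h>
    #include <stdio.h>

    int main(void)
    {
        mpf_t a, b, c;
        mpf_init2(a, 200);               /* precision in bits, analogous to mpfr_init2 */
        mpf_init2(b, 200);
        mpf_init2(c, 200);

        mpf_set_d(a, 1.5);
        mpf_set_d(b, 3.25);

        mpf_mul(c, a, b);                /* c = a * b */
        mpf_add(c, c, a);                /* c = c + a */

        gmp_printf("result = %.40Ff\n", c);

        mpf_clear(a);
        mpf_clear(b);
        mpf_clear(c);
        return 0;
    }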

Pascal Cuoq
  • I have not studied this carefully, but I assume rounding issues only affect the least significant bits, correct? If so, they are mostly irrelevant to me: I just need to increase precision beyond machine precision and can afford rounding errors in the least significant bits. I actually did test QD along with MPFR, and it was always _slower_ (with MPFR at 212 bits). This is a testament to MPFR's impressive optimizations (or QD's C++ overhead). – user2153813 May 08 '13 at 13:44
  • @user2153813 Double-double or quad-double libraries do not try to provide a perfect IEEE 754-style floating-point arithmetic. In particular, one trick is to avoid renormalization on some intermediate steps, trading performance for a few bits of accuracy (the programmer needs to track the number of bits lost him/herself). This trick is used in CRlibm. Investigating your question, I found this, which suggests that GMP's MPF type **might** be faster than MPFR, since it does not try to be as nice: http://www.mpfr.org/faq.html#mpfr_vs_mpf – Pascal Cuoq May 08 '13 at 14:24
  • MPF might give a bit of an improvement, but from googling the difference is maybe 20%, whereas I'm looking for a 3-6x speedup. One big issue is that both MPFR and GMP use arbitrary precision, so internally they store the number as a pointer to an array. Since I can fix the precision at compile time, I can store the number inside a struct. This might be very important when doing large dot products, because all accesses will be in contiguous memory. I want something like QD, but for some reason QD is slower than MPFR. So I'm now just looking for some C code to get me started writing my own. – user2153813 May 08 '13 at 17:00
  • To clarify the above: I want some simple baseline code to evaluate what the overhead of MPFR is, to determine whether it's worth writing my own MP functions, which I can moreover optimize for array access (i.e. MP dot products). I could just start with the algorithm in the qd.pdf file you pointed to, but I was hoping to start from some actual code. I'll check out CRlibm. Thanks for all the suggestions! – user2153813 May 08 '13 at 17:03
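
Building on the comments above: with a fixed-size value type such as the dd struct sketched in the answer, a multi-precision dot product over contiguous arrays is straightforward. Again, this is only a sketch, reusing the hypothetical dd, dd_add and dd_mul helpers defined earlier rather than any library API.

    /* Dot product of two contiguous arrays of double-double numbers.
       No renormalization or error tracking beyond what dd_add/dd_mul provide. */
    static dd dd_dot(const dd *x, const dd *y, int n)
    {
        dd acc = {0.0, 0.0};
        for (int i = 0; i < n; i++)
            acc = dd_add(acc, dd_mul(x[i], y[i]));
        return acc;
    }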