I've been writing some Cython code to implement multi-precision array operations (mostly dot products and matrix inversion) that I want to use from Python. I used MPFR as the underlying C library, and by testing both in C and in Cython I find MPFR (at 200 bits of precision) to be 50-200 times slower, depending on the operation, than NumPy (at machine precision). I know MPFR is very fast, but I still find this overhead surprisingly large. Since my needs are very limited (fixed precision, only basic operations such as add, multiply, etc.), I was wondering whether I could just hand-code some multi-precision operations, disregarding careful rounding and the like. Unfortunately this would involve quite a lot of work, so I was hoping to find some free code snippets in C or Intel assembly for basic multi-precision arithmetic. I would appreciate any pointers to such code, or reasons why I should or should not take this approach.
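To make the hand-coded idea concrete, here is a minimal sketch of the kind of fixed-width routines I have in mind. It is only an illustration under my own assumptions: the names `NLIMBS`, `mp_add` and `mp_mul` are hypothetical, the mantissa is stored as 8 unsigned 32-bit limbs (256 bits, least-significant limb first), and it covers only unsigned integer mantissa arithmetic. Sign, exponent handling, normalization and rounding, which a real floating-point type would need, are all left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed width: 8 x 32-bit limbs = 256 bits, least-significant first. */
#define NLIMBS 8

/* r = a + b over NLIMBS limbs; returns the carry out of the top limb. */
static uint32_t mp_add(uint32_t r[NLIMBS],
                       const uint32_t a[NLIMBS],
                       const uint32_t b[NLIMBS])
{
    uint64_t carry = 0;
    for (size_t i = 0; i < NLIMBS; i++) {
        /* 64-bit accumulator holds limb + limb + carry without overflow. */
        uint64_t s = (uint64_t)a[i] + b[i] + carry;
        r[i] = (uint32_t)s;
        carry = s >> 32;
    }
    return (uint32_t)carry;
}

/* r = a * b as a full 2*NLIMBS-limb product (schoolbook, O(NLIMBS^2)).
 * r must not alias a or b. */
static void mp_mul(uint32_t r[2 * NLIMBS],
                   const uint32_t a[NLIMBS],
                   const uint32_t b[NLIMBS])
{
    for (size_t i = 0; i < 2 * NLIMBS; i++)
        r[i] = 0;

    for (size_t i = 0; i < NLIMBS; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < NLIMBS; j++) {
            /* Max value: (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so this fits. */
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        /* This limb has not been written by earlier rows, so plain store. */
        r[i + NLIMBS] = (uint32_t)carry;
    }
}
```

A fixed-precision multiply would presumably keep only the top `NLIMBS` limbs of the product and adjust the exponent. Whether loops like these can actually beat MPFR's tuned assembly is exactly what I am unsure about.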
UPDATE: I should have mentioned that I've already tried the QD library, and it's actually (slightly) slower than MPFR at similar precision (212 bits). My guess is that this is due to C++ overhead.