I need to do arithmetic on 256-bit unsigned integers, and I need a fast implementation. AFAIK, SIMD instructions do not help because of carry propagation (see "Can long integer routines benefit from SSE?"), whereas Intel's ADX instructions can.
How can I utilize Intel's ADX instructions to make addition and multiplication faster?
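To make the carry chain concrete, here is a minimal portable C sketch of the 256-bit addition I have in mind. The `u256` type and the `add_u256` name are hypothetical, chosen just for illustration, and the four-limb little-endian layout is an assumption. It does not use ADX at all; it only shows the limb-by-limb dependency I would like to speed up.

```c
#include <stdint.h>

/* Hypothetical type: a 256-bit unsigned integer stored as four 64-bit
   limbs, least significant limb first (an assumed layout). */
typedef struct { uint64_t limb[4]; } u256;

/* Adds a and b into *out and returns the carry out of the top limb (0 or 1).
   Each limb depends on the carry from the previous one, which is the
   serial dependency that makes SIMD awkward here. */
static unsigned add_u256(u256 *out, const u256 *a, const u256 *b)
{
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t s = a->limb[i] + b->limb[i];  /* may wrap around */
        unsigned c1 = s < a->limb[i];          /* 1 if the limb sum wrapped */
        uint64_t t = s + carry;                /* add the incoming carry */
        unsigned c2 = t < s;                   /* 1 if that add wrapped */
        out->limb[i] = t;
        carry = c1 | c2;                       /* at most one of them is set */
    }
    return carry;
}
```

Multiplication has the same structure, except that each partial product feeds two such chains, which is where I understand the two independent carry flags used by ADCX and ADOX are supposed to pay off.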