I have developed an application in CUDA for modular exponentiation, and it performs very well for 512-bit integers. These multi-precision integers are stored as 16 32-bit words.
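For reference, this is roughly how one such integer is laid out (the type and macro names here are only illustrative, not taken from an existing library):

```cuda
// One 512-bit multi-precision integer: 16 little-endian 32-bit limbs,
// least significant limb first.
#define LIMBS_512 16

typedef struct {
    unsigned int limb[LIMBS_512];   // limb[0] = least significant 32 bits
} mp512_t;
```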
Some techniques I use to achieve a 2.5-3.2x speedup compared to OpenSSL's modular exponentiation:
- `__shared__` memory
- CUDA memory alignment
- PTX code for 32-bit addition and multiplication (a minimal sketch follows this list)
- loop unrolling
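To illustrate the PTX point, this is the kind of inline PTX I mean, sketched here as a 128-bit addition building block (the function name and the use of `uint4` are just for the example, not my actual kernel code):

```cuda
// Adds two 128-bit values held in uint4 limbs (x = least significant word),
// propagating the carry with PTX add.cc / addc instructions.
__device__ void add_128(uint4 &r, const uint4 &a, const uint4 &b)
{
    asm("add.cc.u32  %0, %4,  %8;\n\t"
        "addc.cc.u32 %1, %5,  %9;\n\t"
        "addc.cc.u32 %2, %6, %10;\n\t"
        "addc.u32    %3, %7, %11;\n\t"
        : "=r"(r.x), "=r"(r.y), "=r"(r.z), "=r"(r.w)
        : "r"(a.x), "r"(a.y), "r"(a.z), "r"(a.w),
          "r"(b.x), "r"(b.y), "r"(b.z), "r"(b.w));
}
```

The 512-bit version follows the same pattern extended to 16 limbs, and multiplication can be built similarly from the carry-propagating `mad.lo.cc.u32` / `madc.hi.u32` instructions.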
Everything is fine so far, but when I try to extend the integers to 1024 bits, performance drops dramatically to 0.1-0.3x, and the only difference is the memory needed to store one integer - now 32 x 32-bit words. Not to mention the 2048-bit version, which is hundreds of times slower.
I should mention that when I want to compute, for example, 1000 modular exponentiations (r = a^x mod n), I send all the operands to my kernel at once, which means 512000 bytes of memory.
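For context, the host-side batching looks roughly like this (the kernel signature, names, and launch configuration are illustrative stand-ins, not my exact code):

```cuda
#include <cuda_runtime.h>

#define N_OPS  1000   // independent exponentiations per batch
#define WORDS  16     // 512 bits as 32-bit words

// Stand-in declaration for my actual kernel: one thread computes one r = a^x mod n.
__global__ void modexp_kernel(unsigned int *r, const unsigned int *a,
                              const unsigned int *x, const unsigned int *n);

void launch_batch(const unsigned int *h_a, const unsigned int *h_x,
                  const unsigned int *h_n, unsigned int *h_r)
{
    size_t bytes = (size_t)N_OPS * WORDS * sizeof(unsigned int);  // 64000 bytes per operand array here
    unsigned int *d_a, *d_x, *d_n, *d_r;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_n, bytes); cudaMalloc(&d_r, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_n, h_n, bytes, cudaMemcpyHostToDevice);

    modexp_kernel<<<(N_OPS + 127) / 128, 128>>>(d_r, d_a, d_x, d_n);

    cudaMemcpy(h_r, d_r, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_x); cudaFree(d_n); cudaFree(d_r);
}
```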
My question: why does this minor change influence the performance so much?
I am using an Nvidia GeForce GT 520MX with Ubuntu 14.04 64-bit.