I have developed an application in CUDA for modular exponentiation, and it performs very well for 512-bit integers. These multi-precision integers are stored as 16 32-bit words.
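For reference, this is roughly how one such integer is laid out (the type and macro names here are only illustrative, not taken from an existing library):

```cuda
// One 512-bit multi-precision integer: 16 little-endian 32-bit limbs,
// least significant limb first.
#define LIMBS_512 16

typedef struct {
    unsigned int limb[LIMBS_512];   // limb[0] = least significant 32 bits
} mp512_t;
```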
Some techniques I use to achieve a 2.5-3.2x speedup compared to OpenSSL's modular exponentiation:
- `__shared__` memory
- CUDA memory alignment
- PTX code for 32-bit addition and multiplication (a minimal sketch follows this list)
- loop unrolling
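To illustrate the PTX point, this is the kind of inline PTX I mean, sketched here as a 128-bit addition building block (the function name and the use of `uint4` are just for the example, not my actual kernel code):

```cuda
// Adds two 128-bit values held in uint4 limbs (x = least significant word),
// propagating the carry with PTX add.cc / addc instructions.
__device__ void add_128(uint4 &r, const uint4 &a, const uint4 &b)
{
    asm("add.cc.u32  %0, %4,  %8;\n\t"
        "addc.cc.u32 %1, %5,  %9;\n\t"
        "addc.cc.u32 %2, %6, %10;\n\t"
        "addc.u32    %3, %7, %11;\n\t"
        : "=r"(r.x), "=r"(r.y), "=r"(r.z), "=r"(r.w)
        : "r"(a.x), "r"(a.y), "r"(a.z), "r"(a.w),
          "r"(b.x), "r"(b.y), "r"(b.z), "r"(b.w));
}
```

The 512-bit version follows the same pattern extended to 16 limbs, and multiplication can be built similarly from the carry-propagating `mad.lo.cc.u32` / `madc.hi.u32` instructions.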
Everything is fine so far, but when I try to extend the integers to 1024 bits, performance drops dramatically to 0.1-0.3x, and the only difference is the memory needed to store one integer - now 32 x 32-bit words. Not to mention the 2048-bit version, which is hundreds of times slower.
I should mention that when I want to compute, for example, 1000 modular exponentiations (r = a^x mod n), I send all the operands to my kernel at once, which means 512000 bytes of memory.
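For context, the host-side batching looks roughly like this (the kernel signature, names, and launch configuration are illustrative stand-ins, not my exact code):

```cuda
#include <cuda_runtime.h>

#define N_OPS  1000   // independent exponentiations per batch
#define WORDS  16     // 512 bits as 32-bit words

// Stand-in declaration for my actual kernel: one thread computes one r = a^x mod n.
__global__ void modexp_kernel(unsigned int *r, const unsigned int *a,
                              const unsigned int *x, const unsigned int *n);

void launch_batch(const unsigned int *h_a, const unsigned int *h_x,
                  const unsigned int *h_n, unsigned int *h_r)
{
    size_t bytes = (size_t)N_OPS * WORDS * sizeof(unsigned int);  // 64000 bytes per operand array here
    unsigned int *d_a, *d_x, *d_n, *d_r;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_n, bytes); cudaMalloc(&d_r, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_n, h_n, bytes, cudaMemcpyHostToDevice);

    modexp_kernel<<<(N_OPS + 127) / 128, 128>>>(d_r, d_a, d_x, d_n);

    cudaMemcpy(h_r, d_r, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_x); cudaFree(d_n); cudaFree(d_r);
}
```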
My question: why does this minor change influence the performance so much?
I am using an Nvidia GeForce GT 520MX with Ubuntu 14.04 64-bit.