I am running CUFFT on an N*N dataset divided into chunks of N*N/p elements across p GPUs, and I have a question about calculating the performance. First, a bit about how I am doing it:
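- Send one N*N/p chunk to each GPU
- Run a batched 1-D FFT over the rows on each of the p GPUs
- Copy the N*N/p chunks back to the host and transpose the entire dataset
- Send the transposed chunks to the GPUs again (same as the first step)
- Run the batched 1-D FFT again (same as the second step)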
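To make the steps above concrete, here is a minimal pure-Python sketch of the same row/transpose/row scheme. It is only an illustration under assumptions: a naive DFT stands in for the CUFFT batched call, the "GPUs" are just loop iterations, p is assumed to divide N, and all function names are hypothetical.

```python
import cmath

def dft(row):
    """Naive O(n^2) 1-D DFT; a stand-in for one entry of a batched FFT."""
    n = len(row)
    return [sum(row[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def batched_row_dft(matrix, p):
    """Split the rows into p chunks (one per hypothetical GPU) and transform every row."""
    n = len(matrix)
    rows_per_chunk = n // p  # assumption: p divides N evenly
    out = []
    for g in range(p):
        # "Send" chunk g to GPU g, transform its rows, and "copy" the result back.
        chunk = matrix[g * rows_per_chunk:(g + 1) * rows_per_chunk]
        out.extend(dft(r) for r in chunk)
    return out

def transpose(matrix):
    """Host-side transpose of the full dataset."""
    return [list(col) for col in zip(*matrix)]

def fft2_row_column(matrix, p):
    """Row pass, host transpose, second row pass (which acts on the original columns)."""
    first_pass = batched_row_dft(matrix, p)
    return batched_row_dft(transpose(first_pass), p)
```

Note that in this sketch the result comes out in transposed layout, because no transpose follows the second pass; `transpose(fft2_row_column(m, p))` gives the 2-D DFT in the original orientation.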
I then compute the performance as:
Gflops = ( 1e-9 * 5 * N * N * lg(N*N) ) / execution time
where the execution time is calculated as:
execution time = Sum(memcpyHtoD + kernel + memcpyDtoH times for the row and column FFTs on each GPU)
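As a worked example of the two formulas above, here is a small Python sketch. It assumes that lg means the base-2 logarithm and that the per-GPU timings are available in seconds; the function names, the dictionary keys, and the sample timings are all hypothetical and not measured data.

```python
import math

def fft2d_gflops(n, seconds):
    """Gflops = 1e-9 * 5 * N * N * log2(N*N) / execution time."""
    flops = 5.0 * n * n * math.log2(n * n)  # assumption: lg is log base 2
    return 1e-9 * flops / seconds

def summed_time(per_gpu_times):
    """Sum memcpyHtoD + kernel + memcpyDtoH over both passes and over all GPUs."""
    return sum(t["h2d"] + t["kernel"] + t["d2h"]
               for gpu in per_gpu_times   # one entry per GPU
               for t in gpu)              # one entry per pass (row, column)

# Made-up timings in seconds for 2 GPUs, each with a row pass and a column pass.
one_pass = {"h2d": 0.001, "kernel": 0.002, "d2h": 0.001}
timings = [[one_pass, one_pass], [one_pass, one_pass]]

total = summed_time(timings)
print(total, fft2d_gflops(1024, total))
```

For N = 1024 the operation count is 5 * 1024 * 1024 * 20 = 104,857,600 flops, so an execution time of 0.01 s would correspond to about 10.49 Gflops.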
Is this the correct way to evaluate CUFFT performance on multiple GPUs? Is there any other way I could represent the performance of the FFT?
Thanks.