
I am running CUFFT on chunks (N*N/p) divided among multiple GPUs, and I have a question about calculating the performance. First, a bit about how I am doing it (a minimal sketch of these steps follows the list):

  1. Send N*N/p chunks to each GPU
  2. Batched 1-D FFT for each row in p GPUs
  3. Get N*N/p chunks back to host - perform transpose on the entire dataset
  4. Ditto Step 1
  5. Ditto Step 2
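
For reference, here is a minimal sketch of what steps 1-3 look like for one GPU (the function and variable names, such as `fft_rows_on_gpu` and `d_chunk`, are placeholders rather than my exact code):

```cpp
// Minimal sketch of steps 1-3 for one GPU: each GPU gets N*N/p complex
// elements, i.e. N/p rows of length N (placeholder names, error checks omitted).
#include <cufft.h>
#include <cuda_runtime.h>

void fft_rows_on_gpu(int dev, const cufftDoubleComplex *h_chunk,
                     cufftDoubleComplex *h_result, int N, int p)
{
    int rows = N / p;                      // rows handled by this GPU
    size_t bytes = sizeof(cufftDoubleComplex) * (size_t)rows * N;

    cudaSetDevice(dev);

    cufftDoubleComplex *d_chunk;
    cudaMalloc(&d_chunk, bytes);

    // Step 1: send the N*N/p chunk to this GPU
    cudaMemcpy(d_chunk, h_chunk, bytes, cudaMemcpyHostToDevice);

    // Step 2: batched 1-D Z2Z FFT, one transform per row
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_Z2Z, rows);
    cufftExecZ2Z(plan, d_chunk, d_chunk, CUFFT_FORWARD);

    // Step 3: copy the chunk back; the transpose is then done on the host
    cudaMemcpy(h_result, d_chunk, bytes, cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(d_chunk);
}
```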

I am calculating GFLOPS as:

GFLOPS = (1e-9 * 5 * N * N * log2(N*N)) / execution time

and Execution time is calculated as:

execution time = Sum(memcpyHtoD + kernel + memcpyDtoH times for row and col FFT for each GPU)
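
In code, each term in that sum is measured per GPU roughly like this (a sketch using CUDA events; placeholder names, and the column pass after the host transpose is timed the same way):

```cpp
// Sketch: time HtoD + batched row FFT + DtoH for one GPU with CUDA events.
// (Placeholder names; the column pass is timed the same way after the transpose.)
#include <cufft.h>
#include <cuda_runtime.h>

double timed_row_fft(cufftHandle plan, cufftDoubleComplex *d_chunk,
                     const cufftDoubleComplex *h_chunk,
                     cufftDoubleComplex *h_result, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_chunk, h_chunk, bytes, cudaMemcpyHostToDevice);   // HtoD
    cufftExecZ2Z(plan, d_chunk, d_chunk, CUFFT_FORWARD);           // batched 1-D FFTs
    cudaMemcpy(h_result, d_chunk, bytes, cudaMemcpyDeviceToHost);  // DtoH
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed milliseconds on this GPU

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / 1000.0;                       // seconds; summed over GPUs and both passes
}
```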

Is this the correct way to evaluate CUFFT performance on multiple GPUs? Is there any other way I could represent the performance of FFT?

Thanks.

Sayan
  • Is this a for a real or complex transform? – talonmies Feb 18 '12 at 05:19
  • cufftZ2Z - am I correct in including the fft plan creation and destruction timings in execution time? I see a considerable difference if I do not include them... – Sayan Feb 18 '12 at 10:58
  • 1
    There is no "correct" answer to that. You should report precisely what your timings include. Plan creation might include lazy runtime API context establishment. You probably don't want that if it does. I don't really use CUFFT and don't know much about its internals. – talonmies Feb 18 '12 at 11:28
  • 1
    Out of curiosity, why do you have 5 * 1e-9 (the question is for the "5" part) –  Feb 23 '12 at 00:01
  • @CarlodelMundo: The operation count of a complex FFT of length N is `5 N log2(N)` (this is where the 5 comes from). The `1e-9` is a conversion factor from FLOP/s to GFLOP/s. – talonmies Feb 29 '12 at 14:30

1 Answer


If you are doing a complex transform, the operation count is correct (it should be 2.5 N log2(N) for a real-valued transform), but the GFLOP/s formula is incorrect. In a parallel, multiprocessor operation the usual calculation of throughput is

operation count / wall clock time

In your case, presuming the GPUs are operating in parallel, either measure the wall clock time (i.e. how long the whole operation took) for the execution time, or use this:

execution time = max(memcpyHtoD + kernel + memcpyDtoH times for row and col FFT for each GPU)

As it stands, your calculation represents the serial execution time. Allowing for the overheads from the multi-GPU scheme, I would expect that the calculated performance numbers you are getting will be lower than the equivalent transform done on a single GPU.
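
For illustration, a wall clock measurement might look like the sketch below, where `run_multi_gpu_2d_fft` is just a stand-in for your whole pipeline (steps 1-5) with the GPUs working concurrently:

```cpp
// Sketch: one wall clock measurement around the whole multi-GPU 2-D FFT,
// with GFLOP/s computed from that single elapsed time.
#include <chrono>
#include <cmath>

void run_multi_gpu_2d_fft(int N, int p);   // stand-in for your steps 1-5

double measure_gflops(int N, int p)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    run_multi_gpu_2d_fft(N, p);            // GPUs working concurrently
    auto t1 = std::chrono::high_resolution_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();

    // Complex 2-D transform of N*N points: 5 * N^2 * log2(N^2) flops
    double flops = 5.0 * (double)N * N * std::log2((double)N * N);
    return 1e-9 * flops / seconds;         // GFLOP/s from wall clock time
}
```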

talonmies
  • Thank you, I see where I went wrong. Would it be correct to say that if data transfer is included, then GPU performance is comparable to that of n threads on a CPU? – Sayan Feb 18 '12 at 10:56
  • Sorry I don't understand what you are trying to ask. – talonmies Feb 18 '12 at 11:11
  • I noticed that if I include the memcpyHtoD/DtoH times in `execution time`, the GFLOP/s of the GPU and of FFTW on multiple CPU threads are close; my aim is to compare the CPU performance of the FFT with that of the GPU, which is why I asked. – Sayan Feb 18 '12 at 21:40