
I am running CUFFT on chunks (N*N/p) divided among multiple GPUs, and I have a question about calculating the performance. First, a bit about how I am doing it (a minimal sketch of these steps follows the list):

  1. Send N*N/p chunks to each GPU
  2. Batched 1-D FFT for each row in p GPUs
  3. Get N*N/p chunks back to host - perform transpose on the entire dataset
  4. Ditto Step 1
  5. Ditto Step 2
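
For reference, here is a minimal sketch of what steps 1-3 look like for one GPU (the function and variable names, such as `fft_rows_on_gpu` and `d_chunk`, are placeholders rather than my exact code):

```cpp
// Minimal sketch of steps 1-3 for one GPU: each GPU gets N*N/p complex
// elements, i.e. N/p rows of length N (placeholder names, error checks omitted).
#include <cufft.h>
#include <cuda_runtime.h>

void fft_rows_on_gpu(int dev, const cufftDoubleComplex *h_chunk,
                     cufftDoubleComplex *h_result, int N, int p)
{
    int rows = N / p;                      // rows handled by this GPU
    size_t bytes = sizeof(cufftDoubleComplex) * (size_t)rows * N;

    cudaSetDevice(dev);

    cufftDoubleComplex *d_chunk;
    cudaMalloc(&d_chunk, bytes);

    // Step 1: send the N*N/p chunk to this GPU
    cudaMemcpy(d_chunk, h_chunk, bytes, cudaMemcpyHostToDevice);

    // Step 2: batched 1-D Z2Z FFT, one transform per row
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_Z2Z, rows);
    cufftExecZ2Z(plan, d_chunk, d_chunk, CUFFT_FORWARD);

    // Step 3: copy the chunk back; the transpose is then done on the host
    cudaMemcpy(h_result, d_chunk, bytes, cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(d_chunk);
}
```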

I am calculating GFLOPS as:

GFLOPS = (1e-9 * 5 * N * N * log2(N*N)) / execution time

and Execution time is calculated as:

execution time = Sum(memcpyHtoD + kernel + memcpyDtoH times for row and col FFT for each GPU)
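
In code, each term in that sum is measured per GPU roughly like this (a sketch using CUDA events; placeholder names, and the column pass after the host transpose is timed the same way):

```cpp
// Sketch: time HtoD + batched row FFT + DtoH for one GPU with CUDA events.
// (Placeholder names; the column pass is timed the same way after the transpose.)
#include <cufft.h>
#include <cuda_runtime.h>

double timed_row_fft(cufftHandle plan, cufftDoubleComplex *d_chunk,
                     const cufftDoubleComplex *h_chunk,
                     cufftDoubleComplex *h_result, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_chunk, h_chunk, bytes, cudaMemcpyHostToDevice);   // HtoD
    cufftExecZ2Z(plan, d_chunk, d_chunk, CUFFT_FORWARD);           // batched 1-D FFTs
    cudaMemcpy(h_result, d_chunk, bytes, cudaMemcpyDeviceToHost);  // DtoH
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed milliseconds on this GPU

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / 1000.0;                       // seconds; summed over GPUs and both passes
}
```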

Is this the correct way to evaluate CUFFT performance on multiple GPUs? Is there any other way I could represent the performance of FFT?

Thanks.

Sayan
  • Is this a for a real or complex transform? – talonmies Feb 18 '12 at 05:19
  • cufftZ2Z - am I correct in including the fft plan creation and destruction timings in execution time? I see a considerable difference if I do not include them... – Sayan Feb 18 '12 at 10:58
  • 1
    There is no "correct" answer to that. You should report precisely what your timings include. Plan creation might include lazy runtime API context establishment. You probably don't want that if it does. I don't really use CUFFT and don't know much about its internals. – talonmies Feb 18 '12 at 11:28
  • 1
    Out of curiosity, why do you have 5 * 1e-9 (the question is for the "5" part) –  Feb 23 '12 at 00:01
  • @CarlodelMundo: The operation count of a complex FFT of length N is `5 N log2(N)` (this is where the 5 comes from). The `1e-9` is a conversion factor from FLOP/s to GFLOP/s. – talonmies Feb 29 '12 at 14:30

1 Answer


If you are doing a complex transform, the operation count is correct (it should be 2.5 N log2(N) for a real-valued transform), but the GFLOP/s formula is incorrect. In a parallel, multiprocessor operation the usual calculation of throughput is

operation count / wall clock time

In your case, presuming the GPUs are operating in parallel, either measure the wall clock time (i.e. how long the whole operation took) for the execution time, or use this:

execution time = max(memcpyHtoD + kernel + memcpyDtoH times for row and col FFT for each GPU)

As it stands, your calculation represents the serial execution time. Allowing for the overheads from the multi-GPU scheme, I would expect that the calculated performance numbers you are getting will be lower than the equivalent transform done on a single GPU.
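
For illustration, a wall clock measurement might look like the sketch below, where `run_multi_gpu_2d_fft` is just a stand-in for your whole pipeline (steps 1-5) with the GPUs working concurrently:

```cpp
// Sketch: one wall clock measurement around the whole multi-GPU 2-D FFT,
// with GFLOP/s computed from that single elapsed time.
#include <chrono>
#include <cmath>

void run_multi_gpu_2d_fft(int N, int p);   // stand-in for your steps 1-5

double measure_gflops(int N, int p)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    run_multi_gpu_2d_fft(N, p);            // GPUs working concurrently
    auto t1 = std::chrono::high_resolution_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();

    // Complex 2-D transform of N*N points: 5 * N^2 * log2(N^2) flops
    double flops = 5.0 * (double)N * N * std::log2((double)N * N);
    return 1e-9 * flops / seconds;         // GFLOP/s from wall clock time
}
```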

talonmies
  • Thank you, I see where I went wrong. Would it be correct to say that if data transfer is included, then GPU performance is comparable to that of n threads on a CPU? – Sayan Feb 18 '12 at 10:56
  • Sorry I don't understand what you are trying to ask. – talonmies Feb 18 '12 at 11:11
  • I noticed that if I include the memcpyHtoD/DtoH times in `execution time`, the GFLOP/s of the GPU and of FFTW on multiple CPU threads are close; my aim is to compare the CPU performance of the FFT with that of the GPU, which is why I asked. – Sayan Feb 18 '12 at 21:40