Standard functions such as `time()` often have very low resolution. A good way around this is to run your test many times and take an average. Note that the first few runs may be extra slow because of hidden start-up costs, especially when complex resources like GPUs are involved.
For platform-specific calls, take a look at `QueryPerformanceCounter` on Windows and `CFAbsoluteTimeGetCurrent` on OS X. (I haven't used the POSIX call `clock_gettime`, but it may be worth checking out.)
Measuring GPU performance is tricky because GPUs are remote processing units running their own instruction streams, often on many parallel units, and GPU calls typically return before the work is done. You might want to visit Nvidia's CUDA Zone for a variety of resources and tools to help measure and optimize CUDA code. (Resources related to OpenCL are also highly relevant.)
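Because kernel launches are asynchronous, a CPU timer around a launch mostly measures launch overhead. CUDA's event API records timestamps on the GPU itself, in stream order. A sketch, where the `work` kernel is a hypothetical stand-in for your own:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

/* Hypothetical kernel standing in for the work being measured. */
__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* Events are recorded in the same stream as the kernel, so the
       elapsed time covers exactly the work between them. */
    cudaEventRecord(start);
    work<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  /* block until the GPU reaches 'stop' */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

As with CPU timing, launch the kernel once as a warm-up before timing it, since the first launch pays one-time initialization costs.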
Ultimately, you want to know how fast your results make it to the screen, right? For that reason, a call to `time()` may well suffice for your needs.