I have a PyOpenCL program that performs a long calculation (~3-5 hours per run). Several kernels are launched one after another in a loop, so the structure looks like this:
```python
prepare_kernels_and_data()

for i in range(big_number):    # in my case big_number is 400000
    load_data_to_device(i)     # ~0.0002 s
    run_kernel1(i)             # ~0.0086 s
    run_kernel2(i)             # ~0.00028 s
    store_data_from_device(i)  # ~0.0002 s
```
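For the event-based numbers, the usual PyOpenCL pattern is to create the queue with profiling enabled and read the event's `profile` counters after the event has completed. A minimal sketch (the buffer and the copy stand in for my real kernels and data, they are not my actual code):

```python
import numpy as np

try:
    import pyopencl as cl  # sketch only runs if pyopencl and a device exist
    HAVE_CL = True
except ImportError:
    HAVE_CL = False


def event_seconds(evt):
    """Device-side elapsed time of a finished event, in seconds.

    evt.profile.start / evt.profile.end are integer nanosecond counters.
    """
    return (evt.profile.end - evt.profile.start) * 1e-9


if HAVE_CL:
    try:
        ctx = cl.create_some_context(interactive=False)
        # Profiling must be requested when the queue is created,
        # otherwise reading evt.profile raises an error.
        queue = cl.CommandQueue(
            ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

        host = np.arange(1024, dtype=np.float32)
        dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, host.nbytes)

        evt = cl.enqueue_copy(queue, dev, host)  # stands in for load_data_to_device(i)
        evt.wait()  # profile counters are only valid once the event completes
        print("transfer took %.6f s" % event_seconds(evt))
    except Exception:
        pass  # no usable OpenCL device in this environment
```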
I measured the time and got the following:
- Total run time is 4:30 hours (measured with the Linux `time` command)
- Pure OpenCL event-based timing is 3:30 hours (load + calculate + store)
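Plugging the two measurements above into a quick calculation shows what the gap works out to, depending on which baseline you divide by (a minimal sketch; the 4:30 and 3:30 figures are the measurements quoted above):

```python
# Quick arithmetic on the two measurements quoted above.
wall_s = (4 * 60 + 30) * 60    # `time` command: 4 h 30 min -> seconds
device_s = (3 * 60 + 30) * 60  # sum of OpenCL event timings: 3 h 30 min

overhead_s = wall_s - device_s  # time not accounted for by device events

# Two ways to express the same gap:
rel_to_device = overhead_s / device_s  # overhead on top of device time
rel_to_wall = overhead_s / wall_s      # share of the total run

print(f"overhead: {overhead_s} s "
      f"({rel_to_device:.1%} of device time, {rel_to_wall:.1%} of wall time)")
```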
I'd like to know:
- How big is the minimal overhead for an OpenCL program? In my case it is roughly 29% ((4:30 - 3:30) / 3:30)
- Should I trust event-based timings?
- Does enabling profiling add significant time to the overall program execution?
I know that the overhead depends on the program, and I know that Python isn't as fast as pure C or C++. But I believe that once I move all my heavy calculations into OpenCL kernels, I should lose no more than 5-7%. Please correct me if I'm wrong.
P.S. AMD OpenCL, AMD GPU