While processing a vector of 1,000,000 elements, I tried printing the global ID of every 10,000th work-item to monitor progress during development. I added these lines to the kernel:
"#pragma OPENCL EXTENSION cl_amd_printf : enable \n" \
and
" if(id % 10000 == 0){ \n" \
" printf(\"%d\\r\\n\", id); \n" \
" } \n" \
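For context, here is a minimal sketch of how those fragments might sit in a complete kernel string, with a host-side helper that counts how many work-items pass the guard. The kernel name `process`, its argument, and the helper `count_printing_items` are hypothetical placeholders. Only the pragma and the `if`/`printf` block are from my actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel source: only the pragma and the if/printf block
   are the real fragments; the kernel name, argument, and body are
   placeholders for illustration. */
const char *kernel_src =
    "#pragma OPENCL EXTENSION cl_amd_printf : enable \n"
    "__kernel void process(__global float *data) { \n"
    "    int id = get_global_id(0); \n"
    "    /* ... per-element work goes here ... */ \n"
    "    if(id % 10000 == 0){ \n"
    "        printf(\"%d\\r\\n\", id); \n"
    "    } \n"
    "} \n";

/* Host-side count of the work-items with IDs 0..n-1 that satisfy the
   same guard as the kernel (id % step == 0). */
size_t count_printing_items(size_t n, size_t step)
{
    assert(step > 0);
    size_t count = 0;
    for (size_t id = 0; id < n; ++id) {
        if (id % step == 0) {
            ++count;
        }
    }
    return count;
}
```

For 1,000,000 work-items and a step of 10,000, `count_printing_items(1000000, 10000)` returns 100, so the kernel prints only 100 short lines in total.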
This bloated the normal execution time of 3.0-3.3 seconds to 38-40 seconds. I could not find any mention of performance in section A.8.10 of the AMD OpenCL 3.0 SDK, so it is not clear to me whether this behavior is normal.
Is this performance hit normal and expected, or am I doing something wrong?