For illustrative purposes, let `__device__ void distance(char *s1, char *s2)` be the device function, which is run across several blocks and threads by a kernel launched as `compute<<<1024,256>>>(s1, s2, s3)`.
We can assume `char *s1` and `char *s2` are generated prior to issuing any CUDA instructions, and that they remain constant throughout the execution of all kernels. Is there a way to allocate `s1` and `s2` such that transferring them to all threads is optimized? Is using the `__constant__` declaration an appropriate way to optimize the data transfer? I'm using a device with compute capability 8.0+.
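To make the question concrete, here is roughly the approach I am considering (a minimal sketch; `MAX_LEN`, the string contents, and the empty body of `distance` are placeholders, and the kernel no longer takes `s1`/`s2` as arguments because `__constant__` symbols are file-scope globals):

```cuda
#include <cuda_runtime.h>

#define MAX_LEN 256  // placeholder upper bound on string length

// Candidate approach: place the read-only inputs in constant memory.
__constant__ char c_s1[MAX_LEN];
__constant__ char c_s2[MAX_LEN];

__device__ void distance(const char *s1, const char *s2)
{
    // placeholder: the real function compares s1 and s2
}

__global__ void compute(char *s3)
{
    // every thread reads the same constant strings
    distance(c_s1, c_s2);
}

int main()
{
    const char h_s1[MAX_LEN] = "first string";   // placeholder data
    const char h_s2[MAX_LEN] = "second string";  // placeholder data

    // copy the host strings into constant memory once, before any launch
    cudaMemcpyToSymbol(c_s1, h_s1, MAX_LEN);
    cudaMemcpyToSymbol(c_s2, h_s2, MAX_LEN);

    char *d_s3;
    cudaMalloc(&d_s3, MAX_LEN);

    compute<<<1024, 256>>>(d_s3);
    cudaDeviceSynchronize();

    cudaFree(d_s3);
    return 0;
}
```

Is this the right way to expose the two strings to all threads, or is there a better-suited mechanism on this architecture?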