
For illustrative purposes, let

__device__ void distance(char *s1, char* s2)

be a device function called from a kernel that is launched across several blocks and threads as compute<<<1024,256>>>(s1, s2, s3).

We can assume char *s1 and char *s2 are generated before any CUDA calls are issued, and that they remain constant throughout the execution of all kernels. Is there a way to allocate s1 and s2 such that transferring them to all threads is optimized? Is the __constant__ qualifier an appropriate way to optimize this data transfer?

I'm using a device with compute capability 8.0+.
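To make the setup concrete, here is a minimal sketch of the pattern described above. The function names (`distance`, `compute`), the string length `N`, and the per-thread output array are assumptions for illustration, not part of the original question:

```cuda
#include <cuda_runtime.h>

constexpr size_t N = 1024;  // assumed length of s1 and s2

__device__ int distance(const char *s1, const char *s2) {
    // ... some per-thread distance computation over s1/s2 ...
    return 0;
}

__global__ void compute(const char *s1, const char *s2, int *s3) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    s3[tid] = distance(s1, s2);
}

int main() {
    char *d_s1, *d_s2;
    int  *d_s3;
    cudaMalloc(&d_s1, N);
    cudaMalloc(&d_s2, N);
    cudaMalloc(&d_s3, 1024 * 256 * sizeof(int));
    // d_s1 and d_s2 are filled once, before any kernel launches,
    // and are read-only for the rest of the program.
    compute<<<1024, 256>>>(d_s1, d_s2, d_s3);
    cudaDeviceSynchronize();
    cudaFree(d_s1); cudaFree(d_s2); cudaFree(d_s3);
    return 0;
}
```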

  • constant memory has a limit of 64 KB, so unless your s1 and s2 are small, it is unlikely to be of much use to you – talonmies Dec 31 '20 at 03:34
  • @talonmies I only meant sharing the pointers themselves, so 16 bytes only. – Ameer Jewdaki Dec 31 '20 at 03:39
  • All kernel arguments are automagically passed in constant memory, so the optimal solution is to do exactly nothing – talonmies Dec 31 '20 at 03:51
  • See the linked duplicate for a more elaborate answer covering simple arguments like pointers, as well as passing structs by value – talonmies Dec 31 '20 at 04:07
  • Thanks; my experiments also showed that adding `__constant__` didn't improve the results, so I was wondering if there's anything else to do. – Ameer Jewdaki Dec 31 '20 at 04:21
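The point made in the comments can be sketched as follows: pointer arguments to a `__global__` function already travel through constant memory as part of the launch, so no qualifier is needed on them. A `__constant__` variable is a separate, explicitly managed declaration, useful only for small data you copy there yourself (the lookup table `c_lut` below is a hypothetical example, within the 64 KB limit):

```cuda
#include <cuda_runtime.h>

// Hypothetical small table placed in the constant bank explicitly.
__constant__ char c_lut[256];

__global__ void compute(const char *s1, const char *s2, int *s3) {
    // s1 and s2 arrive via constant memory automatically; c_lut is
    // read through the same cached constant path.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    s3[tid] = c_lut[(unsigned char)s1[tid]] - c_lut[(unsigned char)s2[tid]];
}

// Host side: the table is uploaded once with
//   cudaMemcpyToSymbol(c_lut, host_lut, sizeof(c_lut));
// whereas the s1/s2 pointers need no such step at all.
```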

0 Answers