1

What is the best way to lay this out in local memory to reduce bank conflicts?

I was thinking:

RRRRRRRRRRRR...
GGGGGGGGGGGG...
BBBBBBBBBBBB...
AAAAAAAAAAAA...

I would like to grab all four channels at once to use in vector operations.

Thanks!

Jacko

2 Answers

1

Then use "RGBARGBARGBARGBA..." and you can grab all four channels at once to use in a vector. Plus, it's one read instead of 4.
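As a rough illustration of the two layouts being compared, here is how a (pixel, channel) pair could map to a float index in each case. The function names and the `num_pixels` parameter are hypothetical, chosen only for this sketch.

```c
#include <stddef.h>

/* Interleaved (RGBARGBA...): the four channels of one pixel are adjacent,
 * so a single 4-wide vector load can fetch the whole pixel. */
static size_t interleaved_index(size_t pixel, size_t channel) {
    return 4 * pixel + channel;
}

/* Planar (RRR...GGG...BBB...AAA...): each channel is its own contiguous
 * plane of num_pixels floats, so one pixel needs four separate reads. */
static size_t planar_index(size_t pixel, size_t channel, size_t num_pixels) {
    return channel * num_pixels + pixel;
}
```

With the interleaved form, pixel `p` starts at float index `4 * p`, which is the offset a `vload4(p, buf)` call would read from.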

Bank conflicts occur when multiple work items access, in the same instruction, different addresses that map to the same bank, that is, addresses separated by a multiple of the number of banks times the bank width. So your image layout doesn't matter as much as your row pitch when it comes to causing a bank conflict.
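To make the row pitch point concrete, here is a minimal sketch of a simplified bank model. It assumes 16 banks, each one 32-bit word wide, with words assigned to banks by a plain modulo; real hardware may use a different bank count or mapping. `bank_of_word` and `conflict_degree` are hypothetical helper names.

```c
#include <stddef.h>

enum { NUM_BANKS = 16 };  /* assumed bank count, not taken from any device spec */

/* Bank holding a given 32-bit word under the assumed modulo mapping. */
static unsigned bank_of_word(size_t word_index) {
    return (unsigned)(word_index % NUM_BANKS);
}

/* Worst-case number of work items hitting the same bank when work item i
 * reads word i * stride_words (stride_words must be > 0, so all words are
 * distinct). A result of 1 means the access is conflict-free in this model. */
static unsigned conflict_degree(size_t stride_words, unsigned num_items) {
    unsigned hits[NUM_BANKS] = {0};
    unsigned worst = 0;
    for (unsigned i = 0; i < num_items; ++i) {
        unsigned b = bank_of_word((size_t)i * stride_words);
        if (++hits[b] > worst) worst = hits[b];
    }
    return worst;
}
```

In this model, 16 work items walking down a column with a row pitch of 16 words all land in bank 0 (`conflict_degree(16, 16)` is 16), while padding the pitch to 17 words spreads them over all banks (`conflict_degree(17, 16)` is 1).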

Dithermaster
  • Thanks. Suppose each channel is a 32-bit float, and there are 16 banks in the LDS. Each work item needs to access two adjacent pixels. So, work item 0 accesses buf[0] and buf[1], work item 1 accesses buf[1] and buf[2], and so on. So there will definitely be bank conflicts. I am just trying to minimize them. If I use RGBA format, then conflicts may be more severe because four channels are affected in a conflict, instead of one. That is why I was thinking of the planar configuration mentioned in the original question. What do you think? – Jacko Aug 05 '14 at 18:17
  • Sorry, I'm not an expert in bank conflicts. See http://stackoverflow.com/questions/3841877/what-is-a-bank-conflict-doing-cuda-opencl-programming. My guess is that once you're reading more than 32 bits per work item the reads get serialized (256 bits at a time) across the work group, and you won't have bank conflicts. – Dithermaster Aug 05 '14 at 19:59
  • Yes, good point. It's taking me some time to think in terms of SIMT execution. I am planning on using the vload4 function to load all pixel channels in one go. But I think there should still be no bank conflicts in this case. – Jacko Aug 05 '14 at 22:44
  • Thanks. Running a simple kernel with this architecture on an AMD HD7700 looks very promising. Thanks again for your help. – Jacko Aug 06 '14 at 02:59
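The scenario discussed in these comments (32-bit floats, 16 banks, work item i reading pixels i and i+1) can be explored with a small host-side model. This is only a sketch under loud assumptions: banks are assigned by word index modulo 16, every channel read is a separate scalar 32-bit access, and a 128-bit `vload4` is not modeled at all, since how wide reads are split is device-specific. All names and the `NUM_PIXELS` value are hypothetical.

```c
#include <stddef.h>

enum { NUM_BANKS = 16, NUM_ITEMS = 16, NUM_PIXELS = 64 };  /* assumed sizes */

/* Word read by work item `item` for channel `c` of pixel (item + offset). */
static size_t planar_word(unsigned item, unsigned offset, unsigned c) {
    return (size_t)c * NUM_PIXELS + item + offset;
}

static size_t interleaved_word(unsigned item, unsigned offset, unsigned c) {
    return 4u * ((size_t)item + offset) + c;
}

typedef size_t (*word_fn)(unsigned item, unsigned offset, unsigned c);

/* Worst-case bank hits for ONE scalar read instruction issued by
 * NUM_ITEMS work items; 1 means conflict-free in this model. */
static unsigned worst_bank_hits(word_fn word, unsigned offset, unsigned c) {
    unsigned hits[NUM_BANKS] = {0};
    unsigned worst = 0;
    for (unsigned i = 0; i < NUM_ITEMS; ++i) {
        unsigned b = (unsigned)(word(i, offset, c) % NUM_BANKS);
        if (++hits[b] > worst) worst = hits[b];
    }
    return worst;
}
```

Under these assumptions a planar scalar read has stride 1 across work items and stays conflict-free, while a scalar read of one channel from the interleaved layout has stride 4 and comes out 4-way conflicted. The reads of pixel i and pixel i+1 are separate instructions here, so they are counted separately rather than against each other.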
0

On my target architecture, HD7700, the planar configuration gave the best performance: vload4 was much slower. I think this must be due to bank conflicts, but I am not sure.

Jacko