Slow texture fetch in fragment shader using Vulkan

Question

I am writing an SSAO shader with a kernel size of 64.

SSAO fragment shader:

    const int kernelSize = 64;
    for (int i = 0; i < kernelSize; i++) {
        // Get sample position
        vec3 s = tbn * ubo.kernel[i].xyz;
        s = s * radius + origin;
        vec4 offset = vec4(s, 1.0);
        offset = ubo.projection * offset;
        offset.xy /= offset.w;
        offset.xy = offset.xy * 0.5 + 0.5;
        float sampleDepth = texture(samplerposition, offset.xy).z;
        float rangeCheck = abs(origin.z - sampleDepth) < radius ? 1.0 : 0.0;
        occlusion += (sampleDepth >= s.z ? 1.0 : 0.0) * rangeCheck;
    }

The samplerposition texture has the format VK_FORMAT_R16G16B16A16_SFLOAT and is uploaded with the flag VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT.

I'm using a laptop with an NVIDIA K1100M graphics card. When I run the code in RenderDoc, this shader takes 114 ms. If I change kernelSize to 1, it takes 1 ms.

Is this texture fetch time normal? Or can it be that I have set up something wrong somewhere?

For example, could the layout transition have failed, leaving the texture in VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL instead of VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL?

Why do you think this is Vulkan-specific? You simply do a lot of work! I think running the loop 64x costs more than 64 times the 1x run because with 1x the compiler gets rid of the loop. Try changing the resolution. Does the time change linearly with the number of pixels? If yes, you are just saturating the GPU. Optimize the loop! – starmole, Aug 16 '16 at 05:14
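
To put "a lot of work" into numbers, here is a rough count. It assumes one fragment per pixel, the 1024x1024 window mentioned in the comments further down, and 8 bytes per VK_FORMAT_R16G16B16A16_SFLOAT texel. The symbol names are made up for this sketch.

```latex
\begin{align*}
N_{\text{fetch}} &= 1024 \times 1024 \times 64 = 67{,}108{,}864 \\
B_{\text{requested}} &= N_{\text{fetch}} \times 8\ \text{bytes} = 536{,}870{,}912\ \text{bytes} = 512\ \text{MiB} \\
t_{\text{avg}} &\approx \frac{114\ \text{ms}}{N_{\text{fetch}}} \approx 1.7\ \text{ns per fetch}
\end{align*}
```

The 512 MiB figure is the texel data requested per frame, not measured memory traffic, and the 1.7 ns is an average over the whole draw call (ALU work included). It says nothing about the latency of a single fetch, since the GPU runs many fragments in parallel.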

score 4 · Accepted Answer · answered Aug 15 '16 at 12:47

GPU memory access relies heavily on caching, and the cache is far less effective when neighboring fragments do not sample neighboring texels, i.e. when spatial coherence is lacking. I would expect slowdowns of about 10x or more for random access to a texture versus linear, coherent access. SSAO is very prone to this when used with large radii.

I recommend using smaller radii and optimizing the texture accesses. You're sampling four 16-bit floats, but you're only using one. Blitting the depth to a separate 16-bit, depth-only image should give you an easy 4x speedup.
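
As a rough sketch of that suggestion (not the asker's actual shader), the loop could read from a single-channel image instead. Here samplerDepth is a hypothetical R16_SFLOAT copy of the z channel of samplerposition. The names samplerNormal, inUV and outOcclusion, the binding numbers, the radius value and the simplified tangent frame are all assumptions made to keep the example self-contained.

```glsl
#version 450

// Binding numbers and all names except samplerposition and ubo are assumptions.
layout (binding = 0) uniform sampler2D samplerposition; // RGBA16F view-space position
layout (binding = 1) uniform sampler2D samplerNormal;   // view-space normal (assumed)
layout (binding = 2) uniform sampler2D samplerDepth;    // assumed R16_SFLOAT copy of position.z

layout (binding = 3) uniform UBO {
    mat4 projection;
    vec4 kernel[64];
} ubo;

layout (location = 0) in vec2 inUV;
layout (location = 0) out float outOcclusion;

const int kernelSize = 64;
const float radius = 0.5; // assumed value

void main() {
    // One coherent full-width fetch per fragment for the shading point itself.
    vec3 origin = texture(samplerposition, inUV).xyz;
    vec3 normal = normalize(texture(samplerNormal, inUV).xyz);

    // Simplified tangent frame around the normal (no noise texture in this sketch).
    vec3 helper = abs(normal.y) < 0.99 ? vec3(0.0, 1.0, 0.0) : vec3(1.0, 0.0, 0.0);
    vec3 tangent = normalize(cross(helper, normal));
    vec3 bitangent = cross(normal, tangent);
    mat3 tbn = mat3(tangent, bitangent, normal);

    float occlusion = 0.0;
    for (int i = 0; i < kernelSize; i++) {
        // Same projection of the sample position as in the question.
        vec3 s = tbn * ubo.kernel[i].xyz * radius + origin;
        vec4 offset = ubo.projection * vec4(s, 1.0);
        offset.xy = offset.xy / offset.w * 0.5 + 0.5;

        // The scattered fetch now touches one 16-bit channel instead of four.
        float sampleDepth = texture(samplerDepth, offset.xy).r;

        float rangeCheck = abs(origin.z - sampleDepth) < radius ? 1.0 : 0.0;
        occlusion += (sampleDepth >= s.z ? 1.0 : 0.0) * rangeCheck;
    }
    outOcclusion = 1.0 - occlusion / float(kernelSize);
}
```

Only the 64 scattered fetches move to the narrow image. The single fetch of origin stays on the position texture because it is coherent across neighboring fragments. How much faster this runs depends on the hardware and the radius, so it is worth measuring.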

Quinchilion

It'd probably be better to not write the position at all. Just reconstruct it as needed from the depth and fragment coord; that's how most deferred renderers work. – Nicol Bolas Aug 15 '16 at 15:19

@NicolBolas Interesting, I was just searching for it, wonder why I never read about that before. It will not help for SSAO though – Samantha Aug 15 '16 at 18:28

@Samantha: Sure it will. Rather than fetching 64 bits, you only fetch 32 bits (the depth buffer value). – Nicol Bolas Aug 15 '16 at 18:45
@NicolBolas Yes indeed, but this is how people normally do it: iterate 32-64 times in the pixel shader for SSAO. It will still take a lot of time because you will still get cache misses. – Samantha Aug 15 '16 at 19:07

@Samantha: I mean each fetch only accesses 32 bits of data. Reducing the size of the data you retrieve improves cache coherency and overall is helpful to texture access performance. – Nicol Bolas Aug 15 '16 at 19:36
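
The reconstruction suggested in the comments above could look roughly like the following. This is a minimal sketch that assumes the application supplies the inverse projection matrix (invProjection is a hypothetical uniform), that the projection maps depth to Vulkan's [0, 1] range, and that inUV equals NDC xy * 0.5 + 0.5 as in the question's loop. The sampler name, binding numbers and output are likewise assumptions.

```glsl
#version 450

// Assumed: the depth attachment of the G-buffer pass, sampled as a texture.
layout (binding = 0) uniform sampler2D samplerDepth;

layout (binding = 1) uniform UBO {
    mat4 invProjection; // hypothetical: inverse of the projection matrix
} ubo;

layout (location = 0) in vec2 inUV;
layout (location = 0) out vec4 outPosition;

// Rebuild the view-space position of the fragment at uv from its depth value.
vec3 viewPosFromDepth(vec2 uv, float depth) {
    // Vulkan NDC: x and y in [-1, 1], z in [0, 1], so depth is used as-is.
    vec4 ndc = vec4(uv * 2.0 - 1.0, depth, 1.0);
    vec4 view = ubo.invProjection * ndc;
    // Undo the perspective divide.
    return view.xyz / view.w;
}

void main() {
    float depth = texture(samplerDepth, inUV).r;
    outPosition = vec4(viewPosFromDepth(inUV, depth), 1.0);
}
```

With a helper like this, the G-buffer no longer needs a position attachment, and the SSAO loop can compare depths fetched from the depth image alone.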

codetiger · Answer 2 · 2016-08-16T03:55:42.520

1

You are calculating the texture coordinates in the fragment shader, which means the GPU cannot pre-fetch the textures. It is better to calculate all texture coordinates in the vertex shader and pass them as varyings.

Updated: I would suggest using some more advanced SSAO tricks rather than computing the full AO map directly:

1. Render a much smaller AO map and upscale it with a blur filter. This will give much better results.
2. If you are doing real-time rendering, the AO map does not need to be recalculated every frame. You can fake it based on your setup.
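
The first trick could be sketched as a separate full-screen pass like the one below. It assumes the AO was rendered into a smaller target bound as samplerSSAO (a hypothetical name) and uses a plain 4x4 box blur. The binding number, the tap count and the input/output names are assumptions rather than anything from the original setup.

```glsl
#version 450

// Assumed: low-resolution AO target written by the SSAO pass.
layout (binding = 0) uniform sampler2D samplerSSAO;

layout (location = 0) in vec2 inUV;
layout (location = 0) out float outAO;

void main() {
    const int blurRange = 2; // taps from -2 to 1 in each axis, 16 in total

    // Size of one texel of the low-resolution AO map in UV space.
    vec2 texelSize = 1.0 / vec2(textureSize(samplerSSAO, 0));

    float sum = 0.0;
    int count = 0;
    for (int x = -blurRange; x < blurRange; x++) {
        for (int y = -blurRange; y < blurRange; y++) {
            vec2 tapOffset = vec2(float(x), float(y)) * texelSize;
            sum += texture(samplerSSAO, inUV + tapOffset).r;
            count++;
        }
    }

    // Average of the taps, written at the resolution of the output target.
    outAO = sum / float(count);
}
```

Halving the AO target in each dimension cuts the number of SSAO fragments to a quarter. The blur taps are adjacent texels, so they are cache-friendly compared with the scattered kernel samples.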

Disclaimer: I do a lot of OpenGL ES based shaders, and my knowledge is mostly limited to Mobile Platforms.

edited Aug 16 '16 at 03:55

answered Aug 15 '16 at 11:20

So 114 ms sounds like a normal time if it has not been pre-fetched by the shader? – Samantha Aug 15 '16 at 12:05
By 114 ms, do you mean per fragment or for the entire frame? – codetiger Aug 15 '16 at 12:07
For the SSAO shader, i.e. the whole draw call. That is 114 ms / 64 ≈ 1.8 ms per texel fetch on a 1024x1024 window. – Samantha Aug 15 '16 at 12:36
As far as I'm aware, dependent texture reads are indeed only an issue on mobile platforms. It's still a good idea on all platforms to move calculations from fragment shader to the vertex shader whenever you can, though. – Quinchilion Aug 15 '16 at 13:25
@codetiger It goes much faster when I pre-calculate the coordinates in the vertex shader, but I can't really do that in this case because I will not get the same interpolated values. – Samantha Aug 15 '16 at 14:38
@Samantha Can you explain why you are not able to pre-calculate the coordinates in vertex shaders? – codetiger Aug 16 '16 at 03:56
@codetiger: Probably because there are 64 of them. Most GPUs don't let the VS shove that much data at the FS. – Nicol Bolas Aug 16 '16 at 14:12
