I have written CUDA code to solve an NP-complete problem, but the performance was not what I expected.
I know about "some" optimization techniques (using shared memory, textures, zero-copy memory, ...).
What are the most important optimization techniques CUDA programmers should know about?