Usually you want to choose the size of your blocks based on your GPU architecture, with the goal of keeping occupancy on the Streaming Multiprocessor (SM) as close to 100% as you can. For example, the GPUs at my school can run 1536 threads per SM and up to 8 blocks per SM, but each block can have at most 1024 threads in total (across all of its dimensions). So if I were to launch a 1D kernel on the GPU, I could max out a block with 1024 threads, but then only 1 block would fit on the SM (about 67% occupancy). If I instead chose a smaller number, like 192 or 256 threads per block, then I could have 100% occupancy with 8 and 6 blocks respectively on the SM.
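The arithmetic above can be sketched in a few lines of host-side C++. This is only a rough model: it uses the limits quoted for my school's GPUs as assumed constants, the function names are made up for illustration, and it ignores register and shared-memory usage, which can lower occupancy further on real hardware.

```cpp
// Assumed hardware limits, taken from the example above.
// Query your own device rather than relying on these values.
constexpr int kMaxThreadsPerSM    = 1536;
constexpr int kMaxBlocksPerSM     = 8;
constexpr int kMaxThreadsPerBlock = 1024;

// Blocks of the given size that can be resident on one SM,
// considering only the thread-count and block-count limits.
constexpr int resident_blocks(int threads_per_block) {
    return (threads_per_block <= 0 || threads_per_block > kMaxThreadsPerBlock)
               ? 0
               : (kMaxThreadsPerSM / threads_per_block < kMaxBlocksPerSM
                      ? kMaxThreadsPerSM / threads_per_block
                      : kMaxBlocksPerSM);
}

// Fraction of the SM's thread slots that are filled (1.0 means 100%).
constexpr double occupancy(int threads_per_block) {
    return static_cast<double>(resident_blocks(threads_per_block) * threads_per_block)
           / kMaxThreadsPerSM;
}

// Compile-time checks of the three cases discussed in the text.
static_assert(resident_blocks(1024) == 1, "one 1024-thread block per SM");
static_assert(resident_blocks(192) == 8, "eight 192-thread blocks per SM");
static_assert(resident_blocks(256) == 6, "six 256-thread blocks per SM");
static_assert(occupancy(192) == 1.0 && occupancy(256) == 1.0, "full occupancy");
```

With these limits, very small blocks also hurt: 64 threads per block hits the 8-block cap first, so only 512 of the 1536 thread slots are used.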
Another thing to consider is the amount of memory that must be accessed relative to the amount of computation to be done. In many imaging applications, you don't just need the value at a single pixel; you need the surrounding pixels as well. CUDA groups its threads into warps, which step through every instruction simultaneously (currently, there are 32 threads to a warp, though that may change). Making your blocks square generally minimizes the amount of memory that needs to be loaded relative to the amount of computation that can be done, making the GPU more efficient. Likewise, blocks whose dimensions are a power of 2 load memory more efficiently (if properly aligned with memory addresses), since CUDA loads memory a whole line at a time instead of one value at a time.
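To put a rough number on the square-block claim, here is a small sketch in host-side C++. It assumes each block loads its own tile of pixels plus a border ("halo") of `radius` pixels on every side exactly once, for example into shared memory; the radius of 1 (a 3x3 neighbourhood) and the function names are my own illustration, not something from the question.

```cpp
// Pixels a block must load to compute a w x h tile of outputs when every
// output needs its neighbours up to `radius` pixels away (assumed model:
// tile plus halo, each pixel loaded once per block).
constexpr int pixels_loaded(int w, int h, int radius) {
    return (w + 2 * radius) * (h + 2 * radius);
}

// Loads per output pixel; lower is better.
constexpr double loads_per_output(int w, int h, int radius) {
    return static_cast<double>(pixels_loaded(w, h, radius)) / (w * h);
}

// Both shapes below have 256 threads, with an assumed radius of 1.
static_assert(pixels_loaded(16, 16, 1) == 324, "18 x 18 tile with halo");
static_assert(pixels_loaded(256, 1, 1) == 774, "258 x 3 tile with halo");
static_assert(loads_per_output(16, 16, 1) < loads_per_output(256, 1, 1),
              "the square block loads less per output");
```

Under these assumptions the 16x16 block loads about 1.27 pixels per output, while the 256x1 block loads about 3.02, even though both use the same number of threads.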
So for your example, even though it might seem more effective to have a grid that is 317x1 and blocks that are 1x217, your code will likely be more efficient if you launch blocks that are 16x16 on a grid that is 20x14, as this leads to a better computation/memory ratio and better SM occupancy. This does mean, though, that you will have to check within the kernel that the thread is not outside the picture before trying to access memory, something like:
    const int thread_id_x = blockIdx.x * blockDim.x + threadIdx.x;
    const int thread_id_y = blockIdx.y * blockDim.y + threadIdx.y;
    if (thread_id_x < pic_width && thread_id_y < pic_height)
    {
        // Do stuff
    }
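Lastly, you can determine the lowest number of blocks you need in each grid dimension to completely cover your image with (N+M-1)/M, where N is the total number of threads you need in that dimension (for an image, its width or height in pixels) and M is the number of threads per block in that dimension. The division here is integer division, so the result is N/M rounded up.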
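As a quick check of that formula against the 317x217 example, here is a minimal host-side sketch in C++. The helper name `blocks_needed` is just something I made up for illustration.

```cpp
// Smallest number of blocks of `m` threads that covers `n` threads,
// i.e. n / m rounded up using integer arithmetic only.
// Assumes n >= 0 and m > 0.
constexpr int blocks_needed(int n, int m) {
    return (n + m - 1) / m;
}

// 317 x 217 image covered by 16 x 16 blocks.
static_assert(blocks_needed(317, 16) == 20, "grid width");
static_assert(blocks_needed(217, 16) == 14, "grid height");

// When n is an exact multiple of m, no extra block is added.
static_assert(blocks_needed(320, 16) == 20, "exact fit");
```

The resulting 20x14 grid launches 320x224 threads, so the threads past column 316 or row 216 are exactly the ones the bounds check in the kernel has to skip.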