Preferrably you want to have at least one full warp of threads in a block, otherwise you're making only poor use of the available processing power. Also you normally want to have a evenly divisible by the warp size number of threads in a block.
The toal number of threads to use in a block depends on your resource usage. In principle you want to aim for a large occupancy. The limits are set by available shared memory and registers. If you make use of a lot of shared memory and/or registers the maximum achievable occupancy drops. It then makes sense to profile and fine tune the number of threads per block until you find a sweet spot, where the ratio of achieved and theoretical occupancy maximizes, and of course also the total occupancy itself gets as close as possible to 100%.
As a rule of thumb you want to maximize the numbers of threads per block while keeping good occupancy. It makes totally sense in a profiling step to automatically iterate through the set of possible block/thread number combinations to find the extremal combination.