Your biggest question has already been answered in About warp and threads and How are CUDA threads divided into warps?, so I have focused this answer on the why.
The hardware always allocates threads to a block in multiples of the warp size. The warp size is implementation defined, and the number 32 is mainly related to shared memory organization, data access patterns, and data flow control [1].
So, a block size that is a multiple of 32 does not improve performance by itself, but it ensures that no allocated threads are wasted. Note that whether every thread does something useful still depends on what you do with the threads within the block.
A block size that is not a multiple of 32 is rounded up to the nearest multiple, even if you request fewer threads. The GPU Optimization Fundamentals presentation by Cliff Woolley from the NVIDIA Developer Technology Group has interesting hints about performance.
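As a quick check, the rounding is easy to compute from the device's reported warp size. The snippet below is a minimal sketch (the block size of 48 is just a hypothetical example) showing how many warps, and therefore hardware threads, such a block actually occupies:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("warpSize = %d\n", prop.warpSize);  // 32 on current NVIDIA GPUs

    // Hypothetical block size of 48 threads: the scheduler still allocates
    // ceil(48 / 32) = 2 full warps, so 64 hardware threads are reserved
    // and 16 lanes in the second warp sit idle.
    int blockSize = 48;
    int warpsPerBlock = (blockSize + prop.warpSize - 1) / prop.warpSize;
    printf("A block of %d threads occupies %d warps (%d hardware threads)\n",
           blockSize, warpsPerBlock, warpsPerBlock * prop.warpSize);
    return 0;
}
```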
In addition, memory operations and instructions are executed per warp, so you can understand the importance of this number. I think the reason why it is 32 and not 16 or 64 is undocumented, so I like to remember the warp size as "The Answer to the Ultimate Question of Life, the Universe, and Everything" [2].
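To illustrate the per-warp execution mentioned above, here is a minimal sketch (the kernel names are mine) contrasting a branch that diverges inside every warp with one that keeps each warp on a single path; in the divergent case the hardware serializes both branches:

```cuda
// Branching on the lane index splits every warp: both paths execute serially.
__global__ void divergent(float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 32 < 16)   // half of each warp takes each branch
        out[tid] = 1.0f;
    else
        out[tid] = 2.0f;
}

// Branching on the warp index keeps all 32 lanes of a warp together,
// so no serialization occurs.
__global__ void uniform(float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)   // whole warps choose one branch
        out[tid] = 1.0f;
    else
        out[tid] = 2.0f;
}
```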
[1] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Elsevier, 2010.
[2] Douglas Adams. The Hitchhiker's Guide to the Galaxy.