I read that it somehow reduces fragmentation, but I don't really understand why this is the case.
This is correct. And it works because it sets a minimum size of the allocations. Let's say that you add padding so that each block is at least 1kB. This means that there will never be a situation where you free a piece of memory and that an allocation made after that will not fit into that newly freed memory, as long as the new allocation is at most 1kB.
You could easily see this effect for yourself with some simple experiments on a piece of paper. First have a block size equal to your total memory. Of course there will be no fragmentation, but you will waste a lot of memory. If the block size is half the size it's basically the same scenario, but less waste. When we have a block size that is one third of the total memory we first encounter fragmentation. And that will happen when the middle block is allocated.
In short, padding reduces fragmentation but requires more memory.
Should I pad the entire block to a format of say 4 or 8 byte multiple or should I pad my block excluding the header and footer.
If you implement it in a clever way, then you will be able so change the padding size by simply changing a variable. So find a way to measure problems and adjust the padding to your needs.
Does this also help locality in my program and how does it help?
This is trickier. It can go both ways and depends on the programs using the heap. But my gut feeling says that padding would probably decrease locality. If this is the case, this can impact performance because of cache misses.