Alignment is a definition. Assume 8 bit bytes and byte addressable memory. An 8 bit byte (unsigned char) cannot be unaligned. A 16 bit halfword, to be aligned, must have the least significant address bit zero. A 32 bit word needs the lower two bits zero, a 64 bit doubleword the lower three bits zero, and so on. So if your 16 bit unsigned short is on an odd address then it is unaligned.
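That definition fits in one line of C. This is only a sketch: `is_aligned` is a name made up for illustration, and it assumes the access size is a power of two, which is asserted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* An access of `size` bytes (a power of two) is aligned when the
   low log2(size) bits of the address are all zero. */
bool is_aligned(uint32_t addr, uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0); /* power of two only */
    return (addr & (size - 1)) == 0;
}
```

So `is_aligned(0x1001, 1)` is true (a byte is always aligned), while `is_aligned(0x1001, 2)` is false because the halfword sits on an odd address.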
A "32 bit system" does not mean a 32 bit bus. Bus widths do not necessarily match the size of the processor registers or the instruction size or whatever, so there is no reason to make that assumption. Having said that, if you are talking MIPS or ARM then yes, the buses are most likely 32 or 64 bits wide for their 32 bit register processors, and 64 or perhaps 128 for the 64 bit processors, likely 64. But an x86 has 8 bit instructions that are variable length once you add up all the bytes one can possibly take, along with 8, 16, 32 and 64 bit registers, so there is no way to classify its size. Is it an 8 bit processor because of its 8 bit instructions, 32 or 64 bit because of its larger registers, or 128, 256, 512 etc because of its bus sizes?
You mentioned 32, let's stick with that. I want to walk through an array of bytes and I want to do writes. I have a 32 bit wide data bus, one of the typical designs you see today. Let's say the other side is a cache built of 32 bit wide srams to line up with the processor side bus; we won't worry about how the dram is implemented on the far side of that. So you will likely have a write data bus, a read data bus, and either separate write address and read address buses or one address bus with a way to indicate whether the transaction is a read or a write.
As far as the bus is concerned all transactions are 32 bits wide. You don't necessarily expect the unused byte lanes to float (high-z); you expect them to be driven high or low during valid clocks on that bus. Between valid clock cycles, sure, the bus may go high-z.
A read transaction will typically use, and let's assume it does use, an address aligned to the bus width, so a 32 bit aligned address (either on the bus itself or on the far side). There isn't usually a notion of byte lane enables on a read; the processor internally isolates the bytes of interest and discards the others. Some buses have a length field alongside the address where it makes sense, plus cache control signals and other signals.
An aligned 32 bit read would be at, say, address 0x1000 or 0x1004 with a length of 0 (the field holds n-1). The address bus does its handshake with a unique transaction id. Later, on the read data bus, ideally a single clock cycle will carry those 32 bits of data with that id. The processor sees that and completes the transaction (there might be more handshaking), extracts all 4 bytes, and does what the instruction said to do with them.
A 64 bit access aligned on a 32 bit boundary would have a length of one: one address bus handshake, two clock cycles worth of data on the read data bus. A 16 bit aligned transaction at 0x1000 or 0x1002 will, let's say, be a read of 0x1000, and the processor will discard either lanes 0 and 1 or lanes 2 and 3. Some bus designs align the bytes on the lower lanes, so you might see a bus where the two bytes always come back on lanes 0 and 1 for a 16 bit read.
An unaligned 32 bit read would take two bus transactions, twice the overhead, twice the number of clocks. A 32 bit read at 0x1002 is one read of 0x1000 where the processor saves two of the bytes, then a read of 0x1004 where the processor saves two of those bytes, combines them into the 32 bit number, and then does what the instruction says. So instead of 5 or 8 clocks or whatever the minimum is for this bus, it is now twice as many, and likely not interleaved but back to back.
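To make the "save some bytes from each read and combine them" step concrete, here is a small software model. Everything in it is an assumption for illustration: the names `bus_read32`, `read32` and `bus_reads`, the four words of made-up data standing in for the cache at 0x1000, and little endian byte order. Real hardware does this with byte lane muxes, not shifts in C.

```c
#include <assert.h>
#include <stdint.h>

#define BASE 0x1000u

/* Hypothetical contents of the far side of the bus, starting at 0x1000.
   Little endian: the byte at 0x1000 is 0x00, at 0x1001 is 0x11, etc. */
static const uint32_t sram[4] = {
    0x33221100u, 0x77665544u, 0xBBAA9988u, 0xFFEEDDCCu
};

unsigned bus_reads; /* counts bus read transactions */

/* One bus read transaction: always a full, aligned 32 bit word. */
uint32_t bus_read32(uint32_t addr)
{
    assert((addr & 3u) == 0);
    assert(addr >= BASE && addr - BASE < sizeof sram);
    bus_reads++;
    return sram[(addr - BASE) / 4];
}

/* 32 bit read at any byte address, built from aligned bus reads. */
uint32_t read32(uint32_t addr)
{
    uint32_t base = addr & ~3u;        /* aligned word holding the first byte */
    unsigned shift = (addr & 3u) * 8;  /* bit offset of that byte in the word */
    uint32_t lo = bus_read32(base);
    if (shift == 0)
        return lo;                     /* aligned: one transaction, done */
    uint32_t hi = bus_read32(base + 4);           /* second transaction */
    return (lo >> shift) | (hi << (32 - shift));  /* keep the wanted bytes */
}
```

With this data, `read32(0x1002)` costs two calls to `bus_read32` (0x1000 and 0x1004) and returns 0x55443322, while `read32(0x1004)` costs one and returns 0x77665544.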
An unaligned 16 bit read at address 0x1001 would hopefully be a single 32 bit read, but an unaligned 16 bit read at address 0x1003 is two transactions, so twice the clocks and twice the overhead: one at 0x1000 and one at 0x1004, saving one byte from each.
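The rule behind all of these cases is just "how many 32 bit bus words does the access touch". A sketch, assuming the 32 bit bus from this example and accesses of 1, 2 or 4 bytes only (`bus_transactions` is a made-up name; the 64 bit case is left out because above it goes out as one transaction with a length field, which this simple count does not model):

```c
#include <assert.h>
#include <stdint.h>

/* Number of 32 bit bus words, and so bus transactions in this example,
   needed for an access of `size` bytes starting at byte address `addr`. */
unsigned bus_transactions(uint32_t addr, uint32_t size)
{
    assert(size == 1 || size == 2 || size == 4);
    uint32_t first = addr / 4;              /* word holding the first byte */
    uint32_t last = (addr + size - 1) / 4;  /* word holding the last byte */
    return (unsigned)(last - first + 1);
}
```

So `bus_transactions(0x1001, 2)` is 1 because both bytes live in the word at 0x1000, while `bus_transactions(0x1003, 2)` is 2 because the access straddles 0x1000 and 0x1004.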
Writes are the same but with an additional penalty. An aligned 32 bit write, say at 0x1000, is one bus transaction: address, write data, done. The cache, being 32 bits wide in this example, could simply write those 32 bits to sram in one sram transaction. An unaligned 32 bit write, say at 0x1001, would be two complete bus transactions as expected, taking twice the number of bus clocks, but the sram will take two or more times the clocks as well, because you need to read-modify-write the sram, you can't just write. In order to write the 0x1001 to 0x1003 bytes you need to read 32 bits from sram, change three of those bytes without changing the lower one, and write that back. Then when the other transaction comes in you write the 0x1004 byte while preserving the other three in that sram location.
Every byte write is a single bus transaction, but every one also incurs the read-modify-write. Note that depending on how many clocks the bus takes and how many transactions you can have in flight at a time, the read-modify-write of the sram might be invisible: you might not be able to get data to the cache fast enough for a bus transaction to have to wait on the sram read-modify-write. But in another similar question, since this has been asked so many times here, there is a platform where this was demonstrated.
So you can now tell me how the 16 bit write transactions are going to go. They also incur the read-modify-write at the cache for every one of them, and if the address is, say, 0x1003 then you get two bus transactions and two read-modify-writes.
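If you want to check your answer, here is a rough model of the write side. It is a sketch under assumptions, not how any particular cache works: the sram is 32 bits wide with no byte write enables (as in this example), byte order is little endian, the bus carries a bit mask saying which bits to change, and all names (`bus_write`, `write_bytes`, `bus_writes`, `sram_rmw`) are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BASE 0x1000u

uint32_t sram[4];     /* 32 bit wide sram behind the bus, starting at 0x1000 */
unsigned bus_writes;  /* bus write transactions seen */
unsigned sram_rmw;    /* of those, how many needed a read-modify-write */

/* One bus write transaction: aligned word address, data, mask of bits to change. */
void bus_write(uint32_t word_addr, uint32_t data, uint32_t mask)
{
    uint32_t i = (word_addr - BASE) / 4;
    assert((word_addr & 3u) == 0 && i < 4);
    bus_writes++;
    if (mask == 0xFFFFFFFFu) {
        sram[i] = data;                           /* full word: just write */
    } else {
        uint32_t old = sram[i];                   /* read */
        sram[i] = (old & ~mask) | (data & mask);  /* modify, write */
        sram_rmw++;
    }
}

/* Write the low `size` bytes (1, 2 or 4) of `value` at any byte address. */
void write_bytes(uint32_t addr, uint32_t value, unsigned size)
{
    assert(size == 1 || size == 2 || size == 4);
    uint32_t vmask = (size == 4) ? 0xFFFFFFFFu : ((1u << (size * 8)) - 1u);
    unsigned shift = (addr & 3u) * 8;
    uint32_t base = addr & ~3u;
    value &= vmask;
    bus_write(base, value << shift, vmask << shift);
    /* Anything that spilled past the top of the first word goes to the next one. */
    if (shift != 0 && (vmask >> (32 - shift)) != 0)
        bus_write(base + 4, value >> (32 - shift), vmask >> (32 - shift));
}
```

An aligned `write_bytes(0x1000, v, 4)` gives one bus write and no read-modify-write. `write_bytes(0x1001, v, 4)` and `write_bytes(0x1003, v, 2)` each give two bus writes and two read-modify-writes, and any single byte write gives one of each.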
One of the beauties of the cache though is that even though drams come in 8, 16, 32 bit parts, the transactions are done in 32 or 64 bit widths, which is kind of the point of a cache. (Count how many chips are on a dram stick: often 8 or 9, 4 or 5, 2 or 3, or some multiple of those. 8 chips is likely a 64 bit wide bus at 8 bits per part, 16 chips is likely 64 bits wide at 8 bits per part, dual rank, and so on.) If we had to do a read-modify-write at the dram's slow speeds that would be horrible. Instead we read-modify-write at the cache/sram speed, and then all transactions, cache line evictions and fills, are at multiples of the dram bus width, so 64 or 2x64 or 4x64 etc per cache line.