Aligned loads and stores are faster; two excerpts from the Intel Optimization Manual point this out clearly:
3.6 OPTIMIZING MEMORY ACCESSES
Align data, paying attention to data layout and stack alignment
...
Alignment and forwarding problems are among the most common sources of
large delays on processors based on Intel NetBurst microarchitecture.
AND
3.6.4 Alignment
Alignment of data concerns all kinds of variables:
• Dynamically allocated variables
• Members of a data structure
• Global or local variables
• Parameters passed on the stack
Misaligned data access can incur significant performance penalties. This is
particularly true for cache line splits.
Following that part in 3.6.4, there is a nice rule for compiler developers:
Assembly/Compiler Coding Rule 45. (H impact, H generality) Align data on
natural operand size address boundaries. If the data will be accessed with vector
instruction loads and stores, align the data on 16-byte boundaries.
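As a rough illustration of Rule 45, here is a minimal C11 sketch of getting 16-byte-aligned storage both statically and on the heap. The helper names `is_aligned_16` and `alloc_floats_16` are made up for this example and do not come from the manual. It also assumes a toolchain that provides C11 `aligned_alloc`, which some (notably MSVC) do not.

```c
#include <assert.h>    /* static_assert */
#include <stdalign.h>  /* alignas, alignof */
#include <stdint.h>    /* uintptr_t */
#include <stdlib.h>    /* aligned_alloc, free */

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary. */
static int is_aligned_16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}

/* Static storage: alignas forces the array onto a 16-byte boundary,
   so it is suitable for aligned 128-bit vector loads and stores. */
static alignas(16) float static_buf[8];

/* A type whose every instance is 16-byte aligned. */
struct vec4 {
    alignas(16) float v[4];
};
static_assert(alignof(struct vec4) == 16, "vec4 should be 16-byte aligned");

/* Hypothetical helper: heap buffer for `count` floats on a 16-byte
   boundary. aligned_alloc expects the size to be a multiple of the
   alignment, so the byte count is rounded up to the next multiple of 16.
   The caller releases the buffer with free(). */
static float *alloc_floats_16(size_t count) {
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + 15) & ~(size_t)15;
    return aligned_alloc(16, rounded);
}
```

A caller would typically check the result of `alloc_floats_16` for `NULL` and could verify the address with `is_aligned_16` in a debug build before issuing aligned vector loads.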
That rule is followed by a listing of alignment rules, and there is another gem in 3.6.6:
User/Source Coding Rule 6. (H impact, M generality) Pad data
structures defined in the source code so that every data element is
aligned to a natural operand size address boundary.
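To make Rule 6 concrete, the sketch below compares a byte-packed structure with a manually padded one. The struct names and the helper `offset_is_natural` are invented for illustration. It assumes an 8-byte `double` and a compiler that honors `#pragma pack` (GCC, Clang, and MSVC do). Without the pragma, most compilers insert this padding on their own; explicit padding matters mainly when the layout is pinned, for example by a file or wire format.

```c
#include <assert.h>  /* static_assert */
#include <stddef.h>  /* offsetof, size_t */
#include <stdint.h>  /* uint8_t */

/* Hypothetical helper: nonzero if a member at `offset` with operand
   size `size` starts on a natural boundary within its struct. */
static int offset_is_natural(size_t offset, size_t size) {
    return offset % size == 0;
}

#pragma pack(push, 1)  /* disable compiler-inserted padding */

/* No padding: `value` starts at offset 1, so every access to it is
   misaligned and may straddle a cache line. */
struct record_packed {
    uint8_t tag;
    double  value;
};

/* Explicit padding: 7 filler bytes push `value` to offset 8, a
   natural boundary for an 8-byte operand. */
struct record_padded {
    uint8_t tag;
    uint8_t pad[7];
    double  value;
};

#pragma pack(pop)

static_assert(offsetof(struct record_packed, value) == 1,
              "packed layout leaves value misaligned");
static_assert(offsetof(struct record_padded, value) == 8,
              "padded layout puts value on an 8-byte boundary");
```

Note that padding only fixes the offset inside the structure. The structure itself still has to start on a suitably aligned address for `value` to end up naturally aligned in memory.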
Both rules are marked as high impact, meaning they can greatly change performance. Beyond these excerpts, the rest of Section 3.6 is filled with other reasons to naturally align your data. It's well worth any developer's time to read these manuals, if only to understand the hardware they are working on.