We should take a step back and look at how CPUs work. They usually have separate caches: an instruction cache, which holds the code the CPU is about to execute, and a data cache, which holds the data those instructions operate on.
Data cache misses are "easy" to solve: use the smallest data structures you can, keep the members you access most frequently close together, and so on.
Instruction cache misses are harder to understand and to fix, and they are part of the reason polymorphic calls in C++ are commonly considered slower than ordinary function calls: a virtual call jumps through the vtable to code the CPU may not have prefetched. Basically, the CPU prefetches into its cache the instructions stored near the current execution point. The opposite extreme has its own cost: if everything is inlined, there is simply more code, the CPU can't prefetch all of it, and you get a cache miss. Please note this is a simplified picture. In my experience, I have had problems with template instantiations that generated so much code that performance ended up worse than with simple virtual calls and a not-too-deep object hierarchy.
As Alexandrescu always points out, you should always time your code.
Source:
What Every Programmer Should Know About Memory