The most valuable tool when hunting performance bottlenecks is measurement. You need to figure out what code has the problem and then measure it for cache misses, if that indeed proves to be the problem.
As for general ideas, you will need to lower the miss rate. So when you pull data into memory, you need to work as much as possible on it before you leave it again, rather than stream data. Compare as an example,
for i in data:
f(i)
for i in data:
g(i)
for i in data:
h(i)
which traverses the list three times. It may be possible to write this as:
for i in data:
h(g(f(i)))
lowering the traverse to only a single time - usually leading to fewer misses.
Another worthy trick is to think about the data structure. The access patterns of a binary tree are much different from those of a hash table. But establish measurement first so you can be sure you got the misses nailed - and that it is the misses that is your problem.
Finally, even with low miss rates, you can look into lowering memory bandwidth in general. If you move lots and lots of data, it tend to be slow - since memory speeds grow at a much lower rate compared to transistor count.