My approach would be to read the reference manual ( http://www.arm.com/miscPDFs/5499.pdf ), which should cover everything you need. It will tell you whether there is a floating-point unit, what limitations the FPU has, what you have to keep in mind when using DMA, the caches and the memory layout, as well as the memory bus speeds and a lot of other things that are crucial if you want to program this device correctly and efficiently.
Unfortunately, I have never worked with this specific device, so I cannot point at anything specific, but you will surely find all you need in the reference manual. Once you know the hardware, you can analyse the performance impact of specific parts of the algorithm. But you have to know the hardware's internals.
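
To make the "analyse specific parts" idea a bit more concrete, here is a rough sketch of how one might compare a floating-point and a fixed-point version of the same inner loop. Everything in it is my own assumption and not taken from the manual: the names (q16_t, dot_float, dot_q16, seconds_for), the Q16.16 format, and the use of the standard C clock() function. On a bare-metal target clock() may not be available or may be too coarse, in which case you would swap in whatever cycle counter or timer the reference manual documents.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical Q16.16 fixed-point type: real value = raw / 65536. */
typedef int32_t q16_t;

static inline q16_t q16_from_double(double x) { return (q16_t)(x * 65536.0); }
static inline double q16_to_double(q16_t x) { return (double)x / 65536.0; }

/* Dot product using the FPU (or soft-float if there is no FPU). */
float dot_float(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Same dot product using integer arithmetic only.
 * Products are accumulated in 64 bits and scaled back to Q16.16 once
 * at the end; the final result must still fit into 32 bits. */
q16_t dot_q16(const q16_t *a, const q16_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int64_t)a[i] * b[i];
    return (q16_t)(acc / 65536);
}

/* Run a kernel `repeats` times and return the elapsed CPU time in
 * seconds as reported by clock(). Assumption: clock() exists and has
 * usable resolution on the target. */
double seconds_for(void (*kernel)(void), int repeats)
{
    clock_t start = clock();
    for (int i = 0; i < repeats; i++)
        kernel();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

You would wrap each variant in a small void function that works on buffers of realistic size, pass it to seconds_for, and compare the two numbers. Which version wins depends on exactly the things the manual tells you (FPU present or not, cache behaviour, bus speed), so treat the result as a measurement of your setup, not as a general rule.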