I am trying to write a demo for an embedded processor, which is a multicore architecture and is very fast in floating point calculations. The problem is that the current hardware I have is the processor connected through an evaluation board where the DRAM to chip rate is somewhat limited, and the board to PC rate is very slow and inefficient.
Thus, when demonstrating big matrix multiplication, I can do, say, 128x128 matrices in a couple of milliseconds, but the I/O takes (lots of) seconds kills the demo.
So, I am looking for some kind of a calculation with higher complexity than n^3, the more the better (but preferably easy to program and to explain/understand) to make the computation part more dominant in the time budget, where the dataset is preferably bound to about 16KB per thread (core).
Any suggestion?
PS: I think it is very similar to this question in its essence.