Intel Tip: If you can’t break a memory roof, try to rework your algorithm for higher arithmetic intensity. This will move you to the right and give you more room to increase performance before hitting the memory bandwidth roof.
For algorithms in the memory-bound region of a roofline plot, Intel suggests increasing the arithmetic intensity so that they move to the right (compute-bound region) hence providing room to improve the performance, since the performance roof would be higher.
I'm unable to understand how increasing the arithmetic intensity (say, increasing the no. of operations in the algorithm) can possibly improve a performance metric like the wall-clock time taken for the algorithm to run. Wouldn't you need to do more no. of computations even for a higher performance (in FLOPS)? Could someone explain how this is possible?