
For a university lecture I am looking for floating-point algorithms with known asymptotic runtime, but potential for low-level (micro-)optimization. This means optimizations such as minimizing cache misses and register spills, maximizing instruction-level parallelism, and taking advantage of SIMD (vector) instructions on new CPUs. The optimizations are going to be CPU-specific and will make use of applicable instruction set extensions.

The classic textbook example for this is matrix multiplication, where great speedups can be achieved by simply reordering the sequence of memory accesses (among other tricks). Another example is FFT. Unfortunately, I am not allowed to choose either of these.

Anyone have any ideas, or an algorithm/method that could use a boost?

I am only interested in algorithms where a per-thread speedup is conceivable. Parallelizing problems by multi-threading them is fine, but not the scope of this lecture.

Edit 1: I am taking the course, not teaching it. In the past years, there were quite a few projects that succeeded in surpassing the current best implementations in terms of performance.

Edit 2: This paper lists (from page 11 onwards) seven classes of important numerical methods and some associated algorithms that use them. At least some of the mentioned algorithms are candidates; it is, however, difficult to see which.


Edit 3: Thank you everyone for your great suggestions! We proposed to implement the exposure fusion algorithm (paper from 2007) and our proposal was accepted. The algorithm creates HDR-like images and consists mainly of small-kernel convolutions followed by weighted multiresolution blending (on the Laplacian pyramid) of the source images. What makes it especially interesting for us is that the algorithm is already implemented in the widely used Enfuse tool, now at version 4.1, so we will be able to validate and compare our results against the original and potentially contribute to the development of the tool itself. I will update this post with the results if I can.

jmiserez
  • Eliminating matrix multiplication and FFT makes this really difficult. Can you consider framing another problem in terms of one of those as an optimization? – Patricia Shanahan Mar 02 '14 at 15:43
  • Is it specifically the FFT that's ruled out? Fast Walsh Transforms have essentially the same structure. The raw matrix form of the transform can be decomposed into products of log(N) sparse matrices to yield the speedup. – pjs Mar 02 '14 at 17:00
  • Forgive me. I'm sure that's what you are committed to presenting, but when I look at on-line courseware, I see lots of that kind of material. I was also a professor once, and my colleagues and I had never seen any very realistic software. It is really great how hardware engineers and compiler designers try to maximize throughput etc. If you're teaching programmers, they are on the other end of the pipeline, and practically nobody is teaching them how to minimize instructions to be executed. As an example, [*I try to*](http://stackoverflow.com/a/927773/23771). – Mike Dunlavey Mar 02 '14 at 18:47
  • @MikeDunlavey I have clarified the question in that I am taking the course and not teaching it, maybe I should have been clearer about that. As for the material: I am confident the Prof. supervising the project is capable, as this is his main research area and he has been teaching the course for a few years. – jmiserez Mar 02 '14 at 19:18
  • @pjs The main reason FFT is ruled out is that it will be used extensively as an example in the lectures. Therefore, unfortunately, FWHT is ruled out too (along with many other similar transforms). – jmiserez Mar 02 '14 at 19:22
  • Yes. Professors are held in high esteem by their students. They have to be, in order for education to work. Get some exposure to real-world software and your perspective will broaden, guaranteed. Performance is like an iceberg. Instruction-level parallelism is the part you see. Bloated design is the part you will see later. [*More on that.*](http://scicomp.stackexchange.com/a/2719/1262) – Mike Dunlavey Mar 02 '14 at 19:49
  • So you're giving a single lecture in the course or you're looking for a term project or something? – tmyklebu Mar 02 '14 at 22:05
  • @tmyklebu It is a term project I have to do, and the first step is finding a suitable algorithm. The rest of the project will be optimizing the chosen algorithm. – jmiserez Mar 03 '14 at 00:21
  • @jmiserez: I'd do the small-window image convolutions thing. – tmyklebu Mar 03 '14 at 02:38
  • take a look at NAS parallel benchmarks. It has a suite of floating point applications. – arunmoezhi Mar 03 '14 at 06:40
  • @MikeDunlavey Thanks for the links. One had already come up on previous searches of Stackoverflow, and I will certainly use some of the tips mentioned. – jmiserez Mar 04 '14 at 17:35

3 Answers


The simplest possible example:

  • accumulation of a sum. Unrolling using multiple accumulators and vectorization allow a speedup of (ADD latency)*(SIMD vector width) on typical pipelined architectures (if the data is in cache; because there's no data reuse, it typically won't help if you're reading from memory), which can easily be an order of magnitude. Cute thing to note: this also decreases the average error of the result! The same techniques apply to any similar reduction operation (see the sketch below).
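
A minimal sketch of the idea in plain C, assuming the data is already in cache (the function names are illustrative, not from any particular library):

```c
#include <stddef.h>

/* Naive version: every add waits on the previous one, so throughput
 * is limited by the ADD latency. */
float sum_naive(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators break the serial dependency chain;
 * a vectorizing compiler can additionally map each chain onto SIMD
 * lanes. */
float sum_unrolled(const float *a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* remainder */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```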

A few classics from image/signal processing:

  • convolution with small kernels (especially small 2D convolutions, like a 3x3 or 5x5 kernel). In some sense this is cheating, because convolution is matrix multiplication and is intimately related to the FFT, but in practice the nitty-gritty algorithmic techniques of high-performance small-kernel convolutions are quite different from either (a baseline sketch follows this list).

  • erode and dilate.

  • what image people call a "gamma correction"; this is really evaluation of a power function (maybe with a piecewise linear segment near zero). Here you can take advantage of the fact that image data is often entirely in a nice bounded range like [0,1], and that sub-ulp accuracy is rarely needed, to use much cheaper function approximations (low-order piecewise minimax polynomials are common).
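
As a starting point for the small-kernel convolution item, here is a hedged baseline in plain C (all names are illustrative, not from any particular library). The tuning work, such as keeping the nine coefficients in registers, register-blocking rows, and vectorizing across x, starts from a loop nest like this:

```c
#include <stddef.h>

/* Straightforward 3x3 convolution over a single-channel float image,
 * skipping the one-pixel border. Row-major layout: pixel (x, y) is
 * src[y * w + x]. */
void conv3x3(const float *src, float *dst,
             size_t w, size_t h, const float k[9]) {
    for (size_t y = 1; y + 1 < h; y++) {
        for (size_t x = 1; x + 1 < w; x++) {
            float acc = 0.0f;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    acc += k[(dy + 1) * 3 + (dx + 1)]
                         * src[(y + dy) * w + (x + dx)];
            dst[y * w + x] = acc;
        }
    }
}
```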

Stephen Canon

Stephen Canon's image processing examples would each make for instructive projects. Taking a different tack, though, you might look at certain amenable geometry problems:

  • Closest pair of points in moderately high dimension, say 50000 or so points in 16 or so dimensions. This may have too much in common with matrix multiplication for your purposes. (Take the dimension much higher and dimensionality reduction silliness starts mattering; much lower and spatial data structures dominate. Brute force, or something simple using a brute-force kernel, is what I would want to use for this; a sketch of such a kernel follows this list.)
  • Variation: For each point, find the closest neighbour.
  • Variation: Red points and blue points; find the closest red point to each blue point.
  • Welzl's smallest containing circle algorithm is fairly straightforward to implement, and the really costly step (check for points outside the current circle) is amenable to vectorisation. (I suspect you can kill it in two dimensions with just a little effort.)
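
A minimal sketch of the brute-force kernel mentioned in the first item, assuming points are stored row-major as an n x d float array (all names are illustrative). The doubly nested distance loop is the part worth tuning, and blocking it for cache is where the resemblance to matrix multiplication comes from:

```c
#include <float.h>
#include <stddef.h>

/* Squared Euclidean distance between two d-dimensional points. */
static float dist2(const float *p, const float *q, size_t d) {
    float s = 0.0f;
    for (size_t k = 0; k < d; k++) {
        float t = p[k] - q[k];
        s += t * t;
    }
    return s;
}

/* Brute-force closest pair over n points (assumes n >= 2); writes
 * the winning indices to *bi and *bj and returns the squared
 * distance. */
float closest_pair(const float *pts, size_t n, size_t d,
                   size_t *bi, size_t *bj) {
    float best = FLT_MAX;
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++) {
            float d2 = dist2(pts + i * d, pts + j * d, d);
            if (d2 < best) { best = d2; *bi = i; *bj = j; }
        }
    return best;
}
```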

Be warned that computational geometry stuff is usually more annoying to implement than it looks at first; don't just grab a random paper without understanding what degenerate cases exist and how careful your programming needs to be.

Have a look at other linear algebra problems, too. They're also hugely important. Dense Cholesky factorisation is a natural thing to look at here (much more so than LU factorisation) since you don't need to mess around with pivoting to make it work.

tmyklebu

There is a free benchmark called c-ray. It is a small ray-tracer for spheres designed to be a benchmark for floating-point performance.

A few random stackshots show that it spends nearly all its time in a function called ray_sphere that determines if a ray intersects a sphere and if so, where.
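
For context, a hedged sketch of what a ray_sphere-style test has to compute; this is illustrative and not the actual c-ray source. Substituting the ray o + t*dir into the sphere equation |p - center|^2 = r^2 yields a quadratic in t whose discriminant decides hit or miss:

```c
#include <math.h>

typedef struct { double x, y, z; } vec3;

static double dot(vec3 a, vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static vec3 sub(vec3 a, vec3 b) {
    vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

/* Returns 1 and the nearest positive ray parameter *t on a hit,
 * 0 on a miss. dir is assumed normalized, so the quadratic's
 * leading coefficient is 1. */
int ray_sphere_hit(vec3 o, vec3 dir, vec3 center, double r, double *t) {
    vec3 oc = sub(o, center);
    double b = 2.0 * dot(dir, oc);
    double c = dot(oc, oc) - r * r;
    double disc = b * b - 4.0 * c;
    if (disc < 0.0)
        return 0;                    /* ray misses the sphere */
    double sq = sqrt(disc);
    double t0 = (-b - sq) * 0.5;     /* nearer root */
    double t1 = (-b + sq) * 0.5;     /* farther root */
    *t = (t0 > 0.0) ? t0 : t1;
    return *t > 0.0;
}
```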

They also show some opportunities for larger speedup, such as:

  • It does a linear search through all the spheres in the scene to find the nearest intersection. That represents a possible area for speedup: do a quick test to see whether a sphere is farther away than the best intersection seen so far before doing all the 3-d geometry math (a sketch of such a test follows below).

  • It does not try to exploit similarity from one pixel to the next. This could gain a huge speedup.
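
A sketch of that quick test, reusing the vec3 helpers and ray_sphere_hit from the sketch above (again illustrative, not the actual c-ray source). With a normalized direction, any hit on a sphere satisfies t >= |origin - center| - radius by the triangle inequality, so that cheap bound can rule a sphere out before the full quadratic is solved:

```c
#include <float.h>

static double len3(vec3 v) { return sqrt(dot(v, v)); }

/* Returns the index of the nearest intersected sphere (or -1) and
 * writes its ray parameter to *best_t. */
int nearest_hit(vec3 o, vec3 dir, const vec3 *centers,
                const double *radii, int nspheres, double *best_t) {
    int best = -1;
    *best_t = DBL_MAX;
    for (int i = 0; i < nspheres; i++) {
        /* Conservative lower bound on any hit distance: skip the
         * sphere if it cannot beat the best intersection so far. */
        if (len3(sub(o, centers[i])) - radii[i] > *best_t)
            continue;
        double t;
        if (ray_sphere_hit(o, dir, centers[i], radii[i], &t) && t < *best_t) {
            *best_t = t;
            best = i;
        }
    }
    return best;
}
```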

So if all you want to look at is chip-level performance, it could be a decent example. However, it also shows how there can be much bigger opportunities.

Mike Dunlavey