
I am writing a program in C++ to reconstruct a 3D object from a set of projected 2D images, the most computation-intensive part of which involves magnifying and shifting each image via bilinear interpolation. I currently have a pair of functions for this task; "blnSetup" defines a handful of parameters outside the loop, then "bilinear" applies the interpolation point-by-point within the loop:

(NOTE: 'I' is a 1D array containing ordered rows of image data)

//Pre-definition structure (in header)
struct blnData{
    float* X;
    float* Y;
    int* I;
    float X0;
    float Y0;
    float delX;
    float delY;
};

//Pre-definition function (outside the FOR loop)
extern inline blnData blnSetup(float* X, float* Y, int* I)
{
    blnData bln;
    //Create pointers to X, Y, and I vectors
    bln.X = X;
    bln.Y = Y;
    bln.I = I;

    //Store offset and step values for X and Y
    bln.X0 = X[0];
    bln.delX = X[1] - X[0];
    bln.Y0 = Y[0];
    bln.delY = Y[1] - Y[0];

    return bln;
}

//Main interpolation function (inside the FOR loop)
extern inline float bilinear(float x, float y, blnData bln)
{
    float Ixy;

    //Return 0 if the target point is outside the image matrix
    if (x < bln.X[0] || x > bln.X[-1] || y < bln.Y[0] || y > bln.Y[-1])
        Ixy = 0;
    //Otherwise, apply bilinear interpolation
    else
    {
        //Define known image width
        int W = 200;

        //Find nearest indices for interpolation
        int i = floor((x - bln.X0) / bln.delX);
        int j = floor((y - bln.Y0) / bln.delY);

        //Interpolate I at (xi, yj)
        Ixy = 1 / ((bln.X[i + 1] - bln.X[i])*(bln.Y[j + 1] - bln.Y[j])) *
            (
            bln.I[W*j + i] * (bln.X[i + 1] - x) * (bln.Y[j + 1] - y) +
            bln.I[W*j + i + 1] * (x - bln.X[i]) * (bln.Y[j + 1] - y) +
            bln.I[W*(j + 1) + i] * (bln.X[i + 1] - x) * (y - bln.Y[j]) +
            bln.I[W*(j + 1) + i + 1] * (x - bln.X[i]) * (y - bln.Y[j])
            );
    }

    return Ixy;
}

EDIT: The function calls are below. 'flat.imgdata' is a std::vector containing the input image data and 'proj.imgdata' is a std::vector containing the transformed image.

int Xs = flat.dim[0];
int Ys = flat.dim[1];

int* Iarr = flat.imgdata.data();
float II, x, y;

blnData bln = blnSetup(X, Y, Iarr);

for (int j = 0; j < flat.imgdata.size(); j++)
{
    x = 1.2*X[j % Xs];
    y = 1.2*Y[j / Xs];
    II = bilinear(x, y, bln);
    proj.imgdata[j] = (int)II;
}

Since I started optimizing, I have been able to reduce computation time by ~50x (!) by switching from std::vectors to C arrays within the interpolation function, and another 2x or so by cleaning up redundant computations, typecasts, etc. But assuming O(n) with n being the total number of processed pixels, the full reconstruction (~7e10 pixels) should still take around 40 minutes, about an order of magnitude longer than my goal of under 5 minutes.

According to Visual Studio's performance profiler, the interpolation call ("II = bilinear(x, y, bln);") unsurprisingly still accounts for the majority of my computation load. I haven't been able to find any linear-algebraic methods for fast multiple interpolation, so my question is: is this basically as fast as my code will get, short of applying more or faster CPUs to the task? Or is there a different approach that might speed things up?

P.S. I've also only been coding in C++ for about a month now, so feel free to point out any beginner mistakes I might be making.

KPM
  • Can you show the code where you call `blnSetup` and `bilinear`? – 1201ProgramAlarm Aug 31 '17 at 17:09
  • Change the calls in bilinear to references to avoid stacking and unstacking arguments. Use `const float &x`, for example. (A sketch of this change follows these comments.) – stark Aug 31 '17 at 17:19
  • 1
    `bln.X[-1]`? Seriously? – stark Aug 31 '17 at 17:21
  • Like I said, I recently started using C++ (from MATLAB); what would be a better way to check against the last element of the array? – KPM Aug 31 '17 at 17:34
  • Note every output pixel in a row is going to separately calculate the same Y weight factor, and each column calculates the same X weight factor. I'd start by calculating the weight factors first in a separate structure. – kalhartt Aug 31 '17 at 17:41
  • `-1` accesses the memory before the beginning of the array, not the last element. Arrays do not know their size. A vector pre-allocated to the same size will perform just as fast as an array. https://stackoverflow.com/questions/381621/using-arrays-or-stdvectors-in-c-whats-the-performance-gap – stark Aug 31 '17 at 17:50
  • @stark That makes sense; I had some strange behavior for certain magnifications, and it looks like that was the cause. Thanks! – KPM Aug 31 '17 at 17:54
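
A minimal sketch of the pass-by-reference suggestion above, assuming the struct is what benefits most (the scalar floats are already cheap to copy). The signature below is a hypothetical adaptation, not code from the question; the call site `II = bilinear(x, y, bln);` stays the same:

//Hypothetical change: take blnData by const reference so the struct is
//not copied on every call
extern inline float bilinear(float x, float y, const blnData& bln);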

2 Answers


I wrote up a long answer suggesting looking at OpenCV (opencv.org), or using Halide (http://halide-lang.org/), and getting into how image warping is optimized, but I think a shorter answer might serve better. If you are really just scaling and translating entire images, OpenCV has code to do that and we have an example for resizing in Halide as well (https://github.com/halide/Halide/blob/master/apps/resize/resize.cpp).

If you really have an algorithm that needs to index an image using floating-point coordinates which result from a computation that cannot be turned into a moderately simple function on integer coordinates, then you really want to be using filtered texture sampling on a GPU. Most techniques for optimizing on the CPU rely on exploiting some regular pattern of access in the algorithm and removing float-to-integer conversion from the addressing. (For resizing, one uses two integer variables, one which indexes the pixel coordinate of the image and another which holds the fractional part of the coordinate and indexes a kernel of weights.) If this is not possible, the speedups are somewhat limited on CPUs. OpenCV does provide fairly general remapping support, but it likely isn't all that fast.
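
To make the integer/fractional addressing pattern concrete, here is a minimal 1D resizing sketch assuming 8-bit fixed-point weights; the function name, the 24.8 precision, and the clamping at the right edge are illustrative choices, not code from either library mentioned above:

#include <cstdint>

//Resample one row by a constant scale using a 24.8 fixed-point source
//position, so the loop contains no float-to-integer conversion.
//Assumes pixel values small enough that the products fit in an int.
void resizeRow(const int* src, int srcW, int* dst, int dstW, float scale)
{
    int32_t step = (int32_t)(scale * 256.0f); //source step per output pixel (24.8)
    int32_t pos = 0;                          //current source position (24.8)

    for (int xo = 0; xo < dstW; ++xo, pos += step)
    {
        int xi = pos >> 8;   //integer pixel index
        int f = pos & 0xFF;  //fractional part, used as the blend weight

        if (xi + 1 >= srcW)  //clamp at the right edge
        {
            dst[xo] = src[srcW - 1];
            continue;
        }

        //Linear blend with integer weights f and (256 - f)
        dst[xo] = (src[xi] * (256 - f) + src[xi + 1] * f) >> 8;
    }
}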

Two optimizations that may be applicable here are moving the boundary condition out of the loop and using a two-pass approach in which the horizontal and vertical dimensions are processed separately. The latter may or may not win, and will require tiling the data to fit in cache if the images are very large. Tiling in general is pretty important for large images, but it isn't clear it is the first-order performance problem here, and depending on the values in the inputs, the cache behavior may not be regular enough anyway.
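
To illustrate moving the boundary handling and the per-axis work out of the per-pixel loop (the same redundancy kalhartt points out in the comments), here is a sketch that precomputes, per output row and per output column, the source index and the fractional weight. The names AxisMap and buildAxisMap are assumptions for this example, and a full two-pass version would additionally resample into a horizontally-scaled intermediate image before the vertical pass:

#include <cmath>
#include <vector>

//Precomputed per-axis lookup: for each output coordinate, the lower
//source index and the fractional weight toward the next sample.
struct AxisMap { std::vector<int> idx; std::vector<float> w; };

AxisMap buildAxisMap(const float* src, int srcN, const float* dstCoord, int dstN)
{
    AxisMap m;
    m.idx.resize(dstN);
    m.w.resize(dstN);
    const float x0 = src[0];
    const float dx = src[1] - src[0];
    for (int k = 0; k < dstN; ++k)
    {
        float t = (dstCoord[k] - x0) / dx;
        int i = (int)std::floor(t);
        if (i < 0) i = 0;                 //boundary handled here, once per
        if (i > srcN - 2) i = srcN - 2;   //row/column, not once per pixel
        float f = t - (float)i;
        if (f < 0.0f) f = 0.0f;
        if (f > 1.0f) f = 1.0f;
        m.idx[k] = i;
        m.w[k] = f;
    }
    return m;
}

//The per-pixel loop then only combines precomputed terms, e.g.:
//  int i = xmap.idx[col], j = ymap.idx[row];
//  float wx = xmap.w[col], wy = ymap.w[row];
//  out = I[W*j + i]         * (1 - wx) * (1 - wy)
//      + I[W*j + i + 1]     * wx       * (1 - wy)
//      + I[W*(j+1) + i]     * (1 - wx) * wy
//      + I[W*(j+1) + i + 1] * wx       * wy;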

Zalman Stern

"vector 50x slower than array". That's a dead giveaway you're in debug mode, where vector::operator[] is not inlined. You will probably get the necessary speedup, and a lot more, simply by switching to release mode.

As a bonus, vector has a .back() method, so you have a proper replacement for that [-1]. A pointer to the beginning of an array doesn't carry the array's size, so you can't find the back of an array that way.
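
For example, the bounds check from the question could be written against std::vector<float> versions of X and Y (hypothetical names Xvec and Yvec here), using front()/back() instead of [0] and the invalid [-1]:

#include <vector>

//Hypothetical bounds check using front()/back() on the coordinate vectors
bool insideImage(float x, float y,
                 const std::vector<float>& Xvec, const std::vector<float>& Yvec)
{
    return x >= Xvec.front() && x <= Xvec.back()
        && y >= Yvec.front() && y <= Yvec.back();
}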

MSalters
  • I actually am in release mode; I got the 50x improvement when I switched the 'I' variable from vector to array--so it seems like there was a problem with the way the function was calling the variable. – KPM Sep 01 '17 at 12:45