C++ function optimization

Question

I have a function as follows, it is called many times, which makes my program run slowly. Is there any way to optimize it? For example, using SIMD instructions or other techniques. The getray() function is a to retrieve a vector-3 given vector-2 query from a pre-computed look-up table. It is compiled in Visual-studio-2013 and the target configuration is x64 machine.

By the way, the for-loop which calls this function many times is already optimized by using OpenMP.

Thank you very much.

bool warpPlanarHomography(
const Eigen::Matrix3d& H_camera2_camera1
, const cv::Mat& image1
, const cv::Mat& image2
, FisheyeCameraUnified& cam1
, FisheyeCameraUnified& cam2
, const Eigen::Vector2i& patchCenter
, const int patchSize
, Eigen::Matrix<unsigned char, 7, 7>& patch1)
{
const int patchSize_2 = 3;
for (int v = 0; v < patchSize; ++v) // row
{
    for (int u = 0; u < patchSize; ++u)
    {
        Eigen::Vector2i p1 = Eigen::Vector2i(u - patchSize_2, v - patchSize_2).cast<int>() + patchCenter;

        if (p1(0, 0) < 0 || p1(1, 0) < 0 || p1(0, 0) >= image1.cols || p1(1, 0) >= image1.rows) return false;

        Eigen::Vector3d ray1;
        cam1.getRay(p1(1, 0), p1(0, 0), ray1);
        Eigen::Vector2d p2;
        if (!cam2.project(H_camera2_camera1 * ray1, p2))
        {
            return false;
        }
        if (p2.x() < 0.0 || p2.x() >= image2.cols - 1 ||
            p2.y() < 0.0 || p2.y() >= image2.rows - 1)
        {
            return false;
        }
        getInterpolatedPixel(image2, p2, &patch1(v, u));
    }
}
return true;
}

, where the project function looks like this

bool FisheyeCameraUnified::project(const Eigen::Vector3d& ray, Eigen::Vector2d& pt)
{
    double fx, fy, cx, cy, xi;
    fx = m_K(0, 0);
    fy = m_K(1, 1);
    cx = m_K(0, 2);
    cy = m_K(1, 2);
    xi = m_xi;

    double d = ray.norm();
    double rz = 1.0 / (ray(2) + xi * d);

    // Project the scene point to the normalized plane.
    Eigen::Vector2d m_d(ray(0) * rz, ray(1) * rz);

    // Apply the projection matrix.
    pt(0) = fx * m_d(0) + cx;
    pt(1) = fy * m_d(1) + cy;
    return true;
}

and getInterpolatedPixel() function as follows

void getInterpolatedPixel(const cv::Mat& image, const Eigen::Vector2d& coords, unsigned char* pixel)
{
    int ix = static_cast<int>(coords.x());
    int iy = static_cast<int>(coords.y());
    double dx = coords.x() - ix;
    double dy = coords.y() - iy;
    double dxdy = dx * dy;

    const double w00 = 1.0 - dx - dy + dxdy;
    const double w01 = dx - dxdy;
    const double w10 = dy - dxdy;
    const double w11 = dxdy;

    const unsigned char* p00 = image.data + iy * image.step.p[0] + ix * image.channels();
    const unsigned char* p01 = p00 + image.channels();
    const unsigned char* p10 = p00 + image.step.p[0];
    const unsigned char* p11 = p10 + image.channels();

    for (int i = 0; i < image.channels(); ++i)
    {
        double value = w11 * p11[i] + w10 * p10[i] + w01 * p01[i] + w00 * p00[i];
        pixel[i] = cv::saturate_cast<unsigned char>(value);
    }
}

As you said, the loop in `getInterpolatedPixel` is your bottleneck, did you try to use OpenMP for that? Is OpenMP even an option for you? For simple SIMD instruction usage, try [VC](https://github.com/VcDevel/Vc). — Torbjörn, Aug 19 '16 at 10:22
I did not try to use OpenMP for getInterpolatedPixel function, but I tried for WarpPlanarHomography. It does not give me any advantages. I guess the reason is that the for-loop is small, where the efficiency improvements cannot compensate the overhead caused by OpenMP. So I think it might be a good idea to optimize this small function with SIMD and optimize the outer large for-loop with OpenMP. Yeah, thank you for the library information. I will try it. — Peidong, Aug 19 '16 at 10:36
I think you `project` function was messed up by copy and paste — Simon Kraemer, Aug 19 '16 at 11:35

score 3 · Answer 1 · answered Aug 19 '16 at 10:11

3

measure where is bottleneck and try to optimize that place first
can you use float instead of double?
what are m_K(0, 0), m_K(1, 1)... can you replace it with constants
unroll for (int i = 0; i < image.channels(); ++i) loop if image can have only specific number of channels (1, 3, 4 are typical numbers)
call image.channels() only once and use stored value later
try adding inline modifyer to small functions

answered Aug 19 '16 at 10:11

Anton Malyshev

8,686
2
27
45

On most architectures nowadays, doubles are faster than floats. – Tom Tanner Aug 19 '16 at 10:13
thank you for the reply. I measured the bottleneck that it is the preojct() function and getinterpolatedpixel function. For the rest, it is possible. I will try it. – Peidong Aug 19 '16 at 10:16
what I am thinking is it possible to parallize the for-loop, since for 7x7 patch, it is called 49 times with the same instructions for different data? – Peidong Aug 19 '16 at 10:18
2

@TomTanner it's discussable http://stackoverflow.com/questions/3426165/is-using-double-faster-than-float – Anton Malyshev Aug 19 '16 at 10:19
I would also look at calculating the limits for u and v more carefully so that p1 can never be out of range in the central loop. Can you do an incremental version of `project`, given that each time round the inner loop, the y coordinate of p1 increments by 1? – Martin Bonner supports Monica Aug 19 '16 at 10:20
@Peidong I mean - try to find bottlenecks inside these functions – Anton Malyshev Aug 19 '16 at 10:20
@MartinBonner I cannot do that, it is not predictable. It might increment/decrement for the projected points in other direction. – Peidong Aug 19 '16 at 10:37
@TomTanner: `float` is definitely faster than `double`, unless you write code that converts them to double by multiplying them by `2.0` instead of `2.0f` or something. `float` requires half the memory bandwidth, and twice as many elements fit in a SIMD vector. add / mul / FMA have equal speed per SIMD vector for float/double on modern x86, according to [Agner Fog's instruction tables](http://agner.org/optimize/). So for code that vectorizes, twice the throughput for float. And even better than that for float div and sqrt, which are lower latency / higher throughput per vector than double. – Peter Cordes Aug 19 '16 at 17:29

score 3 · Answer 2 · answered Aug 19 '16 at 10:46

This should be considered in addition to other, more broadly focused answers.

Since getInterpolatedPixel is used in a tight loop, I focused there on reducing function calls:

void getInterpolatedPixel(const cv::Mat& image, const Eigen::Vector2d& coords, unsigned char* pixel)
{
    //save two function calls here
    double dx = coords.x();
    double dy = coords.y();
    int ix = static_cast<int>(dx);
    int iy = static_cast<int>(dy);
    dx -= ix;
    dy -= iy;
    //make this const
    const double dxdy = dx * dy;

    const double w00 = 1.0 - dx - dy + dxdy;
    const double w01 = dx - dxdy;
    const double w10 = dy - dxdy;
    const double w11 = dxdy;

    //cache image.channels()
    const int channels = image.channels();

    const unsigned char* p00 = image.data + iy * image.step.p[0] + ix * channels;
    const unsigned char* p01 = p00 + channels;
    const unsigned char* p10 = p00 + image.step.p[0];
    const unsigned char* p11 = p10 + channels;

    for (int i = 0; i < channels; ++i)
    {
        double value = w11 * p11[i] + w10 * p10[i] + w01 * p01[i] + w00 * p00[i];
        pixel[i] = cv::saturate_cast<unsigned char>(value);
    }
}

C++ function optimization

2 Answers2