Speed up nd-array computations in Numpy

Question

I want to implement a custom undistortion function like in OpenCV using numpy module on Python.

From documentation is known that undistort function is just a combination of of initUndistortRectifyMap() and remap().

Since remap() is quite simple operation, the main issue is to implement maps for remap().

I wrote a code to construct maps, but it seems to me that it works quite slowly.

The code consists of three main parts:

Reshape original image points to a well-shaped array to multiply it on the inverse of the camera matrix and perform a multiplication.
Distort points in the z = 1 plane.
Reshape points again to perform another multiplication to get back to image points.

I took an image with size of (4032 x 3024).

One matrix multiplication works on my pc for about 1 sec. And the distortion function works for about 2.4 sec.

I tried to multiply same shaped matrices with OpenCV Mats on C++, and I took 0.0002 sec.

The question is how to speed up the computations, because it seems to me that I am doing something wrong, because of such a big difference.

I found here an advice to make all arrays contiguous, but this did not help

The code:

import numpy
import time


def _distort_z_1(x, y, k1, k2, k3, k4, k5, k6, p1, p2):
    x2 = x * x
    y2 = y * y
    xy = x * y

    r2 = x2 + y2
    r4 = r2 * r2
    r6 = r4 * r2

    radial = \
        (1 + k1 * r2 + k2 * r4 + k3 * r6) / \
        (1 + k4 * r2 + k5 * r4 + k6 * r6)

    tangential_x = 2 * p1 * xy + p2 * (r2 + 2 * x2)
    tangential_y = p1 * (r2 + 2 * y2) + 2 * p2 * xy

    x_distorted = x * radial + tangential_x
    y_distorted = y * radial + tangential_y

    return x_distorted, y_distorted


# Change dimension from [2 x H x W] to [H x W x 3 x 1] to correctly multiply with [3 x 3] matrix
def _homogeneous_reshape(points_x, points_y):
    points_homogeneous_reshaped = (
        # Add extra axis to change from [H x W x 3] to [H x W x 3 x 1]
        numpy.expand_dims(
            # Change from [3 x H x W] to [H x W x 3]
            numpy.transpose(
                # Change from [2 x H x W] to [3 x H x W] (homogeneous coordinates)
                numpy.stack(
                    numpy.broadcast_arrays(points_x, points_y, 1)),
                (1, 2, 0)),
            -1))

    return points_homogeneous_reshaped


def _homogeneous_reshape_back(points_homogeneous_reshaped):
    points_homogeneous = (
        # Get back from [H x W x 3] to [3 x H x W]
        numpy.transpose(
            # Remove extra axis: [H x W x 3 x 1] to [H x W x 3]
            numpy.squeeze(
                points_homogeneous_reshaped),
            (2, 0, 1)))

    # Get back from homogeneous coordinates
    points_x, points_y, _ = points_homogeneous

    return points_x, points_y


def _get_undistort_rectify_maps(distortion_coefficients, camera_matrix, image_width, image_height):
    image_points = numpy.meshgrid(range(image_width), range(image_height))

    # print("BEGIN: _homogeneous_reshape")
    start = time.time()
    image_points_homogeneous_reshaped = _homogeneous_reshape(*image_points)
    end = time.time()
    print("END: _homogeneous_reshape", end - start)

    camera_matrix_inv = numpy.linalg.inv(camera_matrix)

    # print("BEGIN: camera_matrix_inv @ image_points_homogeneous_reshaped")
    start = time.time()
    image_points_homogeneous_z_1_reshaped = camera_matrix_inv @ image_points_homogeneous_reshaped
    end = time.time()
    print("END: camera_matrix_inv @ image_points_homogeneous_reshaped", end - start)

    # print("BEGIN: _homogeneous_reshape_back")
    start = time.time()
    image_points_z_1 = _homogeneous_reshape_back(image_points_homogeneous_z_1_reshaped)
    end = time.time()
    print("END: _homogeneous_reshape_back", end - start)

    # print("BEGIN: _distort_z_1")
    start = time.time()
    x_distorted, y_distorted = _distort_z_1(
        *image_points_z_1,
        **distortion_coefficients)
    end = time.time()
    print("END: _distort_z_1", end - start)

    # print("BEGIN: _homogeneous_reshape")
    start = time.time()
    points_homogeneous_z_1_distorted_reshaped = _homogeneous_reshape(x_distorted, y_distorted)
    end = time.time()
    print("END: _homogeneous_reshape", end - start)

    # print("BEGIN: _homogeneous_reshape")
    start = time.time()
    points_homogeneous_distorted_reshaped = camera_matrix @ points_homogeneous_z_1_distorted_reshaped
    end = time.time()
    print("END: camera_matrix @ points_homogeneous_z_1_distorted_reshaped", end - start)

    # print("BEGIN: _homogeneous_reshape_back")
    start = time.time()
    points_homogeneous_distorted = _homogeneous_reshape_back(points_homogeneous_distorted_reshaped)
    end = time.time()
    print("END: _homogeneous_reshape_back", end - start)

    return (map.astype(numpy.float32) for map in points_homogeneous_distorted)


if __name__ == "__main__":
    image_width = 4032
    image_height = 3024

    distortion_coefficients = {
        "k1": 0, "k2": 0, "k3": 0, "k4": 0, "k5": 0, "k6": 0,
        "p1": 0, "p2": 0}

    camera_matrix = numpy.array([
        [1000, 0, 2016],
        [0, 1000, 1512],
        [0, 0, 1]])

    map_x, map_y = _get_undistort_rectify_maps(
        distortion_coefficients,
        camera_matrix,
        image_width,
        image_height)

I tried to make your description more readable, but the code is really way too long for anyone to really dig into. Please try to give more of a [mcve] — Daniel F, Nov 07 '19 at 12:40
An easy way to get quick wins on Numpy-heavy code is to use http://numba.pydata.org/ – `@numba.jit(nopython=True)` on `_distort_z_1()`, for instance... — AKX, Nov 07 '19 at 12:47
Also, it's worth checking you have a fast build of Numpy. https://towardsdatascience.com/is-your-numpy-optimized-for-speed-c1d2b2ba515 — AKX, Nov 07 '19 at 12:57
As suggested you could use Numba, or even try to Cythonize it as it does simple maths operation — Matteo Peluso, Nov 07 '19 at 13:10
@klimenkov For subsequent invocations, it sped up the distort_z_1 function by about 50% for me. — AKX, Nov 07 '19 at 14:06
It seems like you're doing a lot of the calculations in `float64`, while it's likely that 32bit would be sufficient enough. | Especially `_distort_z_1` seems quite cache-unfriendly -- as written, it will amount to many linear passes over arrays that are quite large. It might be better to run that on smaller stripes of the image in sequence (and you can also run multiple stripes in parallel). — Dan Mašek, Nov 07 '19 at 17:17
Another observation -- the camera matrix is convenient for notation, but notice that it only holds 4 relevant constants - `fx, cx. fy, cy`. You only have to calculate `fx * x + cx` and `fy * y + cy` (and similar for the inverse). That's 2 multiplications and 2 additions for a coordinate pair. Full matrix multiplication as you have it performs 9 multiplies and 6 adds. Plus, you avoid all the reshaping shenanigans. | The first one can be even faster, since you can do it before `meshgrid`. — Dan Mašek, Nov 07 '19 at 20:35
Some progress: [v1](https://pastebin.com/EK0xQBkQ) -- use 32bit floats | [v2](https://pastebin.com/KTUgyBCy) -- and split into small blocks | [v3](https://pastebin.com/FVzGiz9r) -- simplify the matrix multiplication | [v4](https://pastebin.com/g1YEmanC) -- fold it together. Improvement from 2.7s to about 0.2s. Here's a [benchmark graph](https://i.imgur.com/Sfua1Mu.png). You could still parallelize it for some more gain. — Dan Mašek, Nov 08 '19 at 00:47

Speed up nd-array computations in Numpy

0 Answers0