Remove for loops for faster execution - vectorize

Question

As a part of my academic project, I am working on a linear filter for an image. Below is the code, using only NumPy (no external libraries) and want to eliminate for loops by vectorizing or any other options. How can I achieve vectorization for faster execution? Thanks for the help.

Inputs -

Image.shape - (568, 768)
weightArray.shape - (3, 3)

    def apply_filter(image: np.array, weight_array: np.array) -> np.array:
        rows, cols = image.shape
        height, width = weight_array.shape
        output = np.zeros((rows - height + 1, cols - width + 1))
    
        for rrow in range(rows - height + 1):
            for ccolumn in range(cols - width + 1):
                for hheight in range(height):
                    for wwidth in range(width):
                        imgval = image[rrow + hheight, ccolumn + wwidth]
                        filterval = weight_array[hheight, wwidth]
                        output[rrow, ccolumn] += imgval * filterval
                        
        return output

Welcome to StackOverflow @parvy. Please create a [Minimal and Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). Add code that is runnable and also some sample data. — Vishnudev Krishnadas, Mar 13 '21 at 14:47
Is SciPy off limits? I'm pretty sure this is a SciPy builtin. — user2357112, Mar 13 '21 at 15:23
You are probably looking for `scipy.signal.convolve2d`. Vectorizing your code is of course possible, but unless you are interested in this for educational purposes, I recommand using scipy. — Feodoran, Mar 13 '21 at 15:45
@user2357112supportsMonica No, external libraries. Want to use NumPy or python operations only — pravy, Mar 13 '21 at 15:47
It is an academic assignment. So they have restrictions @Feodoran — Vishnudev Krishnadas, Mar 13 '21 at 15:56
@Feodoran That's the requirement for the project by professor — pravy, Mar 13 '21 at 15:57
So this is a homework question? Then you should show what you have tried so far and where you failed. — Feodoran, Mar 13 '21 at 15:58
This is what you want probably https://medium.com/analytics-vidhya/implementing-convolution-without-for-loops-in-numpy-ce111322a7cd — Gulzar, Mar 13 '21 at 16:02

Brent Bradburn · Accepted Answer · 2021-04-28T14:35:54.317

Vectorization is the process of converting each explicit for loop into a 1-dimensional array operation. In Python, this will involve reimagining your data in terms of slices.

In the code below, I've provided a working vectorization of the kernel loop. This shows how to approach vectorization, but since it is only optimizing the 3x3 array, it doesn't give you the biggest available gains.

If you want to see a larger improvement, you'll vectorize the image array, which I've templated for you as well -- but left some as an exercise.

import numpy as np
from PIL import Image

## no vectorization
def applyFilterMethod1(image: np.array, weightArray: np.array) -> np.array:
    rows, cols = image.shape ; height, width = weightArray.shape
    output = np.zeros((rows - height + 1, cols - width + 1))

    for rrow in range(rows - height + 1):
        for ccolumn in range(cols - width + 1):
            for hheight in range(height):
                for wwidth in range(width):
                    imgval = image[rrow + hheight, ccolumn + wwidth]
                    filterval = weightArray[hheight, wwidth]
                    output[rrow, ccolumn] += imgval * filterval
                    
    return output

## vectorize the kernel loop (~3x improvement)
def applyFilterMethod2(image: np.array, weightArray: np.array) -> np.array:
    rows, cols = image.shape ; height, width = weightArray.shape
    output = np.zeros((rows - height + 1, cols - width + 1))

    for rrow in range(rows - height + 1):
        for ccolumn in range(cols - width + 1):
            imgval = image[rrow:rrow + height, ccolumn:ccolumn + width]
            filterval = weightArray[:, :]
            output[rrow, ccolumn] = sum(sum(imgval * filterval))
                    
    return output

## vectorize the image loop (~50x improvement)
def applyFilterMethod3(image: np.array, weightArray: np.array) -> np.array:
    rows, cols = image.shape ; height, width = weightArray.shape
    output = np.zeros((rows - height + 1, cols - width + 1))

    for hheight in range(height):
        for wwidth in range(width):
            imgval = 0 ## TODO -- construct a compatible slice
            filterval = weightArray[hheight, wwidth]
            output[:, :] += imgval * filterval
                    
    return output

src = Image.open("input.png")
sb = np.asarray(src)
cb = np.array([[1,2,1],[2,4,2],[1,2,1]])
cb = cb/sum(sum(cb)) ## normalize

db = applyFilterMethod2(sb, cb)

dst = Image.fromarray(db)
dst.convert("L").save("output.png")
#src.show() ; dst.show()

Note: You could probably remove all four for loops, with some additional complexity. However, because this would only eliminate the overhead of 9 iterations (in this example), I don't estimate that it would yield any additional performance gains over applyFilterMethod3. Furthermore, although I haven't attempted it, the way I imagine it would be done might add more overhead than it would remove.

FYI: This is a standard image convolution (supporting only grayscale as implemented). I always like to point out that, in order to be mathematically correct, this would need to compensate for the gamma compression that is implicit in nearly every default image encoding -- but this little detail is often ignored.

Discussion

This type of vectorization is is often necessary in Python, specifically, because the standard Python interpreter is extremely inefficient at processing large for loops. Explicitly iterating over each pixel of an image, therefore, wastes a lot time. Ultimately, though, the vectorized implementation does not change the amount of real work performed, so we're only talking about eliminating an overhead aspect of the algorithm.

However, vectorization has a side-benefit: Parallelization. Lumping a large amount of data processing onto a single operator gives the language/library more flexibility in how to optimize the execution. This might include executing your embarrassingly parallel operation on a GPU -- if you have the right tools, for example the Tensorflow image module.

Python's seamless support for array programming is one reason that it has become highly popular for use in machine learning, which can be extremely compute intensive.

Solution

Here's the solution to imgval assignment, which was left as an exercise above.

imgval = image[hheight:hheight+rows - height+1, wwidth:wwidth+cols - width +1]

Since this appeared to be a homework question, I left the solution incomplete. I'll come back and fill in the `TODO` portion at a later date. _Hint1_: It follows the same general form as that of `applyFilterMethod2`, but the translation is not trivial. _Hint2_: If you get it wrong, the Python interpreter should give you details about the slice incompatibilities in the array operations. — Brent Bradburn, Mar 14 '21 at 23:15
Although this may have been someone's homework, the concepts addressed may be relevant for future visitors. The vectorization in Method2 and Method3 respectively produce performance gains of 3x and 50x. — Brent Bradburn, Mar 14 '21 at 23:22
[What is “vectorization”?](https://stackoverflow.com/q/1422149/86967) — Brent Bradburn, Mar 15 '21 at 02:13
[What are vectors and how are they used in programming?](https://stackoverflow.com/questions/508374/what-are-vectors-and-how-are-they-used-in-programming) — Brent Bradburn, Mar 15 '21 at 02:22

Stuart · Answer 2 · 2021-03-17T20:09:21.477

You can construct an array of sliced views of the image, each shifted by the indices of the weights array, and then multiply it by the weights and take the sum.

def apply_filter(image: np.array, weights: np.array) -> np.array:
    height, width = weights.shape
    indices = np.indices(weights.shape).T.reshape(weights.size, 2)
    views = np.array([image[r:-height+r,c:-width+c] for r, c in indices])
    return np.inner(views.T, weights.T.flatten()).T  # sum product

(I had to transpose and reshape at several points to get the data into the desired shapes and order. There may be simpler ways.)

There is still a sneaky for loop in the form of a list comprehension over the weights indices, but we minimize the operations inside the for loop to creating a set of slice views. The loop could potentially be avoided using sliding_window_view, but it's not clear if that would improve performance; or stride_tricks.as_strided (see answers to this question).

Remove for loops for faster execution - vectorize

2 Answers2

Discussion

Solution