
I have written the following resizing algorithm which can correctly scale an image up or down. It's far too slow, though, due to the inner iteration through the array of weights on each loop.

I'm fairly certain I should be able to split the algorithm into two passes, much like you would with a two-pass Gaussian blur, which would vastly reduce the operational complexity and speed up performance. Unfortunately I can't get it to work. Would anyone be able to help?
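For context on why the separation should be possible: each combined weight is just the product yw.Value * xw.Value, so the double sum over the kernel factors into two 1D sums (the notation below is mine, not from the code):

    // tmp is an intermediate image resized in width only.
    // h = horizontal weights for column x, v = vertical weights for row y.
    tmp[x, j]    = Σi h[i] * source[i, j]
    target[x, y] = Σj v[j] * tmp[x, j]

The inner sum can be computed once into the intermediate image and reused for every output row, so each pass performs a single 1D weighted sum per pixel instead of the full 2D one.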

Parallel.For(
    startY,
    endY,
    y =>
    {
        if (y >= targetY && y < targetBottom)
        {
            Weight[] verticalValues = this.verticalWeights[y].Values;

            for (int x = startX; x < endX; x++)
            {
                Weight[] horizontalValues = this.horizontalWeights[x].Values;

                // Destination color components
                Color destination = new Color();

                // This is where there is too much operation complexity.
                foreach (Weight yw in verticalValues)
                {
                    int originY = yw.Index;

                    foreach (Weight xw in horizontalValues)
                    {
                        int originX = xw.Index;
                        Color sourceColor = Color.Expand(source[originX, originY]);
                        float weight = yw.Value * xw.Value;
                        destination += sourceColor * weight;
                    }
                }

                destination = Color.Compress(destination);
                target[x, y] = destination;
            }
        }
    });

Weights and indices are calculated as follows, one set for each dimension:

/// <summary>
/// Computes the weights to apply at each pixel when resizing.
/// </summary>
/// <param name="destinationSize">The destination section size.</param>
/// <param name="sourceSize">The source section size.</param>
/// <returns>
/// The <see cref="T:Weights[]"/>.
/// </returns>
private Weights[] PrecomputeWeights(int destinationSize, int sourceSize)
{
    IResampler sampler = this.Sampler;
    float ratio = sourceSize / (float)destinationSize;
    float scale = ratio;

    // When shrinking, broaden the effective kernel support so that we still
    // visit every source pixel.
    if (scale < 1)
    {
        scale = 1;
    }

    float scaledRadius = (float)Math.Ceiling(scale * sampler.Radius);
    Weights[] result = new Weights[destinationSize];

    // Make the weights slices, one source for each column or row.
    Parallel.For(
        0,
        destinationSize,
        i =>
            {
                float center = ((i + .5f) * ratio) - 0.5f;
                int start = (int)Math.Ceiling(center - scaledRadius);

                if (start < 0)
                {
                    start = 0;
                }

                int end = (int)Math.Floor(center + scaledRadius);

                if (end > sourceSize)
                {
                    end = sourceSize;

                    if (end < start)
                    {
                        end = start;
                    }
                }

                float sum = 0;
                result[i] = new Weights();

                List<Weight> builder = new List<Weight>();
                for (int a = start; a < end; a++)
                {
                    float w = sampler.GetValue((a - center) / scale);

                    if (w < 0 || w > 0)
                    {
                        sum += w;
                        builder.Add(new Weight(a, w));
                    }
                }

                // Normalise the values
                if (sum > 0 || sum < 0)
                {
                    builder.ForEach(w => w.Value /= sum);
                }

                result[i].Values = builder.ToArray();
                result[i].Sum = sum;
            });

    return result;
}
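
For reference, a minimal usage sketch showing one weights array being built per dimension. The targetWidth/targetHeight and sourceWidth/sourceHeight names are assumptions for illustration; they do not appear in the code above.

// One set of weights per dimension; the width/height variables are
// assumed to be available at this point.
this.horizontalWeights = this.PrecomputeWeights(targetWidth, sourceWidth);
this.verticalWeights = this.PrecomputeWeights(targetHeight, sourceHeight);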

/// <summary>
/// Represents the weight to be added to a scaled pixel.
/// </summary>
protected class Weight
{
    /// <summary>
    /// The pixel index.
    /// </summary>
    public readonly int Index;

    /// <summary>
    /// Initializes a new instance of the <see cref="Weight"/> class.
    /// </summary>
    /// <param name="index">The index.</param>
    /// <param name="value">The value.</param>
    public Weight(int index, float value)
    {
        this.Index = index;
        this.Value = value;
    }

    /// <summary>
    /// Gets or sets the result of the interpolation algorithm.
    /// </summary>
    public float Value { get; set; }
}

/// <summary>
/// Represents a collection of weights and their sum.
/// </summary>
protected class Weights
{
    /// <summary>
    /// Gets or sets the values.
    /// </summary>
    public Weight[] Values { get; set; }

    /// <summary>
    /// Gets or sets the sum.
    /// </summary>
    public float Sum { get; set; }
}

Each IResampler provides the appropriate series of weights based on the given index. The bicubic resampler works as follows.

/// <summary>
/// The function implements the bicubic kernel algorithm W(x) as described on
/// <see href="https://en.wikipedia.org/wiki/Bicubic_interpolation#Bicubic_convolution_algorithm">Wikipedia</see>
/// A commonly used algorithm within image processing that preserves sharpness better than triangle interpolation.
/// </summary>
public class BicubicResampler : IResampler
{
    /// <inheritdoc/>
    public float Radius => 2;

    /// <inheritdoc/>
    public float GetValue(float x)
    {
        // The coefficient.
        float a = -0.5f;

        if (x < 0)
        {
            x = -x;
        }

        float result = 0;

        if (x <= 1)
        {
            result = (((1.5f * x) - 2.5f) * x * x) + 1;
        }
        else if (x < 2)
        {
            result = (((((a * x) + 2.5f) * x) - 4) * x) + 2;
        }

        return result;
    }
}

Here's an example of an image resized by the existing algorithm. The output is correct (note the silvery sheen is preserved).

[Image: original unscaled image]

[Image: image halved in size using the bicubic resampler]

The code is part of a much larger library that I am writing to add image processing to corefx.

James South
  • what are the indexes and weight values? Without them there is no way to compute the recursion subdivision ... other than brute force ... and only if some properties are met ... also a good idea would be to share a sample image before and after filtering so there is something to compare with ... – Spektre Jan 15 '16 at 08:32
  • @Spektre I've added a lot more information to the question. If there is anything else you need please let me know. – James South Jan 16 '16 at 00:47
  • well, as this is bi-cubic (not an arbitrary convolution, which is what I assumed from your code at a first, not-too-deep look), this is not solvable by recursion subdivision. Instead this can be separated into a 2x1D problem as you intend, but I am not sure it will be faster (on nowadays computers) than direct 2D re-sampling. Have to make some tests about it but will not have time for this till Monday. – Spektre Jan 16 '16 at 08:35
  • Yeah, bicubic is one of many algorithms I use. That would be great if you could do some tests. When I switched my Gaussian blur algorithm to 2x1D it yielded much better performance, so fingers crossed this should too. Monday is absolutely fine, no work at the weekend :) – James South Jan 16 '16 at 10:14
  • 1
    Gauss separated from 2D to 2x1D is better because it is 2D integral. The bi-cubic filtration is not an integral so do not expect such high speed up. also the gausian 1D can be sometimes speeded up from `O(n)` to `O(log(n))` by recursive subdivision similar to FFT algorithms ... n x integrating on smaller area is faster ... – Spektre Jan 16 '16 at 11:40
  • I am no image-processing expert, but I have experience with optimizing code for performance. It seems like the first code section is where you are trying to get better performance. If you can make your top code section compilable and executable without too many other code dependencies, then I can help you. In either case your code does not seem to take advantage of cache lines – Vikhram Jan 16 '16 at 12:02
  • @Vikhram The code itself is part of a larger library which in itself doesn't have any dependencies other than corefx. Splitting it out wouldn't allow testing of image output. – James South Jan 16 '16 at 12:22
  • @JamesSouth the 2 x 1D cubic resample (680ms) is twice as slow as the 1x bi-cubic resample (380ms) for your input image. So, as I was afraid, the subdivision will only make things worse. What polynomial do you use? Mine gets a bit aliased on the eyeballs. – Spektre Jan 18 '16 at 09:37
  • @Spektre Damn, I was hoping for good news. 380ms seems awfully slow to me. "What polynomial" I'm sorry, I'm not sure what you mean. – James South Jan 18 '16 at 09:47
  • @JamesSouth that is not optimized code because it supports variable pixel formats (hence many if statements per pixel; tested on a non-threaded 32-bit app, win7 x64, 3.2GHz AMD CPU, pure SW, no GPU acceleration). What time do you get and what do you want to achieve? For a fixed pixel format it should be **way faster**. For bi-cubic you can use any polynomial that follows certain rules like continuity between patches and shape ... I use this: [interpolation cubic](http://stackoverflow.com/a/20517874/2521214) – Spektre Jan 18 '16 at 10:11
  • @Spektre You're way beyond my ability to understand now. I'm teaching myself this stuff as I go along. What you see there is the core of my resampler with fixed RGBA, each color component a float. I meant slow compared to System.Drawing. My speeds vary from machine to machine, 2 sec to decode, resize, and encode on my laptop. – James South Jan 18 '16 at 11:12

3 Answers


You could try a weighted Voronoi diagram. Pick a random set of points and compute the Voronoi diagram. Smooth the polygons with Lloyd's algorithm and interpolate the color of each polygon. With the weights, compute the weighted Voronoi diagram. For example, Voronoi stippling and mosaics: http://www.mrl.nyu.edu/~ajsecord/stipples.html and http://www.evilmadscientist.com/2012/stipplegen-weighted-voronoi-stippling-and-tsp-paths-in-processing/.
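
For illustration only, a minimal sketch of one Lloyd relaxation step over a pixel grid, assigning each pixel to its nearest seed by brute force; the LloydRelax name and seed arrays are hypothetical and not part of the question's code:

// One Lloyd iteration: assign every pixel to its nearest seed, then move
// each seed to the centroid of the pixels assigned to it.
public static void LloydRelax(float[] seedX, float[] seedY, int width, int height)
{
    int n = seedX.Length;
    double[] sumX = new double[n];
    double[] sumY = new double[n];
    int[] counts = new int[n];

    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            // Nearest seed by brute force; a spatial index would be faster.
            int best = 0;
            float bestDistance = float.MaxValue;
            for (int i = 0; i < n; i++)
            {
                float dx = x - seedX[i];
                float dy = y - seedY[i];
                float distance = (dx * dx) + (dy * dy);
                if (distance < bestDistance)
                {
                    bestDistance = distance;
                    best = i;
                }
            }

            sumX[best] += x;
            sumY[best] += y;
            counts[best]++;
        }
    }

    // Move each seed to the centroid of its cell.
    for (int i = 0; i < n; i++)
    {
        if (counts[i] > 0)
        {
            seedX[i] = (float)(sumX[i] / counts[i]);
            seedY[i] = (float)(sumY[i] / counts[i]);
        }
    }
}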

Micromega

Judging from an abstract point of view (not knowing much about image manipulation), it looks like you're computing the values for `weight` and `sourceColor` (in the innermost foreach loop) multiple times, whenever the same pair of indices crops up again; would it be feasible to simply precompute them in advance? You would need to compute a 'direct product' matrix for your horizontal and vertical weight matrices (simply multiplying the values for each pair of indices (x, y)) and could also apply Color.Expand to the source in advance. These tasks can be done in parallel, and the 'direct product' (sorry, I don't know the correct name for that beast) should be available in lots of libraries.
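
For illustration, a rough sketch of the precomputation being suggested, reusing the question's horizontalWeights/verticalWeights fields; the PrecomputeCombinedWeights name is hypothetical, and the result costs roughly targetWidth × targetHeight × kernel-size floats of memory:

// Builds, for every destination (x, y), the outer product of the horizontal
// weights for column x and the vertical weights for row y, flattened row-major.
private float[,][] PrecomputeCombinedWeights()
{
    int width = this.horizontalWeights.Length;
    int height = this.verticalWeights.Length;
    float[,][] combined = new float[width, height][];

    Parallel.For(
        0,
        height,
        y =>
        {
            Weight[] vertical = this.verticalWeights[y].Values;

            for (int x = 0; x < width; x++)
            {
                Weight[] horizontal = this.horizontalWeights[x].Values;
                float[] product = new float[vertical.Length * horizontal.Length];

                for (int j = 0; j < vertical.Length; j++)
                {
                    for (int i = 0; i < horizontal.Length; i++)
                    {
                        product[(j * horizontal.Length) + i] = vertical[j].Value * horizontal[i].Value;
                    }
                }

                combined[x, y] = product;
            }
        });

    return combined;
}

The inner loop in the question would then index into combined[x, y] instead of multiplying yw.Value * xw.Value on every pass, and Color.Expand could similarly be applied to the whole source once up front.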

Daniel

Ok so here's how I went about it.

The trick is to first resize only the width of the image, keeping the height the same as the original. We store the resultant pixels in a temporary image.

Then it's a case of resizing that image down to our final output.

As you can see, we are no longer iterating through both weight collections for each pixel. Despite having to iterate through the outer pixel loop twice, the algorithm was much faster in its operation, averaging around 25% faster on my test images.

// Interpolate the image using the calculated weights.
// First process the columns.
Parallel.For(
    0,
    sourceBottom,
    y =>
    {
        for (int x = startX; x < endX; x++)
        {
            Weight[] horizontalValues = this.HorizontalWeights[x].Values;

            // Destination color components
            Color destination = new Color();

            foreach (Weight xw in horizontalValues)
            {
                int originX = xw.Index;
                Color sourceColor = Color.Expand(source[originX, y]);
                destination += sourceColor * xw.Value;
            }

            destination = Color.Compress(destination);
            this.firstPass[x, y] = destination;
        }
    });

// Now process the rows.
Parallel.For(
    startY,
    endY,
    y =>
    {
        if (y >= targetY && y < targetBottom)
        {
            Weight[] verticalValues = this.VerticalWeights[y].Values;

            for (int x = startX; x < endX; x++)
            {
                // Destination color components
                Color destination = new Color();

                foreach (Weight yw in verticalValues)
                {
                    int originY = yw.Index;
                    int originX = x;
                    Color sourceColor = Color.Expand(this.firstPass[originX, originY]);
                    destination += sourceColor * yw.Value;
                }

                destination = Color.Compress(destination);
                target[x, y] = destination;
            }
        }
    });
James South
  • The big speed gains are when you are exclusively reading consecutive areas of memory. To do that, you have to transpose the matrix twice - once after scaling X, then after scaling Y. The upside is that you can transpose during write operations, which the CPU can do asynchronously if there are no read dependencies on that region of memory. – Lilith River Jan 27 '16 at 18:28
  • This should simplify your code - you now only need to know how to scale in one direction, and you always transpose when writing. You call Scale1D twice, passing in different widths/heights and weighting structures each time. Now, I've only tested this in C, so array bounds checks may slow things down, but I think the underlying cache coherency benefits should work within the CLR JIT as well. This transpose-when-writing, read-consecutive approach alone seemed to provide close to an order of magnitude in performance gains. – Lilith River Jan 27 '16 at 18:30
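
To make the suggestion concrete, a minimal sketch of what a single transpose-on-write pass might look like; Scale1D and the [row, column] buffer layout are assumptions, not the question's actual API:

// Scales along one axis only. Reads stay within a single source row
// (consecutive memory, assuming [row, column] layout) and the result is
// written transposed so the next pass can again read row by row.
private static void Scale1D(Color[,] source, Color[,] target, Weights[] weights, int rows)
{
    Parallel.For(
        0,
        rows,
        y =>
        {
            for (int x = 0; x < weights.Length; x++)
            {
                Weight[] values = weights[x].Values;
                Color destination = new Color();

                foreach (Weight w in values)
                {
                    destination += Color.Expand(source[y, w.Index]) * w.Value;
                }

                // Transpose while writing: row y becomes column y of the output.
                target[x, y] = Color.Compress(destination);
            }
        });
}

Calling it twice, first with the horizontal weights over the source rows and then with the vertical weights over the intermediate buffer, scales both dimensions while keeping reads sequential in each pass.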