
Hi everyone! I need to update an entire `ushort[]` array very quickly.

Right now I am using the following code:

    public ushort[] ImageUpdatePixels(ushort[] data)
    {
        var uMax = 65535d;
        var w = someValue;
        var l = someOtherValue;

        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
            {
                var index = y * width + x;
                var pixelValue = data[index] / uMax;

                pixelValue = pixelValue / w + 0.5 - l / w;
                pixelValue = Math.Min(1, Math.Max(0, pixelValue));
                pixelValue = Math.Pow(pixelValue, 1 / imageSettings.Gamma);

                data[index] = (ushort)(pixelValue * uMax);
            }

        return data;
    }

With a 1000x1000 image everything works fine, but at 2000x2000 or larger the CPU can no longer keep up.

Are there any other ways to go through the whole array very fast and apply the right conversions?

  • Not a huge difference, but you need only one loop over `width*height` elements, then you don’t need to compute `index`. Hopefully your optimizing compiler already does this, but you could pre-compute things such as `1 / imageSettings.Gamma` before the loop. – Cris Luengo Jul 17 '22 at 17:27
  • Can you convert this to only work in `ushort` integers? If so it should be a good bit faster. Also you could vectorize it, which would massively speed it up, but `Math.Pow` will be a problem, can you eliminate that? – Charlieface Jul 17 '22 at 17:54
  • @CrisLuengo Thanks, I really hadn't noticed that. But it's still not enough: looping through a couple of million pixels takes too long. – Евгений Федоров Jul 17 '22 at 18:16
  • @Charlieface The pixel processing requires calculating float values for each pixel inside the loop. – Евгений Федоров Jul 17 '22 at 18:18
  • Yeah, I said it wasn’t a big difference. Just cleaner code and a bit of speed increase. This is why I didn’t post it as an answer, it’s just a comment. An answer would probably point you to parallel processing, and the use of a look-up table instead of a computation for every pixel. – Cris Luengo Jul 17 '22 at 18:20
  • how long does this code take, and for what size of input (state all values of all variables) and what "optimizations" are enabled? on what hardware do you run this? – Christoph Rackwitz Jul 17 '22 at 20:55
  • @ChristophRackwitz The Stopwatch timer shows ~180-250 ms. data = {ushort[8294400]}; w = 0.005859464408331426; l = 0.086915388723582815. Which "optimizations" do you mean? – Евгений Федоров Jul 18 '22 at 04:51
  • Split the process into parts; don't read the whole image at once. For example, add another loop or a thread that reads the image part by part and then applies the manipulation. – redParrot Jul 21 '22 at 11:59

1 Answer


There's no getting around the fact that this is an O(N^2) operation in the image side length N, but there are things we can do to help. In general, here are some things you can do to improve floating-point performance:

  1. Remove as many divisions as possible while staying true to the algorithm.
  2. Remove as many `Math.Pow` calls as possible while staying true to the algorithm.
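Point 1 mostly means hoisting loop-invariant divisions out of the pixel loop. A minimal sketch of that idea, applied to the question's method (the parameters `w`, `l`, and `gamma` stand in for `someValue`, `someOtherValue`, and `imageSettings.Gamma`, which aren't shown in the question; the helper name `ImageUpdatePixelsFast` is mine):

```csharp
using System;

public static class PixelOps
{
    // Sketch: same transform as the question, but every division that does
    // not depend on the pixel value is computed once before the loop, and
    // the two nested loops collapse into one.
    public static ushort[] ImageUpdatePixelsFast(
        ushort[] data, int width, int height, double w, double l, double gamma)
    {
        const double uMax = 65535d;
        double invUMax = 1.0 / uMax;     // multiply per pixel instead of dividing
        double invW = 1.0 / w;
        double offset = 0.5 - l * invW;  // 0.5 - l / w, computed once
        double invGamma = 1.0 / gamma;   // exponent computed once

        int n = width * height;          // single loop, no index arithmetic
        for (int i = 0; i < n; i++)
        {
            double p = data[i] * invUMax;
            p = p * invW + offset;
            p = Math.Min(1, Math.Max(0, p));
            p = Math.Pow(p, invGamma);
            data[i] = (ushort)(p * uMax);
        }
        return data;
    }
}
```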

But how slow is it?

I inserted this code into a small benchmarking harness. N is the width and height of the image in pixels.

  • I removed the inner loop (`for (int x = 0; x < width; x++)`) and replaced it with one large loop over the whole array to create `OneLoop`. This is about 3% faster.
  • I manually vectorized the code (using Intel SIMD intrinsics), which made the algorithm approximately 10% faster.

| Method     | N    | Mean       | Error     | StdDev    | Ratio |
|------------|------|------------|-----------|-----------|-------|
| Base       | 2000 | 38.567 ms  | 0.2861 ms | 0.2389 ms | 1.00  |
| OneLoop    | 2000 | 37.863 ms  | 0.6156 ms | 0.5457 ms | 0.98  |
| Vectorized | 2000 | 34.362 ms  | 0.3255 ms | 0.2885 ms | 0.89  |
| Base       | 4000 | 155.476 ms | 1.9151 ms | 1.7913 ms | 1.00  |
| OneLoop    | 4000 | 149.369 ms | 0.7717 ms | 0.6444 ms | 0.96  |
| Vectorized | 4000 | 136.207 ms | 0.7508 ms | 0.6656 ms | 0.88  |
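The exact vectorized code from these benchmarks isn't reproduced here, but one possible shape of it, sketched with the portable `System.Numerics.Vector<T>` API instead of raw Intel intrinsics (the helper name `UpdateVectorized` and the precomputed parameters `invW`, `offset`, `invGamma` are mine, standing for `1/w`, `0.5 - l/w`, and `1/Gamma` from the question):

```csharp
using System;
using System.Numerics;

public static class VectorPixelOps
{
    // Sketch of a vectorized pixel loop. Note that Math.Pow has no vector
    // form, so the pow step still runs scalar per lane -- which is exactly
    // why vectorizing alone only bought ~10% in the benchmarks above.
    public static void UpdateVectorized(
        float[] pixels, float invW, float offset, float invGamma)
    {
        int lanes = Vector<float>.Count;
        var vInvW = new Vector<float>(invW);
        var vOffset = new Vector<float>(offset);

        int i = 0;
        for (; i <= pixels.Length - lanes; i += lanes)
        {
            // Scale, offset, and clamp a whole vector of pixels at once.
            var v = new Vector<float>(pixels, i);
            v = v * vInvW + vOffset;
            v = Vector.Min(Vector<float>.One, Vector.Max(Vector<float>.Zero, v));
            v.CopyTo(pixels, i);

            // Scalar fallback for the pow step, one lane at a time.
            for (int j = 0; j < lanes; j++)
                pixels[i + j] = (float)Math.Pow(pixels[i + j], invGamma);
        }
        for (; i < pixels.Length; i++)   // remainder elements
        {
            float p = pixels[i] * invW + offset;
            p = Math.Min(1f, Math.Max(0f, p));
            pixels[i] = (float)Math.Pow(p, invGamma);
        }
    }
}
```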

What about removing Math.Pow?

By simply removing all occurrences of `Math.Pow`, this happened:

| Method          | N    | Mean      | Error     | StdDev    | Ratio | RatioSD |
|-----------------|------|-----------|-----------|-----------|-------|---------|
| VectorizedNoPow | 2000 | 4.335 ms  | 0.0852 ms | 0.1276 ms | 0.47  | 0.02    |
| BaseNoPow       | 2000 | 9.293 ms  | 0.1563 ms | 0.1386 ms | 1.00  | 0.00    |
| VectorizedNoPow | 4000 | 17.929 ms | 0.2946 ms | 0.2611 ms | 0.48  | 0.01    |
| BaseNoPow       | 4000 | 37.412 ms | 0.7114 ms | 0.5940 ms | 1.00  | 0.00    |

Yep: compared with the earlier numbers, the vectorized version is now almost eight times faster (34.4 ms down to 4.3 ms at N = 2000), and even the plain loop is roughly four times faster, at both image sizes.

TL;DR:

Math.Pow is bogging down your code and making the update loop very, very slow. To improve performance, you should re-do the algorithm to remove all reliance on Math.Pow. It will also help if you use C# intrinsics to vectorize your code, but that is optional. Best of luck!
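Beyond vectorization, the comments on the question also suggested parallel processing. A hedged sketch of that route using `Parallel.For` (the helper name `UpdateParallel` and the precomputed parameters `invW`, `offset`, `invGamma` are mine, standing for `1/w`, `0.5 - l/w`, and `1/Gamma`):

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelPixelOps
{
    // Sketch: split the rows across CPU cores with Parallel.For. Each row
    // touches a disjoint slice of `data`, so no locking is needed.
    public static void UpdateParallel(
        ushort[] data, int width, int height,
        double invW, double offset, double invGamma)
    {
        const double uMax = 65535d;
        Parallel.For(0, height, y =>
        {
            int rowStart = y * width;
            for (int x = 0; x < width; x++)
            {
                double p = data[rowStart + x] / uMax;
                p = p * invW + offset;
                p = Math.Min(1, Math.Max(0, p));
                p = Math.Pow(p, invGamma);
                data[rowStart + x] = (ushort)(p * uMax);
            }
        });
    }
}
```

This composes with the other improvements: each worker thread can run the vectorized, pow-free inner loop over its own band of rows.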

Mooshua
  • Since you took the time to make vectorized versions, it would be good to include the code in your answer. When you say you removed Pow, you mean you just left out that work entirely, so the function doesn't do anywhere near the same thing anymore, just storing the FP `min` calculation? – Peter Cordes Jul 29 '22 at 20:34
  • FP power is a non-trivial operation, and it's not like it's easy to replace with cheaper operations. You might be able to approximate it with a polynomial or something, but probably to get close to the same result, you'd need to manually do some log/exp stuff. You can vectorize fast approximations of those which should get within one `ushort`, nowhere near as much precision needed as for 1ulp of a float. So yeah that could help. e.g. a polynomial approximation for the mantissa with maybe only 4 coefficients, but still SIMD integer bit-manip of the FP exp/mantissa fields. – Peter Cordes Jul 29 '22 at 20:36
  • e.g. see [Fastest Implementation of Exponential Function Using AVX](https://stackoverflow.com/q/48863719) / [Efficient implementation of log2(\_\_m256d) in AVX2](https://stackoverflow.com/q/45770089) (my answer has some stuff about float, not just double). – Peter Cordes Jul 29 '22 at 20:37
  • @PeterCordes Yep, just straight up removing the pow--no replacement. There are ways to estimate it in the context of gamma calculations (such as linear interpolation over a set of points), but my intention was just to show how much time in the loop was spent performing the operation. Thanks for posting that AVX approximation resource, though, that looks handy! As for the vectorized code, I'll publish that shortly :) – Mooshua Jul 29 '22 at 22:16
  • Even without removing `Math.Pow`, you can only ever have 65536 different versions of that operation, since that's the range of the input, and that's vastly less than the actual image sizes they're apparently dealing with... so you could just make an array to cache the result whenever doing one, and look up the cached result before deciding to execute `Math.Pow`. Just takes 150-200 k of extra memory, depending on the implementation (either an `int[65536]` array where you can put in -1 as unfilled, or a `ushort[65536]` array plus a `bool[65536]` to indicate which values have been done so far). – Nyerguds Aug 11 '22 at 04:54
  • Heck, if it's done on multiple images, it might be worth it to precalculate all those values once and feed them into the function multiple times. Or retain the caching from previous image operations. – Nyerguds Aug 11 '22 at 04:57
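The look-up-table idea from the last two comments can be sketched like this: since each output pixel depends only on its 16-bit input, the whole transform collapses to a table of 65536 entries built once (the helper names `BuildLut`/`ApplyLut` and the precomputed parameters `invW`, `offset`, `invGamma` are mine, standing for `1/w`, `0.5 - l/w`, and `1/Gamma` from the question):

```csharp
using System;

public static class LutPixelOps
{
    // Precompute the full transform for all 65536 possible ushort inputs.
    // Math.Pow runs 65536 times total, regardless of image size.
    public static ushort[] BuildLut(double invW, double offset, double invGamma)
    {
        const double uMax = 65535d;
        var lut = new ushort[65536];
        for (int v = 0; v <= 65535; v++)
        {
            double p = v / uMax;
            p = p * invW + offset;
            p = Math.Min(1, Math.Max(0, p));
            p = Math.Pow(p, invGamma);
            lut[v] = (ushort)(p * uMax);
        }
        return lut;
    }

    // Each pixel becomes a single array lookup -- no per-pixel Math.Pow.
    public static void ApplyLut(ushort[] data, ushort[] lut)
    {
        for (int i = 0; i < data.Length; i++)
            data[i] = lut[data[i]];
    }
}
```

The table costs 128 KB and can be reused across frames as long as `w`, `l`, and `Gamma` are unchanged, which matches the "precalculate once, feed into the function multiple times" suggestion above.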