
Note on duplicate message:

Similar themes, but not exactly a duplicate, especially since the loop is still the fastest method. Thanks.

Goal:

Quickly upscale an array from [small,small] to [big,big] by a factor, without using an image library. The scaling is very simple: each small-array value is spread over several big-array values, after being normalized by the number of big-array values it is spread over. In other words, this is "flux conserving" in astronomical terms: a value of 16 in the small array, spread over 4 values of the big array (factor of 2), becomes four 4's, so the total amount is retained.

Problem:

I have working code that does the upscaling, but it is not very fast compared to downscaling. Upscaling is actually easier than downscaling (which requires many sums, in this basic case): upscaling just requires already-known data to be put into big chunks of a preallocated array.

For a working example, a [2,2] array of [16,24;8,16]:

16 , 24

8 , 16

Multiplied by a factor of 2 for a [4,4] array would have the values:

4 , 4 , 6 , 6

4 , 4 , 6 , 6

2 , 2 , 4 , 4

2 , 2 , 4 , 4
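
The worked example above can be checked with a short sketch. The helper name `upscale_flux_conserving` is hypothetical (it is not one of the functions in this question), and the sketch assumes `factor` is a positive integer:

```python
import numpy as np

def upscale_flux_conserving(small, factor):
    """Spread each value over a factor x factor block, dividing by the block size."""
    blocks = small / (factor * factor)  # normalize on the small array first
    # repeat along rows, then along columns
    return np.repeat(np.repeat(blocks, factor, axis=0), factor, axis=1)

small = np.array([[16.0, 24.0],
                  [8.0, 16.0]])
big = upscale_flux_conserving(small, 2)
print(big)
print(small.sum(), big.sum())  # the two totals should match
```

The total of the array is unchanged by the upscaling, which is what "flux conserving" means here.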

The fastest implementation is a for loop accelerated by numba's jit & prange. I'd like to better leverage Numpy's pre-compiled functions to get this job done. I'll also entertain Scipy stuff - but not its resizing functions.

It seems like a perfect problem for strong matrix manipulation functions, but I just haven't managed to make it happen quickly.

Additionally, the single-line numpy call is way funky, so don't be surprised. But it's what it took to get it to align correctly.

Code examples:

Check the more optimized calls below. Be warned: the case I have here makes a 20480x20480 float64 array, which can take up a fair bit of memory, but it can show whether a method is too memory intensive (as matrix methods can be).
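
To put a number on that memory warning, here is a minimal sketch that computes the size of the output array alone. The variable names are illustrative, and any temporaries a given method creates come on top of this figure:

```python
import numpy as np

in_side = 2048   # side length of the input array in the benchmark
factor = 10      # upscaling factor used in the benchmark
out_side = in_side * factor

itemsize = np.dtype(np.float64).itemsize   # bytes per float64 value
out_bytes = out_side * out_side * itemsize
print(f"output array: {out_bytes / 2**30:.3f} GiB")
```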

Environment: Python 3, Windows, i5-4960K @ 4.5 GHz. Time to run for loop code is ~18.9 sec, time to run numpy code is ~52.5 sec on the shown examples.

% MAIN: To run these

import timeit

timeitSetup = ''' 
from Regridder1 import Regridder1
import numpy as np

factor = 10;

inArrayX = np.float64(np.arange(0,2048,1));
inArrayY = np.float64(np.arange(0,2048,1));
[inArray, _] = np.meshgrid(inArrayX,inArrayY);
''';

print("Time to run 1: {}".format( timeit.timeit(setup=timeitSetup,stmt="Regridder1(inArray, factor,)", number = 10) ));

timeitSetup = ''' 
from Regridder2 import Regridder2
import numpy as np

factor = 10;

inArrayX = np.float64(np.arange(0,2048,1));
inArrayY = np.float64(np.arange(0,2048,1));
[inArray, _] = np.meshgrid(inArrayX,inArrayY);
''';

print("Time to run 2: {}".format( timeit.timeit(setup=timeitSetup,stmt="Regridder2(inArray, factor,)", number = 10) ));

% FUN: Regridder 1 - for loop

import numpy as np
from numba import prange, jit

@jit(nogil=True)
def Regridder1(inArray,factor):
    inSize = np.shape(inArray);
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))];

    outBlockSize = factor*factor; #the block size where 1 inArray pixel is spread across # outArray pixels
    outArray = np.zeros(outSize); #preallocate
    outBlocks = inArray/outBlockSize; #precalc the resized blocks to go faster
    for i in prange(0,inSize[0]):
        for j in prange(0,inSize[1]):
            outArray[i*factor:(i*factor+factor),j*factor:(j*factor+factor)] = outBlocks[i,j]; #puts normalized value in a bunch of places

    return outArray;

% FUN: Regridder 2 - numpy

import numpy as np

def Regridder2(inArray,factor):
    inSize = np.shape(inArray);
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))];

    outBlockSize = factor*factor; #the block size where 1 inArray pixel is spread across # outArray pixels

    outArray = inArray.repeat(factor).reshape(inSize[0],factor*inSize[1]).T.repeat(factor).reshape(inSize[0]*factor,inSize[1]*factor).T/outBlockSize;

    return outArray;

I would greatly appreciate insight into speeding this up. Hopefully the code is good; I wrote it directly in the text box.

Current best solution:

On my computer, the numba jit for-loop implementation (Regridder1), with jit applied only to the part that needs it, runs the timeit test in 18.0 sec, while the numpy-only implementation (Regridder2) runs it in 18.5 sec. The bonus is that the numpy-only implementation doesn't need to wait for jit to compile the code on the first call. Jit's cache=True avoids recompiling on subsequent runs. The other options (nogil, nopython, prange) don't seem to help, but they don't seem to hurt either. Maybe future numba updates will make better use of them.

For simplicity and portability, Regridder2 is the best option. It's nearly as fast, and doesn't need numba installed (which for my Anaconda install required me to go install it) - so it'll help portability.

% FUN: Regridder 1 - for loop

import numpy as np

def Regridder1(inArray,factor):
    inSize = np.shape(inArray);
    outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))];

    outBlockSize = factor*factor #the block size where 1 inArray pixel is spread across # outArray pixels
    outArray = np.empty(outSize) #preallocate
    outBlocks = inArray/outBlockSize #precalc the resized blocks to go faster
    factor = np.int64(factor) #convert to an integer to be safe (in case it's a 1.0 float)

    outArray = RegridderUpscale(inSize, factor, outArray, outBlocks) #call a function that has just the loop

    return outArray;
#END def Regridder1

from numba import jit, prange
@jit(nogil=True, nopython=True, cache=True) #nopython=True, nogil=True, parallel=True, cache=True
def RegridderUpscale(inSize, factor, outArray, outBlocks ):
    for i in prange(0,inSize[0]):
        for j in prange(0,inSize[1]):
            outArray[i*factor:(i*factor+factor),j*factor:(j*factor+factor)] = outBlocks[i,j];
        #END for j
    #END for i
    #scales the original data up, note for other languages you need i*factor+factor-1 because slicing
    return outArray; #return success
#END def RegridderUpscale

% FUN: Regridder 2 - numpy based on @ZisIsNotZis's answer

import numpy as np

def Regridder2(inArray,factor):
    inSize = np.shape(inArray);
    #outSize = [np.int64(np.round(inSize[0] * factor)), np.int64(np.round(inSize[1] * factor))]; #whoops

    outBlockSize = factor*factor; #the block size where 1 inArray pixel is spread across # outArray pixels

    outArray = np.broadcast_to( inArray[:,None,:,None]/outBlockSize, (inSize[0], factor, inSize[1], factor)).reshape(np.int64(factor*inSize[0]), np.int64(factor*inSize[1])); #single line call that gets the job done

    return outArray;
#END def Regridder2
user2403531
  • Is the `[4,4]` array example the desired output?, or... i am confused? – U13-Forward Nov 16 '18 at 03:20
  • It could be an output (I included it as a visual example of what type of scaling is desired), but the 2048 -> 20480 in the code shows real world speed limitations much better. – user2403531 Nov 16 '18 at 03:48
  • It's a teensy bit faster to do the division first, before calling `repeat` in Regridder2 (as you already did in Regridder1). i.e. `outArray = (inArray/outBlockSize).repeat(...)...` – unutbu Nov 16 '18 at 03:56
  • No need to compute `outSize` in Regridder2. – unutbu Nov 16 '18 at 04:00
  • `outArray = ((inArray/outBlockSize).repeat(outBlockSize).reshape(inSize[0],inSize[1],factor,factor).swapaxes(1,2).reshape(inSize[0]*factor,inSize[1]*factor))` is a marginally faster way to compute `outArray` in Regridder2, but nowhere near as fast as Regridder1. – unutbu Nov 16 '18 at 04:01
  • @unutbu I don't think a single `repeat` is possible, because it has to repeat both horizontally (with stride 1) and vertically (with stride N), while a single `repeat` call will only repeat in one direction. For example, `[1,2,3,4]` will become `[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4]` or `[1,2,1,2,1,2,1,2,3,4,3,4,3,4,3,4]` instead of `[1,1,2,2,1,1,2,2,3,3,4,4,3,3,4,4]` – ZisIsNotZis Nov 16 '18 at 04:19
  • @ZisIsNotZis: The [`reshape/swapaxes/reshape` idiom](https://stackoverflow.com/a/16858283/190597) places the repeated values in the desired "blocks". – unutbu Nov 16 '18 at 04:24
  • @unutbu Oh I got it. The `reshape` will copy data to the correct position. – ZisIsNotZis Nov 16 '18 at 04:30
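
The `reshape/swapaxes/reshape` idiom discussed in these comments can be traced step by step on a tiny array. This is only an illustrative sketch with made-up variable names, using a 2x2 integer input and a factor of 2:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
factor = 2
rows, cols = a.shape

flat = a.repeat(factor * factor)                   # 1-D: each value repeated factor**2 times
blocks = flat.reshape(rows, cols, factor, factor)  # blocks[i, j] is a factor x factor tile of a[i, j]
tiled = blocks.swapaxes(1, 2)                      # axes become (row, row-in-tile, col, col-in-tile)
out = tiled.reshape(rows * factor, cols * factor)  # merging the axis pairs lays the tiles out in 2-D
print(out)
```

The single `repeat` only produces the right number of copies; it is the axis swap before the final `reshape` that moves them into square blocks.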

2 Answers


I did some benchmarks about this using a 512x512 byte image (10x upscale):

a = np.empty((512, 512), 'B')

Repeat Twice

>>> %timeit a.repeat(10, 0).repeat(10, 1)
127 ms ± 979 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Repeat Once + Reshape

>>> %timeit a.repeat(100).reshape(512, 512, 10, 10).swapaxes(1, 2).reshape(5120, 5120)
150 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Both methods above copy the data twice, while both methods below copy it only once.

Fancy Indexing

Since t can be repeatedly used (and pre-computed), it is not timed.

>>> t = np.arange(512).repeat(10)  # default integer dtype; 'B' (uint8) cannot hold indices above 255
>>> %timeit a[t[:,None], t]
143 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Viewing + Reshape

>>> %timeit np.broadcast_to(a[:,None,:,None], (512, 10, 512, 10)).reshape(5120, 5120)
29.6 ms ± 2.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

It seems that viewing + reshape wins (at least on my machine). The results on a 2048x2048 byte image follow, in the same order as the four methods above; viewing + reshape still wins:

2.04 s ± 31.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.4 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.3 s ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
424 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

while the result for 2048x2048 float64 image is

3.14 s ± 20.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.07 s ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.56 s ± 64.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.8 s ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Although the itemsize is 8 times larger, this didn't take much more time.
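
To illustrate why the viewing + reshape approach copies only once, here is a small sketch (array size and names are illustrative). `np.broadcast_to` returns a view whose inserted axes have stride 0, so nothing is copied until the final `reshape`, which cannot be expressed as a view and therefore allocates the output:

```python
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)  # small non-square input
factor = 4

view = np.broadcast_to(a[:, None, :, None], (2, factor, 3, factor))
print(view.strides)               # the two inserted axes have stride 0
print(np.shares_memory(view, a))  # the view is still backed by a's buffer

out = view.reshape(2 * factor, 3 * factor)  # the one and only copy happens here
print(out.shape)
print(np.shares_memory(out, a))
```

Because the shape is built from both input dimensions separately, the same expression also works for non-square arrays.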

ZisIsNotZis
  • Viewing + Reshape was indeed fast - fast enough to meet jit (at 2048->20480 jit for loop is 18.0 sec and this is 18.5 sec on my comp)! It also had the bonus that it can handle non-square arrays, while my .repeat.repeat function doesn't. It's only a hair slower, but it removes the need to rely on numba's at-times finicky jit. Thank you for the insight! – user2403531 Nov 19 '18 at 19:43

Some new functions which show that the order of operations is important:

import numpy as np
from numba import jit

A=np.random.rand(2048,2048)

@jit
def reg1(A,factor):
    factor2=factor**2
    a,b = [factor*s for s in A.shape]
    B=np.empty((a,b),A.dtype)
    Bf=B.ravel()
    k=0
    for i in range(A.shape[0]):
        Ai=A[i]
        for _ in range(factor):
            for j in range(A.shape[1]):
                x=Ai[j]/factor2
                for _ in range(factor):
                    Bf[k]=x
                    k += 1
    return B   

def reg2(A,factor):
    return np.repeat(np.repeat(A/factor**2,factor,0),factor,1)

def reg3(A,factor):
    return np.repeat(np.repeat(A/factor**2,factor,1),factor,0)

def reg4(A,factor):
    shx,shy=A.shape
    # zero-stride view with two extra axes; the final reshape makes the single copy
    B=np.broadcast_to((A/factor**2).reshape(shx,1,shy,1),
                      shape=(shx,factor,shy,factor))
    return B.reshape(shx*factor,shy*factor)
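
reg2 and reg3 differ only in which axis is repeated first. A minimal sketch on a small, arbitrary test array checks that the two orders produce identical output, so any difference between them is purely one of speed:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 5))  # small non-square test array
factor = 4

rows_first = np.repeat(np.repeat(A / factor**2, factor, 0), factor, 1)  # axis order used by reg2
cols_first = np.repeat(np.repeat(A / factor**2, factor, 1), factor, 0)  # axis order used by reg3

print(np.array_equal(rows_first, cols_first))  # same values either way
print(np.isclose(rows_first.sum(), A.sum()))   # total is preserved
```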

And the timings:

In [47]: %timeit _=Regridder1(A,5)
672 ms ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [48]: %timeit _=reg1(A,5)
522 ms ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [49]: %timeit _=reg2(A,5)
1.23 s ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [50]: %timeit _=reg3(A,5)
782 ms ± 21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [51]: %timeit _=reg4(A,5)
860 ms ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
B. M.
  • Wow, repeating horizontally first is much faster than repeating vertically first, similar to reshaping the `view`. I guess that's because the first `repeat` takes negligible time compared to the second `repeat`. If the second `repeat` is done "vertically", numpy is basically copying large contiguous memory (i.e. a 20480-length row) multiple times, while if it is done "horizontally", numpy has to copy contiguous 2048-length rows repeatedly (CPUs don't natively support non-contiguous arrays) – ZisIsNotZis Nov 19 '18 at 02:26