
I have a very large 3D array:

import numpy as np

large = np.zeros((2000, 1500, 700))

Actually, `large` is an image, but each coordinate holds 700 values. I also have 400 bounding boxes. The bounding boxes do not have a fixed shape; I store a tuple of lower- and upper-bound coordinates for each box as follows:

boxes_y = [(y_lower0, y_upper0), (y_lower1, y_upper1), ..., (y_lower399, y_upper399)]
boxes_x = [(x_lower0, x_upper0), (x_lower1, x_upper1), ..., (x_lower399, x_upper399)]

Then, for each box, I want to fill the corresponding region of the `large` array with a vector of size 700. Specifically, I have an embedding for each box:

embeddings = np.random.rand(400, 700) # In real case, these are not random. Just consider the shape

What I want to do is

for i in range(400):
    large[boxes_y[i][0]:boxes_y[i][1], boxes_x[i][0]:boxes_x[i][1]] = embeddings[i]

This works, but it is too slow for such a large `large` array. I am looking to vectorize this computation.

Shadovx
  • Would it be ok to switch to a smaller dtype? `np.uint8`, `np.int16` or at least `np.float32`? – dankal444 Dec 02 '21 at 22:30
  • Yeah, I can switch to `np.float32`. Thanks... Apart from that, I think there is no way to vectorize this. The biggest obstacle is, I suppose, that the shapes of the bounding boxes are not fixed. I think there are some methods for image labelling and bounding boxes in `scipy`, but I have not investigated them in depth yet. – Shadovx Dec 03 '21 at 08:32
  • You could try to first determine for each pixel which bounding box should land on it -> a (2000, 1500) array, then use this to vectorize the whole thing. I am afraid this might not help much in terms of speed - I would try it though. – dankal444 Dec 03 '21 at 09:02

1 Answer


One big problem is that the input array is really huge (~15.6 GiB for float64 items). Another is that it is traversed up to 400 times in the worst case (resulting in up to 6240 GiB being written to RAM), because overlapping regions are written multiple times.
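
For reference, here is the back-of-the-envelope arithmetic behind these figures (a quick sanity check, assuming 8-byte float64 items):

n_y, n_x, depth = 2000, 1500, 700
item_size = 8                                  # bytes per float64 item
array_size = n_y * n_x * depth * item_size     # 16.8e9 bytes
print(array_size / 1024**3)                    # ~15.6 GiB
print(400 * array_size / 1024**3)              # ~6250 GiB in the worst case (every box covering the whole image)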

A better solution is to iterate over the first two dimensions (those of the "image") to find which bounding box must be copied for each pixel, as proposed by @dankal444. This is similar to what Z-buffer-based algorithms do in computer graphics.
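
A minimal NumPy-only sketch of that per-pixel box-id idea (illustrative only; the function name `fill_with_box_ids` is made up here, and the fancy indexing in the second pass builds a possibly large temporary array, which is one reason the Numba version below is preferable):

import numpy as np

def fill_with_box_ids(large, embeddings, boxes_y, boxes_x):
    # Pass 1: for each pixel, remember the id of the last box covering it (-1 = no box)
    box_ids = np.full(large.shape[:2], -1, dtype=np.int16)
    for i, ((y0, y1), (x0, x1)) in enumerate(zip(boxes_y, boxes_x)):
        box_ids[y0:y1, x0:x1] = i
    # Pass 2: write each 700-value embedding exactly once per pixel
    mask = box_ids >= 0
    large[mask] = embeddings[box_ids[mask]]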

Based on this, an even better solution is to use a scanline-rendering algorithm. In your case, the algorithm is much simpler than the traditional one since you are working with bounding boxes and not complex polygons. For each scanline (2000 of them here), you can quickly filter the bounding boxes that intersect the scanline and then iterate over them. The classical algorithm is a bit too complex for your simple case: for each scanline, iterating over the filtered bounding boxes and overwriting their indices for each covered pixel is enough. This operation can be done in parallel using Numba. It is very fast because the computation is mainly performed in the CPU cache.

The final operation is to perform the actual data writes based on the previously computed indices (still using Numba in parallel). This operation is still memory bound, but the output array is only written once (only 15.6 GiB of RAM will be written in the worst case, and 7.8 GiB for float32 items). This should take a fraction of a second on most machines. If this is not enough, you could try to use dedicated GPUs since GPU RAM is often significantly faster than main RAM (typically about an order of magnitude faster).

Here is the implementation:

import numpy as np
import numba as nb

# Assume the last dimension of `large` and `embeddings` is contiguous in memory
@nb.njit('void(float32[:,:,::1], float32[:,::1], int_[:,::1], int_[:,::1])', parallel=True)
def fastFill(large, embeddings, boxes_y, boxes_x):
    n, m, l = large.shape
    boxCount = embeddings.shape[0]
    assert embeddings.shape == (boxCount, l)
    assert boxes_y.shape == (boxCount, 2)
    assert boxes_x.shape == (boxCount, 2)
    imageBoxIds = np.full((n, m), -1, dtype=np.int16)
    for y in nb.prange(n):
        # Filtering -- a sort is not required since the number of bounding boxes is small
        boxIds = np.where((boxes_y[:,0] <= y) & (y < boxes_y[:,1]))[0]
        for k in boxIds:
            lower, upper = boxes_x[k]
            imageBoxIds[y, lower:upper] = k
    # Actual filling
    for y in nb.prange(n):
        for x in range(m):
            boxId = imageBoxIds[y, x]
            if boxId >= 0:
                large[y, x, :] = embeddings[boxId]

Here is the benchmark:

large = np.zeros((1000, 750, 700), dtype=np.float32)  # 8 times smaller in memory
boxes_y = np.cumsum(np.random.randint(0, large.shape[0]//2, size=(400, 2)), axis=1)
boxes_x = np.cumsum(np.random.randint(0, large.shape[1]//2, size=(400, 2)), axis=1)
embeddings = np.random.rand(400, 700).astype(np.float32)

# Called many times
for i in range(400):
    large[boxes_y[i][0]:boxes_y[i][1], boxes_x[i][0]:boxes_x[i][1]] = embeddings[i]

# Called many times
fastFill(large, embeddings, boxes_y, boxes_x)
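
For completeness, here is one way the two variants could be timed (a small sketch, assuming the snippets above have already been executed; `bench` and `numpy_fill` are helpers defined only for this sketch):

import time

def bench(fn, repeat=3):
    # Return the best wall-clock time over several runs
    best = float('inf')
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def numpy_fill():
    for i in range(400):
        large[boxes_y[i][0]:boxes_y[i][1], boxes_x[i][0]:boxes_x[i][1]] = embeddings[i]

fastFill(large, embeddings, boxes_y, boxes_x)   # warm-up run (pre-touches the zero-mapped pages of `large`)
print('NumPy loop:', bench(numpy_fill))
print('Numba     :', bench(lambda: fastFill(large, embeddings, boxes_y, boxes_x)))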

Here are results on my machine:

Initial code:        2.71 s
Numba (sequential):  0.13 s
Numba (parallel):    0.12 s   (x22 times faster than the initial code)

Note that the first run is slower because of virtual zero-mapped memory (the pages of the freshly allocated `large` array are only physically mapped when first written). The Numba version is still about 10 times faster in this case.

Jérôme Richard
  • I wonder if there would still be any gain if there were no overlapping regions. I guess those speed comparisons depend a lot on the degree of overlap of the boxes. – dankal444 Dec 06 '21 at 17:56
  • Yes, the Numpy algorithm is slow because it writes different values at the same location several times. Without overlap, the Numpy code should be very competitive with the provided solution, since the Numpy assignment used in the original code is very optimized internally and the cost of the Python loop should be very small here. In fact, Numpy can generate faster code on some machines since it can make better use of the memory than the current Numba code. However, the Numba code can be faster on some platforms thanks to parallelism (1 core is not always enough to saturate the RAM bandwidth). – Jérôme Richard Dec 06 '21 at 22:33