Is it possible for a dilation or erosion algorithm to not use a separate array?

Question

I have an assignment where I am to write an inline x86 assembly code for a dilation and an erosion function. My problem is that we are not given a separate array and can't touch the non-asm part of the program. So I need to find a way to alter the original image without copying it elsewhere. But if I do so, the process gets compromised because the later pixels take into account the altered values of their neighbouring pixels instead of their original ones and the entire image becomes black or white.

Below is one of the functions. I am putting it so the restrictions I mentioned are clear, not because I expect someone to write it for me. I cannot initialize a separate array inside the asm block and I'm not given one by the rest of the code either.

void dilation(int image_size, int filter_size, short* image_org) {
    __asm {
        MOV EBX, image_org
        //I can only write code in here
    }
}

Edit: I think I'm supposed to push the new value of each pixel to the stack and only after I finish traversing through the whole image place them back inside the array.

You could call `malloc` to allocate a new array. It should succeed to cache a single line of the image's original pixels for use with the next one. — fuz, Jan 14 '23 at 23:40
@fuz is there an asm equivalent of `malloc`? Because Im not allowed to write outside the `__asm{}` block. — Fatih Fatih, Jan 14 '23 at 23:44
You can push the number of bytes to allocate, then `call _malloc` and retrieve a pointer to the allocated memory region from eax. Don't forget to popo the argument back off the stack afterwards. — fuz, Jan 14 '23 at 23:59
You don't need to push pixels to stack one by one, you can allocate space on stack for whole image at once. Actually you don't need to keep whole image, just a stripe of pixels which overlap with kernel of filter. — dimich, Jan 15 '23 at 03:21
I think your instructor needs to clarify this assignment. **in-place** morphology isn't straight-forward, if possible at all. you really should ask the instructor if you can use `malloc()` -- why do you have only one dimension for `image_size`? is this a 1D signal, or a square 2D image? in the 1D case, you can calculate erosion/dilation in-place using one forward and one backward pass. -- "use the stack"? that sounds like a terrible idea. stacks have fairly limited size. only very small pictures would fit on the stack. — Christoph Rackwitz, Jan 15 '23 at 13:44
@ChristophRackwitz the value of each pixel is stored in a one dimensional array. The test image we use is 512x512 pixels which i assume is fairly small because when i tried pushing the whole image to the stack, it actually worked. — Fatih Fatih, Jan 15 '23 at 13:49
What is the meaning of this sentence: "and the entire image becomes **becomes** or white."? — Sep Roland, Jan 15 '23 at 14:48

score 3 · Answer 1 · 2023-01-15T16:31:52.733

You can easily do this for a 3x1 dilation, by keeping the few pixels that you will modify.

Assume the three consecutive pixel values A, B, C. Assume that you already modified A, and are busy dilating for B (i.e. you will replace B by max(A, B, C)). You can use the sequence of operations

SavedA= SavedB
SavedB= B
B= max(SavedA, SavedB, C)

in a loop from left to right. (I leave you the processing of the first and last pixels in the row as an exrecise.)

You can generalize this to 1x3 dilation, and 3x3 dilation by successive application of 3x1 and 1x3. Similar for erosions.

For (2n+1)x1, repeat n times.

This works for grayscale images as well. You can also adapt for packed binary formats.

huseyin tugrul buyukisik · Answer 2 · 2023-01-15T16:03:01.403

If input image is monochrome, you can use extra few values to mark pixels with extra meaning, then process the pixels, then clear the marks before exit.

If pixel data is 16 bits and if you use only 1 bit for monochrome, the remaining 15 bits can easily hold all the values like "to be deleted", "to be painted". It would be just about OR operation on the pixel bits and values 2 or 4.

Otherwise you have to change the pixel values before neighbor pixels need them which adds a bias to the algorithm.

If AVX vectors are allowed (on 32 bit environments, only half of number of registers available (see Peter Cordes' comment below)) and pixels are fully colored, you could store 16 pixels (each 16bits) as a tile of 4x4 size on a single AVX register. If AVX instruction set allows use of 16 registers, then it lets 16x16 tile be computed in single CPU core. To compute whole 512x512 image without using any array on RAM, you'd require at least 1024 cores (or threads) in-flight. If threads do not count as an array for the problem, it should work. But the trick would be computation should not write results until all computations are completed. Once all cores write their own tiles, it should be synchronized as complete. But extracting/altering lanes of AVX registers would cost extra performance.

If AVX512 allowed (32bit mode = 1/4 total registers available, though), that would make 32x32 sized tile per core or need just 256 cores/threads.

I don't know if you are also allowed to push values to FPU for extra space (maybe not much but still could hold some border pixels between 2 tiles or something else).

If image is monochrome again, you can apply a RLE to compress it. Once compressed, it should cost much less area in the input array. Then you can use the rest as an output-RLE. On final step, do decompression from there to whole input again. Run Length Encoding is fast but decompressing for every pixel-read will be slow or at least must be too much book-keeping. Still, performance is not the priority here right? Maybe complexity can be reduced by having 1 new RLE per row. Didn't try.

MSVC inline asm implies 32-bit mode, so only 8 vector registers available (AVX2 or AVX-512). Although they didn't actually *say* MSVC; ICC and `clang-cl` or `clang -fasm-blocks` can compile the same `__asm { MOV EBX, image_org }` syntax for 64-bit mode. But they're using a 32-bit register for a pointer, which indicates that they're currently planning to use 32-bit mode. (Since it's for an assignment, I'm guessing they're probably required to use MSVC.) — Peter Cordes, Jan 15 '23 at 15:45
If a scan-line processing pattern is taken, then number of cores would be only image width divided by tile size. So for 512x512 image and 4x4 tile, there should be only 128-256 cores instead of 1024. For AVX512 and 32x32 tiles, there would be only 16-32 cores. Still I don't know how to synchronize threads within asm block (guessing a manual call to mutex lock?) — huseyin tugrul buyukisik, Jan 15 '23 at 15:55
32-bit mode can still store 16x 16-bit pixels in one 32-byte YMM register and potentially use AVX2 instructions to do work in parallel. You put the "if 64bit" condition in the wrong place; it applies to how many vector registers you can use 8 vs. 16 or 32, not to whether they're usable at all. — Peter Cordes, Jan 15 '23 at 16:01

Is it possible for a dilation or erosion algorithm to not use a separate array?

2 Answers2