Resizing single 1 pixel wide bitmap strip - faster than this example? (for Raycaster algorithm)

Question

I am attaching the picture example and my current code.

My question is: can I make resizing/stretching/interpolating a single vertical bitmap strip faster than using another for loop?

The current code already looks fairly optimal:

For the current strip size on the screen, iterate from the start height to the end height. Get the corresponding pixel from the texture and write it to the output buffer. Add the step to get the next pixel.

Here is the essential part of my code:

inline void RC_Raycast_Walls()
{
    // casting a ray for every pixel column of the render width
    for (u_int16 rx = 0; rx < RC_render_width_i; ++rx)
    {
        // ..
        // traversing thru map of grid
        // finding intersecting point
        // calculating height of strip in screen
        // ..


        // step size for the next pixel in the texture
        float32 tex_step_y = RC_texture_size_f / (float32)pp_wall_height;

        // starting texture coordinate
        float32 tex_y = (float32)(pp_wall_start - RC_player_pitch - player_z_div_wall_distance - RC_render_height_d2_i + pp_wall_height_d2) * tex_step_y;

        // drawing walls into buffer <- ENTERING ANOTHER LOOP only for SINGLE STRIP
        for (int16 ry = pp_wall_start; ry < pp_wall_end; ++ry)
        {
            // cast the texture coordinate to integer, and mask with (texHeight - 1) in case of overflow
            u_int16 tex_y_safe = (u_int16)tex_y & RC_texture_size_m1_i;
            tex_y += tex_step_y;

            u_int32 texture_current_pixel = texture_pixels[RC_texture_size_i * tex_y_safe + tex_x];
            u_int32 output_pixel_index = rx + ry * RC_render_width_i;

            output_buffer[output_pixel_index] =
                                                (((texture_current_pixel >> 16 & 0x0ff) * intensity_value) >> 8) << 16 |
                                                (((texture_current_pixel >> 8 & 0x0ff) * intensity_value) >> 8) << 8 |
                                                (((texture_current_pixel & 0x0ff) * intensity_value) >> 8);
        }
    }
}

Maybe I could use a bigger step, like 2 instead of 1, but then every second line is empty, and adding another line of code to fill that empty space gives the same performance. I would not like to have doubled pixels, and I think interpolating between two of them would take even longer.

Thank you in advance!

P.S. It's based on the Lodev Raycaster algorithm: https://lodev.org/cgtutor/raycasting.html

Spektre · Accepted Answer · 2021-03-23T11:43:00.893

1. You do not need floats at all

You can use DDA on integers without multiplication and division. These days floating point is not as slow as it used to be, but your conversion between float and int might be ... See these QAs (both use this kind of DDA):
- DDA line with subpixel
- DDA based rendering routines
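
To illustrate the idea of an integer DDA for one strip, here is a minimal sketch. It is not the code from the linked QAs: the function name `stretch_column_dda` and all of its parameters are made up for this example, and it ignores clipping and the pitch offset that the question's `tex_y` start value handles. One division and one modulo are done once per strip; the inner loop only adds and compares.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: stretch one texture column (tex_size texels, spaced
   tex_stride elements apart) onto strip_height screen pixels (spaced
   dst_stride elements apart) using an integer DDA. */
static void stretch_column_dda(const uint32_t *tex_col, int tex_size, int tex_stride,
                               uint32_t *dst, int dst_stride, int strip_height)
{
    assert(tex_size > 0 && strip_height > 0);

    int step_whole = tex_size / strip_height; /* whole texels per screen pixel */
    int step_rem   = tex_size % strip_height; /* fractional part kept as a remainder */
    int src_step   = step_whole * tex_stride; /* source advance for the whole part */
    int src = 0;                              /* current source index */
    int out = 0;                              /* current destination index */
    int err = 0;                              /* accumulated remainder */

    for (int y = 0; y < strip_height; ++y) {
        dst[out] = tex_col[src];
        out += dst_stride;
        src += src_step;
        err += step_rem;
        if (err >= strip_height) {            /* remainder overflowed: one extra texel */
            err -= strip_height;
            src += tex_stride;
        }
    }
}
```

With a row-major texture as in the question, the call would look something like `stretch_column_dda(texture_pixels + tex_x, RC_texture_size_i, RC_texture_size_i, output_buffer + rx + pp_wall_start * RC_render_width_i, RC_render_width_i, pp_wall_height)`, assuming the strip is not clipped by the screen. Whether this beats the float version on a given CPU has to be measured.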

2. Use a LUT for applying intensity

It looks like each color channel c is 8 bit and the intensity i is fixed point in the range <0,1>, so you can precompute every combination into something like this:
```
// LUT[c][i] holds channel value c scaled by intensity i (8-bit fixed point)
u_int8 LUT[256][256];
for (int c=0;c<256;c++)
 for (int i=0;i<256;i++)
  LUT[c][i]=(u_int8)((c*i)>>8);
```

3. Use pointers or a union to access RGB channels instead of bit operations

My favorite is a union:

union color
   {
   u_int32 dd;    // 1x 32bit RGBA
   u_int16 dw[2]; // 2x 16bit
   u_int8 db[4];  // 4x 8bit (individual channels)
   };

4. Texture coordinates

Again, it looks like you are doing too many operations. For example, in [RC_texture_size_i * tex_y_safe + tex_x], if your texture size is 128 you can bit-shift left by 7 bits instead of multiplying. Yes, on modern CPUs this is not an issue, however the whole thing can be replaced by a simple LUT: you can remember a pointer to each horizontal scan line of the texture and rewrite the access to [tex_y_safe][tex_x].
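
A rough sketch of both variants, the shift and the scan line pointer table, could look like this. The 128x128 texture size and the names `TEX_SHIFT`, `tex_rows`, `tex_rows_init`, `texel_by_shift` and `texel_by_rows` are assumptions for the example and do not come from the question's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed power-of-two texture: 128x128, so a row index is y << 7. */
enum { TEX_SHIFT = 7, TEX_SIZE = 1 << TEX_SHIFT, TEX_MASK = TEX_SIZE - 1 };

/* One pointer per horizontal scan line of the texture. */
static const uint32_t *tex_rows[TEX_SIZE];

/* Fill the table once per texture, not once per strip. */
static void tex_rows_init(const uint32_t *pixels)
{
    assert(pixels != NULL);
    for (int y = 0; y < TEX_SIZE; ++y)
        tex_rows[y] = pixels + (y << TEX_SHIFT);
}

/* Variant A: replace the multiplication by a shift. */
static uint32_t texel_by_shift(const uint32_t *pixels, unsigned tx, unsigned ty)
{
    return pixels[((ty & TEX_MASK) << TEX_SHIFT) + (tx & TEX_MASK)];
}

/* Variant B: look up the scan line pointer, then index into the row. */
static uint32_t texel_by_rows(unsigned tx, unsigned ty)
{
    return tex_rows[ty & TEX_MASK][tx & TEX_MASK];
}
```

The table trades a multiply or shift for an extra memory read, so it is worth timing both variants on the target machine.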

So based on #2 and #3, rewrite your color computation to this:

color c;
c.dd=texture_current_pixel;
c.db[0]=LUT[c.db[0]][intensity_value];
c.db[1]=LUT[c.db[1]][intensity_value];
c.db[2]=LUT[c.db[2]][intensity_value];
output_buffer[output_pixel_index]=c.dd;

As you can see, it's just a bunch of memory transfers instead of multiple bit-shift, bit-mask and bit-or operations. You can also use a pointer to color instead of texture_current_pixel and output_buffer[output_pixel_index] to speed it up a little more.
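
Putting #2 and #3 together, a self-contained sketch might look like the following. The names `shade_lut`, `color_t`, `shade_pixel_lut` and `shade_pixel_shift` are invented for this example. It assumes pixels are laid out as 0x00RRGGBB (as the shifts in the question suggest) and a little-endian CPU, so bytes 0..2 of the union are B, G, R. On a big-endian CPU such as the 68k family, the color channels would sit in bytes 1..3 instead.

```c
#include <assert.h>
#include <stdint.h>

/* shade_lut[c][i] == (c * i) >> 8 for every channel value c and intensity i. */
static uint8_t shade_lut[256][256];

typedef union {
    uint32_t dd;      /* whole 32-bit pixel */
    uint8_t  db[4];   /* individual bytes; order depends on endianness */
} color_t;

/* Build the table once at start-up. */
static void shade_lut_init(void)
{
    for (int c = 0; c < 256; ++c)
        for (int i = 0; i < 256; ++i)
            shade_lut[c][i] = (uint8_t)((c * i) >> 8);
}

/* LUT + union version. ASSUMPTION: little-endian, so db[0..2] are B, G, R. */
static uint32_t shade_pixel_lut(uint32_t pixel, uint8_t intensity)
{
    color_t c;
    c.dd = pixel;
    c.db[0] = shade_lut[c.db[0]][intensity];
    c.db[1] = shade_lut[c.db[1]][intensity];
    c.db[2] = shade_lut[c.db[2]][intensity];
    return c.dd;      /* byte 3 is passed through unchanged */
}

/* The question's shift-and-mask formula, kept for comparison. */
static uint32_t shade_pixel_shift(uint32_t pixel, uint32_t intensity)
{
    return ((((pixel >> 16) & 0xFFu) * intensity) >> 8) << 16 |
           ((((pixel >>  8) & 0xFFu) * intensity) >> 8) <<  8 |
            (((pixel        & 0xFFu) * intensity) >> 8);
}
```

Both functions should agree on the lower 24 bits; the union version keeps the top byte of the source pixel while the shift version clears it. Since the intensity is constant along a strip, a table indexed as [intensity][channel] would keep all lookups of one strip inside a single 256-byte row, which may or may not help depending on the cache.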

And finally see this:

Ray Casting with different height size

Which is my version of the raycast using VCL.

Now, before changing anything, measure the performance you have now by measuring the time needed to render. Then, after each change in the code, measure whether it actually improved performance. If it did not, go back to the old version of the code, as predicting what is fast on today's platforms is sometimes hard.
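
A very small timing harness for this kind of before/after comparison could be sketched as below, using only the standard `clock()` function. The names `render_fn`, `average_frame_ms` and `placeholder_render` are hypothetical; `placeholder_render` only stands in for the real frame rendering call. Note that `clock()` reports CPU time, which is close enough to elapsed time for a single-threaded renderer.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*render_fn)(void);

/* Placeholder workload standing in for the real render call. */
static volatile unsigned placeholder_sink;
static void placeholder_render(void)
{
    for (unsigned i = 0; i < 100000u; ++i)
        placeholder_sink += i;
}

/* Render `frames` frames and return the average CPU time per frame in ms. */
static double average_frame_ms(render_fn render, int frames)
{
    assert(render != NULL && frames > 0);

    clock_t start = clock();
    for (int i = 0; i < frames; ++i)
        render();
    clock_t end = clock();

    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC / (double)frames;
}
```

Averaging over a few hundred frames of the same scene, for example `average_frame_ms(placeholder_render, 200)` with the real render function substituted, makes small differences between code versions easier to see.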

Also, for resizing, much better visual results are obtained by using mipmaps ... that usually eliminates the weird noise while moving.
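
As a sketch of what mipmaps mean here: each level is the previous one averaged down by 2x2 blocks, and a strip picks the smallest level that still has at least as many texels as the strip has pixels. The function names `mip_halve` and `mip_pick_level` and the selection rule are assumptions for this example, for square power-of-two textures with four 8-bit channels per pixel.

```c
#include <assert.h>
#include <stdint.h>

/* Build the next mip level: average each 2x2 block of a square texture.
   dst must hold (src_size / 2) * (src_size / 2) pixels. */
static void mip_halve(const uint32_t *src, int src_size, uint32_t *dst)
{
    assert(src_size >= 2 && (src_size & 1) == 0);
    int half = src_size / 2;

    for (int y = 0; y < half; ++y) {
        const uint32_t *row0 = src + (2 * y) * src_size;
        const uint32_t *row1 = row0 + src_size;
        for (int x = 0; x < half; ++x) {
            uint32_t p0 = row0[2 * x], p1 = row0[2 * x + 1];
            uint32_t p2 = row1[2 * x], p3 = row1[2 * x + 1];
            uint32_t out = 0;
            for (int s = 0; s < 32; s += 8) {   /* one 8-bit channel at a time */
                uint32_t sum = ((p0 >> s) & 0xFFu) + ((p1 >> s) & 0xFFu) +
                               ((p2 >> s) & 0xFFu) + ((p3 >> s) & 0xFFu);
                out |= ((sum + 2u) >> 2) << s;  /* rounded average */
            }
            dst[y * half + x] = out;
        }
    }
}

/* Pick the smallest level whose size is still >= the strip height
   (level 0 is the full-size texture). */
static int mip_pick_level(int tex_size, int strip_height)
{
    assert(tex_size >= 1 && strip_height >= 1);
    int level = 0;
    while ((tex_size >> (level + 1)) >= strip_height)
        ++level;
    return level;
}
```

The levels are built once when the texture is loaded, so the per-strip cost is only the level selection.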

Hi @Spektre, thanks for lots of information and tips. Yes, the speedup is significant!! I spent a whole day yesterday testing different approaches and already managed to figure out some of these tricks, mostly 1, 2 and 4, but in a slightly different way. (1) I converted tex_step_y and tex_y from float into fixed point, so instead of adding floats I am adding ints in the second loop. I tried to calculate everything based on fixed point, but instead of a speedup everything got a lot slower - I will also look at the DDA you linked. (2) Yes, precalculating the multiplied intensity values speeds things up a lot. — Mateusz, Mar 23 '21 at 12:11
(3) Big thanks for that idea, I didn't test it yet, but I wanted to optimize this bit-shifting. I hope it will speed things up a little. (4) I did it in a slightly different way, but I will also try your solution. THANKS AGAIN!!! — Mateusz, Mar 23 '21 at 12:12
@Mateusz on a modern CPU, direct branchless DDA on floats is faster than DDA on integers, however the conversion from float back to int is a problem as it requires quite a lot of operations. Fixed point is not faster because you still need to shift ... however some speedup can be obtained by using the union again, like for color, but usually not that much ... I usually use the DDA I linked as it's fast on both old and new platforms ... and completely ignore Bresenham as it has not been fast since the 386 — Spektre, Mar 23 '21 at 12:15
I wanted to try to port my raycaster to faster Amigas with m68080 Apollo CPUs and RTG. So far I get above 60fps in 320x240x32 and about 17fps in 640x480x32, all with textured walls, but so far the ceiling and floor are only shaded, without textures. I am focusing right now on optimizing this wall part. I would also like to somehow cast only every second ray to speed everything up, but I wouldn't like to duplicate pixels. This is an older video testing all textures on an Amiga 600 with a Vampire accelerator card, before optimizations: https://www.youtube.com/watch?v=i5R73fmAwkk — Mateusz, Mar 23 '21 at 13:15
There is much more potential in those cards, especially if you are using asm and its AMMX instructions, but I am sticking to C. I wonder how much I will be able to achieve.. so far so good :) — Mateusz, Mar 23 '21 at 14:00
