I would like to use ARM Neon to resize an 8-bit grey image by a factor of 4, from 1280x960 to 320x240.

As an example, I already have a resize by a factor of 2 from 640x480 to 320x240:

void divideimageby2(uint8_t * src, uint8_t * dest) {
    //src is 640 x 480
    //dst is 320 x 240
    int h;
    for (h = 0; h < 240; h++)
        resizeline2(src + 640 * (h * 2 + 0), src + 640 * (h * 2 + 1), dest + 320 * h);
}

void resizeline2(uint8_t * __restrict src1, uint8_t * __restrict src2, uint8_t * __restrict dest) {
    int w;
    for (w = 0; w < 640; w += 16) {
        uint16x8_t a = vpaddlq_u8(vld1q_u8(src1));
        uint16x8_t b = vpaddlq_u8(vld1q_u8(src2));
        uint16x8_t ab = vaddq_u16(a, b);
        vst1_u8(dest, vshrn_n_u16(ab, 2));
        src1 += 16;
        src2 += 16;
        dest += 8;
    }
}   

If I want to do something similar, what kind of Neon instructions could I use in resizeline4 to aggregate 4 lines?

void divideimageby4(uint8_t * src, uint8_t * dest) {
    //src is 1280 x 960
    //dst is 320 x 240
    int h;
    for (h = 0; h < 240; h++)
        resizeline4(src + 1280 * (h * 4 + 0), src + 1280 * (h * 4 + 1), src + 1280 * (h * 4 + 2), src + 1280 * (h * 4 + 3), dest + 320 * h);
}

void resizeline4(uint8_t * __restrict src1, uint8_t * __restrict src2, uint8_t * __restrict src3, uint8_t * __restrict src4, uint8_t * __restrict dest) {
    int w;
    for (w = 0; w < 1280; w += 16) {
        //What to put here?
        src1 += 16;
        src2 += 16;
        src3 += 16;
        src4 += 16;
        dest += 4;
    }
}   
gregoiregentil
  • How do you want to do this? You are reducing information. See: [What is the best image reduction algorithm](http://stackoverflow.com/questions/384991/what-is-the-best-image-downscaling-algorithm-quality-wise); the real answer is there is none as there are various criteria when [reducing information](http://en.wikipedia.org/wiki/Image_scaling). You can combine them. For instance a first pass integer average followed by a second pass bi-cubic will give almost as good quality as the full bi-cubic, but will be much faster. – artless noise Aug 11 '14 at 15:49
  • The key objective here is speed. I want this to be almost as fast as the Neon memcpy. – gregoiregentil Aug 12 '14 at 21:09
  • Just blank the screen then, that is fast :) You should at least evaluate different scaling algorithms before dedicating time to hand-tuned NEON. My point was that you can get near equivalent speed with **much** better image quality by using two filters. Skipping pixels/rows will be faster and give similar quality too; you are bandwidth constrained on the main image. NEON will make the CPU time non-dominant. I wish you luck in any case. – artless noise Aug 13 '14 at 16:40

1 Answer

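You should combine vpaddl (pairwise add long) with vpadal (pairwise add and accumulate long).

Load a 32x4 block of pixels (32 bytes from each of the 4 lines) into q registers line1a, line1b, ..., line4a, line4b, then: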

vpaddl.u8 line1a, line1a
vpaddl.u8 line1b, line1b
vpadal.u8 line1a, line2a
vpadal.u8 line1b, line2b
...
vpadal.u8 line1b, line4b
vpadd.u16 d0, line1alow, line1ahigh
vpadd.u16 d1, line1blow, line1bhigh
vrshrn.u16 d0, q0, #4
vst1.8 {d0}, [pDst]!
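
The sequence above could be sketched with NEON intrinsics roughly as follows. This is not part of the original answer, only an illustration under a few assumptions: the names `resizeline4_neon` and `resizeline4_ref` and the extra `width` parameter are hypothetical, the width is assumed to be a multiple of 32, and each iteration consumes 32 source pixels per line (not 16 as in the question's skeleton), so `dest` advances by 8. The scalar `resizeline4_ref` is included only as a reference to compare against. Both use a rounded average, matching `vrshrn` in the assembly; the question's factor-2 code truncates with `vshrn` instead.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Scalar reference (hypothetical helper): rounded average of each 4x4 block. */
void resizeline4_ref(const uint8_t *src1, const uint8_t *src2,
                     const uint8_t *src3, const uint8_t *src4,
                     uint8_t *dest, int width)
{
    assert(width % 4 == 0);
    for (int x = 0; x < width; x += 4) {
        unsigned sum = 0;
        for (int i = 0; i < 4; i++)
            sum += src1[x + i] + src2[x + i] + src3[x + i] + src4[x + i];
        *dest++ = (uint8_t)((sum + 8) >> 4);   /* max sum is 16 * 255 = 4080 */
    }
}

#if HAVE_NEON
/* Intrinsics sketch of the vpaddl/vpadal sequence; width must be a multiple of 32. */
void resizeline4_neon(const uint8_t *__restrict src1, const uint8_t *__restrict src2,
                      const uint8_t *__restrict src3, const uint8_t *__restrict src4,
                      uint8_t *__restrict dest, int width)
{
    assert(width % 32 == 0);
    for (int w = 0; w < width; w += 32) {
        /* Line 1: widen to 16 bits while adding horizontal pairs. */
        uint16x8_t a = vpaddlq_u8(vld1q_u8(src1));
        uint16x8_t b = vpaddlq_u8(vld1q_u8(src1 + 16));
        /* Lines 2..4: pairwise add and accumulate into the same lanes. */
        a = vpadalq_u8(a, vld1q_u8(src2));
        b = vpadalq_u8(b, vld1q_u8(src2 + 16));
        a = vpadalq_u8(a, vld1q_u8(src3));
        b = vpadalq_u8(b, vld1q_u8(src3 + 16));
        a = vpadalq_u8(a, vld1q_u8(src4));
        b = vpadalq_u8(b, vld1q_u8(src4 + 16));
        /* Each lane now holds a 2x4 sum; add adjacent lanes to get 4x4 sums. */
        uint16x4_t lo = vpadd_u16(vget_low_u16(a), vget_high_u16(a));
        uint16x4_t hi = vpadd_u16(vget_low_u16(b), vget_high_u16(b));
        /* Rounding shift right by 4 (divide by 16) and narrow to 8 bits. */
        vst1_u8(dest, vrshrn_n_u16(vcombine_u16(lo, hi), 4));
        src1 += 32;
        src2 += 32;
        src3 += 32;
        src4 += 32;
        dest += 8;
    }
}
#endif

/* src is 1280 x 960, dest is 320 x 240, as in the question. */
void divideimageby4(const uint8_t *src, uint8_t *dest)
{
    for (int h = 0; h < 240; h++) {
        const uint8_t *row = src + 1280 * (h * 4);
#if HAVE_NEON
        resizeline4_neon(row, row + 1280, row + 2560, row + 3840, dest + 320 * h, 1280);
#else
        resizeline4_ref(row, row + 1280, row + 2560, row + 3840, dest + 320 * h, 1280);
#endif
    }
}
```

Because 1280 is a multiple of 32, the inner loop needs no tail handling for this image size; other widths would need a scalar tail or padding. The 16-bit accumulators cannot overflow, since a 4x4 block sums to at most 4080.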

Jake 'Alquimista' LEE
  • This looks correct, if the reduction is done with integer average like his sample code. But most people will eventually find it is a little lacking. – artless noise Aug 11 '14 at 15:50
  • Thanks for the answer. I'm a little bit confused by your suggestion. Could you edit the answer and write something that is closer to some working code? – gregoiregentil Aug 12 '14 at 21:09