Optimise Gaussian Blur Filter

Question

I need to apply a Gaussian filter to a large source image. I have implemented the algorithm below and optimized it using NEON, which gave a significant performance gain, but it still needs to be faster to run in real time. Can someone suggest whether there is room for further improvement, especially in the NEON code? I feel my NEON code is not fully optimized and could process 16 pixels at a time. I am a beginner with NEON, so I could not write anything better; it would be very helpful if someone could provide improved code.

void BlurRow( src, dest, gaussian )
{
     process each pixel from src and calculate the destination pixel values r, g, b, a
     by calling ComputeFinalPixelvalue
}

void BlurImage( src, dest )
{
   for each row call BlurRow with Gaussian kernel gx
   transpose matrix
   for each row call BlurRow with Gaussian kernel gy
   transpose matrix
}
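
For reference, the pseudocode above could be fleshed out roughly as follows. This is only a minimal scalar sketch under several assumptions that are not part of the question: a single-channel 8-bit image stored row-major, an odd-length kernel of integer weights that sum to 256, and border pixels repeated at the edges. The names BlurRowScalar, Transpose and BlurImageScalar are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: blur one row of a single-channel 8-bit image.
// Assumes an odd-length kernel whose integer weights sum to 256.
void BlurRowScalar(const uint8_t* src, uint8_t* dest, int width,
                   const std::vector<uint16_t>& kernel)
{
    assert(kernel.size() % 2 == 1);
    const int radius = static_cast<int>(kernel.size() / 2);
    for (int x = 0; x < width; ++x) {
        uint32_t sum = 0;
        for (int k = -radius; k <= radius; ++k) {
            // Clamp the sample position so border pixels are repeated.
            const int sx = std::min(std::max(x + k, 0), width - 1);
            sum += static_cast<uint32_t>(src[sx]) * kernel[k + radius];
        }
        dest[x] = static_cast<uint8_t>((sum + 128) >> 8); // rounded divide by 256
    }
}

// Hypothetical helper: out-of-place transpose of a width x height image.
std::vector<uint8_t> Transpose(const std::vector<uint8_t>& src, int width, int height)
{
    std::vector<uint8_t> out(src.size());
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            out[static_cast<std::size_t>(x) * height + y] =
                src[static_cast<std::size_t>(y) * width + x];
    return out;
}

// Row pass, transpose, row pass again (acts on columns), transpose back.
// The same kernel is used for both directions in this sketch.
std::vector<uint8_t> BlurImageScalar(const std::vector<uint8_t>& src, int width,
                                     int height, const std::vector<uint16_t>& kernel)
{
    std::vector<uint8_t> rows(src.size());
    for (int y = 0; y < height; ++y)
        BlurRowScalar(&src[static_cast<std::size_t>(y) * width],
                      &rows[static_cast<std::size_t>(y) * width], width, kernel);

    const std::vector<uint8_t> t = Transpose(rows, width, height); // now height wide
    std::vector<uint8_t> cols(src.size());
    for (int x = 0; x < width; ++x)
        BlurRowScalar(&t[static_cast<std::size_t>(x) * height],
                      &cols[static_cast<std::size_t>(x) * height], height, kernel);

    return Transpose(cols, height, width);
}
```

A scalar version like this is slow, but it can serve as a baseline for timing the transposes separately from the row passes.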

void ComputeFinalPixelvalue(const uint32_t* sourcePixels, 
                            uint32_t pixelcount, uint16_t* pGaussElements, 
                            uint32_t& rvalue, uint32_t& gvalue, uint32_t& bvalue, uint32_t& avalue )
{
// initialize all vector lanes to 0
uint32x4_t sumOfChannelR_32x4 = vdupq_n_u32(0);
uint32x4_t sumOfChannelG_32x4 = vdupq_n_u32(0);
uint32x4_t sumOfChannelB_32x4 = vdupq_n_u32(0);
uint32x4_t sumOfChannelA_32x4 = vdupq_n_u32(0);

// widened Gaussian weights; unsigned to match vmovl_u16 and vmlaq_u32
uint32x4_t vGaussElement_32x4_low, vGaussElement_32x4_high;

for (uint32_t i = 0; i < pixelcount / 8; i++)
{
  // load interleaved 8 pixel at a time
  uint8x8x4_t SrcPixels8x8x4 = vld4_u8( reinterpret_cast< const unsigned char* >( sourcePixels ) );

  // load 8 GaussElement at a time
  uint16x8_t vGaussElement_16x8 = vld1q_u16(pGaussElements);

  vGaussElement_32x4_low = vmovl_u16(vget_low_u16(vGaussElement_16x8));
  vGaussElement_32x4_high = vmovl_u16(vget_high_u16(vGaussElement_16x8));

  // channel 0: accumulate into the R sum
  sumOfChannelR_32x4 = vmlaq_u32(sumOfChannelR_32x4, vmovl_u16(vget_low_u16(vmovl_u8(SrcPixels8x8x4.val[0]))), vGaussElement_32x4_low);
  sumOfChannelR_32x4 = vmlaq_u32(sumOfChannelR_32x4, vmovl_u16(vget_high_u16(vmovl_u8(SrcPixels8x8x4.val[0]))), vGaussElement_32x4_high);

  // channel 1: accumulate into the G sum
  sumOfChannelG_32x4 = vmlaq_u32(sumOfChannelG_32x4, vmovl_u16(vget_low_u16(vmovl_u8(SrcPixels8x8x4.val[1]))), vGaussElement_32x4_low);
  sumOfChannelG_32x4 = vmlaq_u32(sumOfChannelG_32x4, vmovl_u16(vget_high_u16(vmovl_u8(SrcPixels8x8x4.val[1]))), vGaussElement_32x4_high);

  // channel 2: accumulate into the B sum
  sumOfChannelB_32x4 = vmlaq_u32(sumOfChannelB_32x4, vmovl_u16(vget_low_u16(vmovl_u8(SrcPixels8x8x4.val[2]))), vGaussElement_32x4_low);
  sumOfChannelB_32x4 = vmlaq_u32(sumOfChannelB_32x4, vmovl_u16(vget_high_u16(vmovl_u8(SrcPixels8x8x4.val[2]))), vGaussElement_32x4_high);

  // channel 3: accumulate into the A sum
  sumOfChannelA_32x4 = vmlaq_u32(sumOfChannelA_32x4, vmovl_u16(vget_low_u16(vmovl_u8(SrcPixels8x8x4.val[3]))), vGaussElement_32x4_low);
  sumOfChannelA_32x4 = vmlaq_u32(sumOfChannelA_32x4, vmovl_u16(vget_high_u16(vmovl_u8(SrcPixels8x8x4.val[3]))), vGaussElement_32x4_high);

  sourcePixels = sourcePixels + 8;

  pGaussElements = pGaussElements + 8;
}

gvalue += vgetq_lane_u32(sumOfChannelG_32x4, 0) + vgetq_lane_u32(sumOfChannelG_32x4, 1) + vgetq_lane_u32(sumOfChannelG_32x4, 2) + vgetq_lane_u32(sumOfChannelG_32x4, 3);

// similarly calculate the others (rvalue, bvalue, avalue)
}
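
When changing the NEON code, it may help to compare its output against a plain scalar version. The following is a hypothetical reference, not part of the original implementation: ComputeFinalPixelvalueScalar is an assumed name, and it reads the four channels byte by byte in memory order, which is the same order vld4_u8 de-interleaves them into val[0] to val[3]. Unlike the NEON loop, which only handles whole groups of 8, this sketch processes every pixel, so the two should agree when pixelcount is a multiple of 8.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar reference for checking the NEON results.
// Adds the weighted channel sums to rvalue, gvalue, bvalue and avalue.
void ComputeFinalPixelvalueScalar(const uint32_t* sourcePixels, uint32_t pixelcount,
                                  const uint16_t* pGaussElements,
                                  uint32_t& rvalue, uint32_t& gvalue,
                                  uint32_t& bvalue, uint32_t& avalue)
{
    assert(pixelcount == 0 || (sourcePixels != nullptr && pGaussElements != nullptr));

    // View the pixels as bytes: 4 bytes per pixel, channels in memory order.
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(sourcePixels);

    for (uint32_t i = 0; i < pixelcount; ++i) {
        const uint32_t weight = pGaussElements[i];
        rvalue += bytes[4 * i + 0] * weight; // channel 0
        gvalue += bytes[4 * i + 1] * weight; // channel 1
        bvalue += bytes[4 * i + 2] * weight; // channel 2
        avalue += bytes[4 * i + 3] * weight; // channel 3
    }
}
```

Running both versions on the same row and comparing the four sums is a quick way to catch a wrong accumulator or lane.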

Transposing a matrix is a very slow operation, I suspect. Try timing the function with and without calling `BlurRow` and see how much time it's taking. — Mark Ransom, Feb 11 '16 at 23:20
It is possible to implement a Gaussian blur in linear time rather than quadratic time by using a series of box filters of varying sizes. Have a look here for insight: http://blog.ivank.net/fastest-gaussian-blur.html. The implementation there is in JavaScript, but it should be easy to translate to C. — rayryeng, Feb 11 '16 at 23:54
@rayryeng Yes, I understand that a Gaussian blur can be implemented in linear time, but that is only an approximation of a Gaussian blur. I actually want both options: the approximation as well as the exact Gaussian blur. So I was wondering whether the NEON code can be optimized further. — Bharat Ahuja, Feb 12 '16 at 13:46
@Mark Ransom Transposing actually helps, since otherwise accessing pixels column-wise for the y kernel is very expensive. I believe further optimization of the NEON code would give a large gain. — Bharat Ahuja, Feb 12 '16 at 13:49
I could have used the transpose algorithm from http://stackoverflow.com/questions/16737298/what-is-the-fastest-way-to-transpose-a-matrix-in-c, but I don't have that much extra memory on a mobile device. I believe further optimization of the NEON code would give a large gain. — Bharat Ahuja, Feb 12 '16 at 13:55
Have you tried the timing I suggested? Speeding up the Neon code isn't going to help if transposing is already taking 75% of the time. — Mark Ransom, Feb 12 '16 at 14:23
Yes, I did. As I said, transposing actually helps, since otherwise accessing pixels column-wise for the y kernel is very expensive. So, following your suggestion, rather than transposing I now calculate the destination pixels and store them column-wise, which speeds up the gy filter. — Bharat Ahuja, Feb 12 '16 at 14:32
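
The column-wise store described in the last comment can be sketched as a row pass that writes its output already transposed, so that no separate transpose step is needed. This is a minimal scalar illustration only, assuming a single-channel 8-bit image and an odd-length integer kernel whose weights sum to 256; BlurRowsTransposed is a hypothetical name and this is not the asker's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Blur every row of a width x height image and store the result transposed,
// i.e. the returned image is height wide and width tall.
// Assumes an odd-length kernel whose integer weights sum to 256.
std::vector<uint8_t> BlurRowsTransposed(const std::vector<uint8_t>& src, int width,
                                        int height, const std::vector<uint16_t>& kernel)
{
    assert(kernel.size() % 2 == 1);
    assert(src.size() == static_cast<std::size_t>(width) * height);

    const int radius = static_cast<int>(kernel.size() / 2);
    std::vector<uint8_t> dest(src.size());

    for (int y = 0; y < height; ++y) {
        const uint8_t* row = &src[static_cast<std::size_t>(y) * width];
        for (int x = 0; x < width; ++x) {
            uint32_t sum = 0;
            for (int k = -radius; k <= radius; ++k) {
                // Clamp the sample position so border pixels are repeated.
                const int sx = std::min(std::max(x + k, 0), width - 1);
                sum += static_cast<uint32_t>(row[sx]) * kernel[k + radius];
            }
            // Column-wise store: pixel (x, y) lands at (y, x) in the output.
            dest[static_cast<std::size_t>(x) * height + y] =
                static_cast<uint8_t>((sum + 128) >> 8);
        }
    }
    return dest;
}
```

Calling it twice, as in `BlurRowsTransposed(BlurRowsTransposed(src, width, height, gx), height, width, gy)`, applies the x kernel and then the y kernel and returns the image in its original orientation. The reads stay sequential while the writes are strided, so whether this beats an explicit transpose would still need to be measured.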
