
Hi,
With respect to my earlier post, I resolved the comparison operation in SSE.
But after getting the output I observed that my output is in floating point, while my expected output is uchar.
For example, I am expecting 8 as output, but it comes out as 8.0 (in 32-bit floating-point format). After converting those values to 1-byte unsigned, the result is far from 8.
Please find below my original code in C and its corresponding code in SSE:

C code :

unsigned char *destination_buff = (unsigned char *)malloc(sizeof(unsigned char)*height*width);
float *d1 = inputbuffer;
float *d2 = d1 + width;
float *d3 = d2 + width;

for(int i=1;i<height;i++)
{
    for(int j=1;j<width;j++)
    {
       int val = d2[j];
       int temp1 = 0x00FF;
       int temp2 = 0;   
       if(val <= d1[j-1]) temp2 += 0x80;
       if(val <= d1[j])   temp2 += 0x40;
       if(val <= d1[j+1]) temp2 += 0x20;
       if(val <= d2[j-1]) temp2 += 0x10;
       if(val <= d2[j+1]) temp2 += 0x08;
       if(val <= d3[j-1]) temp2 += 0x04;
       if(val <= d3[j])   temp2 += 0x02;
       if(val <= d3[j+1]) temp2 ++;    
       temp1 &= (~temp2);
       destination_buff[j-1] = temp1;       
    }
        d1 += width;
        d2 += width;
        d3 += width;

        destination_buff += (width);
}   

Here is my SSE code :

float *destination_buff = (float *)malloc(sizeof(float)*height*width);

uchar *dst_d = outputbuffer; //Pointer to the destination buffer which is already present and need to fill the output data in this
float *CT_image_0 = m_dat;
float *CT_image_1 = CT_image_0 + width;
float *CT_image_2 = CT_image_1 + width;

for(int i=1;i<height;++i)
{
    for(int j=1;j<width;j+=4)
    {
      __m128 CT_current_00 = _mm_loadu_ps((CT_image_0+j-1));
      __m128 CT_current_10 = _mm_loadu_ps((CT_image_1+j-1));
      __m128 CT_current_20 = _mm_loadu_ps((CT_image_2+j-1));

      __m128 CT_current_01 = _mm_loadu_ps(((CT_image_0+1)+j-1));
      __m128 CT_current_11 = _mm_loadu_ps(((CT_image_1+1)+j-1));
      __m128 CT_current_21 = _mm_loadu_ps(((CT_image_2+1)+j-1));

      __m128 CT_current_02 = _mm_loadu_ps(((CT_image_0+2)+j-1));
      __m128 CT_current_12 = _mm_loadu_ps(((CT_image_1+2)+j-1));
      __m128 CT_current_22 = _mm_loadu_ps(((CT_image_2+2)+j-1));

      __m128 val    =  CT_current_11;

      __m128 t1 = _mm_set1_ps(0x80);
      __m128 t2 = _mm_set1_ps(0x40);
      __m128 t3 = _mm_set1_ps(0x20);
      __m128 t4 = _mm_set1_ps(0x10);
      __m128 t5 = _mm_set1_ps(0x08);
      __m128 t6 = _mm_set1_ps(0x04);
      __m128 t7 = _mm_set1_ps(0x02);
      __m128 t8 = _mm_set1_ps(0x01);

      __m128 out = _mm_setzero_ps();                 // init output flags to all zeroes


      __m128 sample = _mm_cmple_ps(val,CT_current_00);
             sample = _mm_and_ps(sample,t1);
               out  = _mm_or_ps(out,sample);
             sample = _mm_cmple_ps(val,CT_current_01);
             sample = _mm_and_ps(sample,t2);
               out  = _mm_or_ps(out,sample);
            sample = _mm_cmple_ps(val,CT_current_02);
            sample = _mm_and_ps(sample,t3);
              out  = _mm_or_ps(out,sample);

            sample = _mm_cmple_ps(val,CT_current_10);
            sample = _mm_and_ps(sample,t4);
              out  = _mm_or_ps(out,sample);
            sample = _mm_cmple_ps(val,CT_current_12);
            sample = _mm_and_ps(sample,t5);
              out  = _mm_or_ps(out,sample);

            sample = _mm_cmple_ps(val,CT_current_20);
            sample = _mm_and_ps(sample,t6);
              out  = _mm_or_ps(out,sample);
            sample = _mm_cmple_ps(val,CT_current_21);
            sample = _mm_and_ps(sample,t7);
              out  = _mm_or_ps(out,sample);
            sample = _mm_cmple_ps(val,CT_current_22);
            sample = _mm_and_ps(sample,t8);
              out  = _mm_or_ps(out,sample);

            _mm_storeu_ps((destination_buff+(j-1)),out);
            dst_d =  (uchar *)destination_buff;

        }

    CT_image_0  += width;
    CT_image_1  += width;
    CT_image_2  += width;

    dst_d += (width);

}

All store operations work on float or __m128i. How can I store the result as uchar?

Ashwin
  • Each iteration you produce 4 results, so what you would need to do would be every 4 iterations you pack 4 x 4 results into a single 16 x 8 bit vector and store that with `_mm_storeu_si128`. Alternatively just extract 4 bytes from each float vector on every iteration and store those using scalar code. – Paul R Oct 16 '14 at 14:43
  • Is it possible to directly shift the "out" variable to extract 4 bytes from it? The four bytes of "out" are not exactly the same as the expected output, since it is a float variable. – Ashwin Oct 16 '14 at 16:03
  • Either of the two methods I mentioned in my comment above should be fine - unfortunately I don't have time to write up a complete answer just now, but I'll take another look again tomorrow to see if you're still stuck. – Paul R Oct 16 '14 at 16:39
  • Please fix the code. You have the two lines (1) `if(val <= d1[j+1]) temp2 += 0x20; }` and (2) `if(val <= d2[j-1]) temp2 += 0x10;}` with `}` which close the `for` loops, leaving subsequent `}` to close some other construct not shown. – Jonathan Leffler Aug 02 '15 at 03:46
  • [you don't need to cast the result of malloc in C](https://stackoverflow.com/q/605845/995714) – phuclv Dec 05 '17 at 09:18
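
The first suggestion in the comments (pack 4 x 4 results into one 16 x 8-bit vector) could look roughly like the sketch below. It assumes each of the four input vectors already holds genuine float values that are whole numbers in 0..255; the helper name `store16_floats_as_u8` is made up for illustration and is not part of the question's code. The floats are converted to 32-bit integers first, then narrowed in two steps.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper (name is an assumption, not from the question).
   f0..f3 each hold four floats with whole-number values in 0..255.
   Writes the 16 values to dst as unsigned bytes, f0's lanes first. */
static void store16_floats_as_u8(uint8_t *dst,
                                 __m128 f0, __m128 f1, __m128 f2, __m128 f3)
{
    /* Convert float values to 32-bit integers (a value conversion, not a cast). */
    __m128i i0 = _mm_cvtps_epi32(f0);
    __m128i i1 = _mm_cvtps_epi32(f1);
    __m128i i2 = _mm_cvtps_epi32(f2);
    __m128i i3 = _mm_cvtps_epi32(f3);

    /* Narrow 32-bit -> 16-bit with signed saturation (0..255 fits easily). */
    __m128i w01 = _mm_packs_epi32(i0, i1);
    __m128i w23 = _mm_packs_epi32(i2, i3);

    /* Narrow 16-bit -> 8-bit with unsigned saturation, then store 16 bytes. */
    __m128i bytes = _mm_packus_epi16(w01, w23);
    _mm_storeu_si128((__m128i *)dst, bytes);
}
```

Casting the float pointer to `uchar *` only reinterprets the bytes of each float, which is why the value 8.0 does not read back as 8; the conversion has to happen on the values before narrowing.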

1 Answer


You can do packed-compare to get a mask, but then use that mask with integer ops. _mm_set1_ps(0x80) is a sign you're doing something weird. You probably shouldn't convert power-of-two bitmasks to floating point, because adding them with _mm_add_ps is a lot slower than combining them with _mm_or_si128.
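
As a rough illustration of that idea, the sketch below keeps the float compare but treats its result as an integer mask, combines the bit flags with integer AND/OR, and writes four bytes per call. The helper name `census4` and the `nb` array layout (eight neighbour vectors, most significant flag first) are assumptions made for this example. The final XOR with 0xFF mirrors the `temp1 &= (~temp2)` step of the C version, which the SSE code in the question does not perform.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (name and signature are assumptions).
   centre: four centre pixels; nb[0..7]: the eight neighbour vectors,
   nb[0] mapping to flag 0x80 and nb[7] to flag 0x01.
   Writes four result bytes to dst. */
static void census4(uint8_t *dst, __m128 centre, const __m128 nb[8])
{
    __m128i flags = _mm_setzero_si128();

    for (int k = 0; k < 8; ++k) {
        /* All-ones lanes where centre <= neighbour, reinterpreted as integers. */
        __m128i mask = _mm_castps_si128(_mm_cmple_ps(centre, nb[k]));
        /* Integer bit constant, not a float: 0x80, 0x40, ..., 0x01. */
        __m128i bit = _mm_set1_epi32(0x80 >> k);
        flags = _mm_or_si128(flags, _mm_and_si128(mask, bit));
    }

    /* Same as 0xFF & ~flags, since flags only uses the low 8 bits. */
    flags = _mm_xor_si128(flags, _mm_set1_epi32(0xFF));

    /* Narrow 4 x 32-bit -> bytes; only the low 4 bytes are of interest. */
    __m128i w = _mm_packs_epi32(flags, flags);
    __m128i b = _mm_packus_epi16(w, w);

    int32_t four = _mm_cvtsi128_si32(b);
    memcpy(dst, &four, sizeof four);  /* x86 is little-endian: lane 0 lands in dst[0] */
}
```

Note that OR-ing the bit patterns of float constants does not add their values (for example, the patterns of 128.0f and 64.0f OR together to the pattern of 256.0f), which is another reason to keep the flags as integers.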

You might also be better off with palignr for some of your offset-loads, to balance your code between the load ports and the ALU ports.
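
A minimal sketch of the palignr idea, assuming SSSE3 is available and a GCC- or Clang-style compiler (the `target` attribute is compiler-specific): two loads per row replace three, and the offset vectors are produced with `_mm_alignr_epi8`. The helper name `load_shifted` is made up for this example. Unlike three overlapping unaligned loads, it reads eight consecutive floats, so the row needs two more readable elements at the end.

```c
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 (palignr) */

/* Hypothetical helper (name is an assumption).
   Reads p[0..7] and produces the three 4-float windows starting at
   p, p+1 and p+2 using two loads plus two palignr shuffles. */
__attribute__((target("ssse3")))
static void load_shifted(const float *p,
                         __m128 *left, __m128 *mid, __m128 *right)
{
    __m128 a = _mm_loadu_ps(p);       /* p[0..3] */
    __m128 b = _mm_loadu_ps(p + 4);   /* p[4..7] */

    __m128i ai = _mm_castps_si128(a);
    __m128i bi = _mm_castps_si128(b);

    *left  = a;                                             /* p[0..3] */
    *mid   = _mm_castsi128_ps(_mm_alignr_epi8(bi, ai, 4));  /* p[1..4] */
    *right = _mm_castsi128_ps(_mm_alignr_epi8(bi, ai, 8));  /* p[2..5] */
}
```

Whether this is actually faster than plain unaligned loads depends on the target CPU, so it is worth measuring both variants.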

Peter Cordes