6

There is a big (~100 000) array of floating point variables, and there is a threshold (also floating point).

The problem is that I have to compare each one variable from the array with a threshold, but NEON flags transfer takes a really long time (~20 cycles in accordance to a profiler).

Is there any efficient way to compare these values?

NOTE: As rounding error doesn't matter, I tried the following:

float arr[10000];
float threshold; 
....

int a = arr[20]; // e.g.
int t = threshold;
if (t > a) {....}

But in this case I getting the following processor command sequence:

vldr.32        s0, [r0]
vcvt.s32.f32   s0, s0
vmov           r0, s0    <--- takes 20 cycles as `vmrs APSR_nzcv, fpscr` in case of 
cmp            r0, r1         floating point comparison

As conversion happens at NEON, there is no matter if I compare integers, by described way or floats.

Alex
  • 9,891
  • 11
  • 53
  • 87
  • People over at codereview.stackexchange.com might have to say something to this too. – PlasmaHH Apr 30 '12 at 10:20
  • 3
    Your code is inconsistent with your problem statement - the data is float but you show threshold as int - you also cast each float data value to int - why ? If you data is float then the threshold should be float and you should do a float comparison (i.e. no int-float conversions). Also, what do you plan to do with values that are greater than (or less than) threshold (this will determine whether NEON is appropriate or not) ? – Paul R Apr 30 '12 at 10:28
  • 2
    Many people ditch NEON for being slower than ARM without knowing what to avoid and how to program SIMD properly. Depending on what you exactly want, it's either not SIMD feasible to start with, or you don't know how to handle if-else with NEON. – Jake 'Alquimista' LEE May 02 '12 at 02:36
  • 2
    if-else on NEON : 1. create a mask value with a VCnn instruction. 2. calculate results for both cases (if, else) in two different registers. 3. combine both results with a VBnn instruction using the mask value from 1. – Jake 'Alquimista' LEE May 02 '12 at 02:47
  • 2
    Since you always have to calculate both cases on NEON regardless of the condition met, it might be slower than on ARM, if the calculations are very complex. If there are multiple associated if-elses, forget NEON. – Jake 'Alquimista' LEE May 02 '12 at 02:54
  • @Jake'Alquimista'LEE unfortunately, I can't forget about NEON as I have to use it for all floating point calculations. – Alex May 02 '12 at 06:05
  • 2
    NEON->ARM register transfers always suffer 11~14 cycles. This should be avoided at all costs within loops. And why those type castings when you can compare floats very fast even without rounding errors? I'll show you how in a separate answer. – Jake 'Alquimista' LEE May 02 '12 at 07:06

4 Answers4

5

If floats are 32-bit IEEE-754 and ints are 32-bit too and if there are no +infinity, -infinity and NaN values, we can compare floats as ints with a little trick:

#include <stdio.h>
#include <limits.h>
#include <assert.h>

#define C_ASSERT(expr) extern char CAssertExtern[(expr)?1:-1]
C_ASSERT(sizeof(int) == sizeof(float));
C_ASSERT(sizeof(int) * CHAR_BIT == 32);

int isGreater(float* f1, float* f2)
{
  int i1, i2, t1, t2;

  i1 = *(int*)f1;
  i2 = *(int*)f2;

  t1 = i1 >> 31;
  i1 = (i1 ^ t1) + (t1 & 0x80000001);

  t2 = i2 >> 31;
  i2 = (i2 ^ t2) + (t2 & 0x80000001);

  return i1 > i2;
}

int main(void)
{
  float arr[9] = { -3, -2, -1.5, -1, 0, 1, 1.5, 2, 3 };
  float thr;
  int i;

  // Make sure floats are 32-bit IEE754 and
  // reinterpreted as integers as we want/expect
  {
    static const float testf = 8873283.0f;
    unsigned testi = *(unsigned*)&testf;
    assert(testi == 0x4B076543);
  }

  thr = -1.5;
  for (i = 0; i < 9; i++)
  {
    printf("%f %s %f\n", arr[i], "<=\0> " + 3*isGreater(&arr[i], &thr), thr);
  }

  thr = 1.5;
  for (i = 0; i < 9; i++)
  {
    printf("%f %s %f\n", arr[i], "<=\0> " + 3*isGreater(&arr[i], &thr), thr);
  }

  return 0;
}

Output:

-3.000000 <= -1.500000
-2.000000 <= -1.500000
-1.500000 <= -1.500000
-1.000000 >  -1.500000
0.000000 >  -1.500000
1.000000 >  -1.500000
1.500000 >  -1.500000
2.000000 >  -1.500000
3.000000 >  -1.500000
-3.000000 <= 1.500000
-2.000000 <= 1.500000
-1.500000 <= 1.500000
-1.000000 <= 1.500000
0.000000 <= 1.500000
1.000000 <= 1.500000
1.500000 <= 1.500000
2.000000 >  1.500000
3.000000 >  1.500000

Of course, it makes sense to precalculate that final integer value in isGreater() that's used in the comparison operator if your threshold doesn't change.

If you are afraid of undefined behavior in C/C++ in the above code, you can rewrite the code in assembly.

Alexey Frunze
  • 61,140
  • 12
  • 83
  • 180
2

If your data is float then you should do your comparisons with floats, e.g.

float arr[10000];
float threshold;
....

float a = arr[20]; // e.g.
if (threshold > a) {....}

otherwise you will have expensive float-int conversions.

Paul R
  • 208,748
  • 37
  • 389
  • 560
  • If I make a comparison between 2 floats it causes expensive flag-register transfer. That's why I tried to make a comparison between 2 integers. – Alex Apr 30 '12 at 10:35
  • What operations are you subsequently performing when the threshold test is true/false ? – Paul R Apr 30 '12 at 10:37
  • vcmpe.f32 s17, s16 vmrs APSR_nzcv, fpscr If I got your question. – Alex Apr 30 '12 at 10:43
  • Alex: what I meant was: what happens in the `{ ... }` after your threshold test ? This will determine whether using NEON for the test is appropriate or not. – Paul R Apr 30 '12 at 10:49
  • There is a lot of code after test, that's why I marked it as `...`. – Alex Apr 30 '12 at 10:55
  • 1
    Well if there's a lot of code then the cost of the test should be negligible (unless the overwhelming majority of data points do not need processing ?). – Paul R Apr 30 '12 at 11:04
2

Your example shows how bad compiler-generated codes can be :

It loads a value with NEON just to convert it to int, then does a NEON->ARM transfer that causes a pipeline flush resulting in 11~14 cycles wasted.

The best solution would be writing the function completely in hand assembly.

However, there is a simple trick that allows fast float comparisons without typecasting AND truncation:

Threshold positive (exactly as fast as int comparison) :

void example(float * pSrc, float threshold, unsigned int count)
{
  typedef union {
    int ival,
    unsigned int uval,
    float fval
  } unitype;

  unitype v, t;
  if (count==0) return;
  t.fval = threshold;
  do {
    v.fval = *pSrc++;
    if (v.ival < t.ival) {
      // your code here
    }
    else {
      // your code here (optional)
    }
  } while (--count);
}

Threshold negative (1 cycle more per value than int comparison):

void example(float * pSrc, float threshold, unsigned int count)
{
  typedef union {
    int ival,
    unsigned int uval,
    float fval
  } unitype;

  unitype v, t, temp;
  if (count==0) return;
  t.fval = threshold;
  t.uval &= 0x7fffffff;
  do {
    v.fval = *pSrc++;
    temp.uval = v.uval ^ 0x80000000;
    if (temp.ival >= t.ival) {
      // your code here
    }
    else {
      // your code here (optional)
    }
  } while (--count);
}

I think it to be quite a lot faster than the accepted one above. Again, I'm a bit too late.

Jake 'Alquimista' LEE
  • 6,197
  • 2
  • 17
  • 25
0

If the rounding errors do not matter, then you should use std::lrint.

The Faster Floating Point to Integer Conversions recommends to use it for float to int conversion.

BЈовић
  • 62,405
  • 41
  • 173
  • 273