
I am trying to implement a rounding function using ARM Neon intrinsics.

This function looks like this:

float roundf(float x) {
    return signbit(x) ? ceil(x - 0.5) : floor(x + 0.5);
}

Is there a Neon intrinsic that does this directly? If not, how can I implement this function with Neon intrinsics?

edited

I call roundf on the result of multiplying two floats, and I need this to work on both ARMv7 and ARMv8.

My compiler is clang.

On ARMv8 this can be done with vrndaq_f32: https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]&q=vrndaq_f32

How can I do this on ARMv7?

edited

My implementation

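// input: float32x4_t arg
float32x4_t vector_zero = vdupq_n_f32(0.f);
float32x4_t neg_half = vdupq_n_f32(-0.5f);
float32x4_t pos_half = vdupq_n_f32(0.5f);

// mask lanes are all-ones where arg >= 0, all-zeros otherwise
uint32x4_t mask = vcgeq_f32(arg, vector_zero);
// keep -0.5 only in negative lanes (neg_half & ~mask)
uint32x4_t mask_neg = vbicq_u32(vreinterpretq_u32_f32(neg_half), mask);
// keep +0.5 only in non-negative lanes (pos_half & mask)
uint32x4_t mask_pos = vandq_u32(mask, vreinterpretq_u32_f32(pos_half));
arg = vaddq_f32(arg, vreinterpretq_f32_u32(mask_pos));
arg = vaddq_f32(arg, vreinterpretq_f32_u32(mask_neg));
// float -> int32 conversion truncates toward zero
int32x4_t arg_int32 = vcvtq_s32_f32(arg);
arg = vcvtq_f32_s32(arg_int32);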
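
For reference, the intent of the snippet above (add +0.5 to non-negative lanes and -0.5 to negative lanes, then truncate through an integer conversion) can be modelled for a single lane in plain C. This is only a sketch: the helper name `round_add_half_trunc` is hypothetical, and it assumes |x| < 2^31 so that the conversion to int32 stays in range.

```c
#include <stdint.h>

/* Scalar model of one lane (hypothetical helper, not part of any library).
   Assumes |x| < 2^31 so the float -> int32 conversion is in range. */
static float round_add_half_trunc(float x) {
    /* Mirrors vcgeq_f32 plus the mask selection: +0.5 if x >= 0, else -0.5 */
    float half = (x >= 0.0f) ? 0.5f : -0.5f;
    /* Mirrors vaddq_f32: volatile forces the sum to be rounded to float */
    volatile float sum = x + half;
    /* Mirrors vcvtq_s32_f32 (truncate toward zero) and vcvtq_f32_s32 */
    return (float)(int32_t)sum;
}
```

For example, `round_add_half_trunc(2.5f)` gives `3.0f` and `round_add_half_trunc(-2.5f)` gives `-3.0f`. Because the addition itself rounds, this model inherits the same corner cases as `floor(x + 0.5)`.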

Is there a better way to implement this?

  • Many neon instructions have optional rounding baked in. Depending on where you want to use this, it might be better to combine the rounding with whatever operation precedes it. –  Oct 28 '21 at 13:52
  • Are you working on the compiler? – 0___________ Oct 28 '21 at 14:17
  • there are no direct rounding instructions for `float` data on neon. However, you can convert to `int` with one fraction bit, then use it for rounding. – Jake 'Alquimista' LEE Oct 28 '21 at 15:01
  • Is this for 32 or 64 bit? AArch64 has the `FRINTZ` instruction which I think is what you want. – Nate Eldredge Oct 28 '21 at 17:09
  • This is what, round to nearest with ties toward 0.0? So it's the C `roundf()` function? As Nate says, this can be done with `frint` for AArch64. Clang really does inline `round()` as `frinta`: https://godbolt.org/z/o8arE1795. But not for ARM32. – Peter Cordes Oct 28 '21 at 23:50
  • this can be done with vrndaq_f32: https://developer.arm.com/architectures/instruction-sets/intrinsics/#f:@navigationhierarchiessimdisa=[Neon]&q=vrndaq_f32 for armv8. – 洋葱骑士 Oct 29 '21 at 03:56
  • Oh sorry, it's rounding ties away from zero, I forgot which way `signbit` goes. – Nate Eldredge Oct 29 '21 at 04:27
  • You should be aware that `floor(x + 0.5)` has some corner cases where it does the wrong thing. E.g., assuming IEEE 754 binary32 format for `float` and round-ties-to-even for arithmetic operations, if `x = 0x1.fffffep-2` (around 0.49999997) then it should round to `0`, but `round(x + 0.5)` will round it to `1.0`. Similarly, if `x = 10000001` (which is exactly representable in binary32 format) then under normal rounding rules `x + 0.5` will round `x` to `10000002` instead of `10000001` – Mark Dickinson Oct 29 '21 at 14:39
  • Sorry, that `round(x + 0.5)` should say `floor(x + 0.5)` in the previous comment. Related: https://stackoverflow.com/q/9902968/270986 – Mark Dickinson Oct 29 '21 at 14:47

2 Answers


It's important that you define which form of rounding you really want. See Wikipedia for a sense of how many rounding choices there are.

From your code snippet, you want commercial (symmetric) rounding: round to nearest, with ties rounded away from zero. On ARMv8 / ARM64, vrndaq_f32 should do that.

By contrast, SSE4.1 _mm_round_ps (with _MM_FROUND_TO_NEAREST_INT) and the ARMv8 NEON intrinsic vrndnq_f32 do banker's rounding, i.e. round to nearest with ties to even.

Chuck Walbourn
  • Are you sure about `vrndaq_f32`? The manual link in the question says it rounds "to nearest with ties **Away**", i.e. `round()` not `nearbyint()` like `_mm_round_ps`. Also, if ARM blends work like x86 `vblendps`, you don't need to *compute* a select mask, you can literally use the FP bit patterns as selectors. -0.0 and +0.0 will still round to some kind of zero, I think actually preserving the input sign. (Which wouldn't happen if you compared for equality against +0.0) – Peter Cordes Oct 29 '21 at 04:34
  • Let me double-check my test coverage on ARM64 then. If ``vrndaq_f32`` is supposed to be like C99 ``roundf`` then something's amiss. – Chuck Walbourn Oct 29 '21 at 04:52
  • Sorry, I'm actually using ``vrndnq_f32`` which is ``FRINTN`` which is bankers. ``vrndaq_f32`` is ``FRINTA`` which is nearest, ties away from zero. – Chuck Walbourn Oct 29 '21 at 05:04

Your solution is VERY expensive, both in cycle counts and register utilization.

Provided -(2^30) < arg < (2^30), you can do the following:

int32x4_t argi = vcvtq_n_s32_f32(arg, 1);
argi = vsraq_n_s32(argi, argi, 31);
argi = vrshrq_n_s32(argi, 1);
arg = vcvtq_f32_s32(argi);

It needs no register other than arg itself, takes four inexpensive instructions, and works on both AArch32 and AArch64.
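
To make the four steps easier to follow, here is a sketch that wraps them in a function and adds a portable scalar model of a single lane. Both helper names are hypothetical. The NEON part is only compiled when `__ARM_NEON` is defined. The scalar model assumes -2^30 < x < 2^30; x = -2^30 is excluded because subtracting 1 from the doubled value would overflow int32.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
/* The four intrinsics from the answer, wrapped in a hypothetical helper. */
static inline float32x4_t round_ties_away_neon(float32x4_t arg) {
    int32x4_t argi = vcvtq_n_s32_f32(arg, 1); /* truncate arg * 2 to int32 */
    argi = vsraq_n_s32(argi, argi, 31);       /* add -1 in negative lanes */
    argi = vrshrq_n_s32(argi, 1);             /* rounding shift: (argi + 1) >> 1 */
    return vcvtq_f32_s32(argi);               /* back to float */
}
#endif

/* Portable scalar model of one lane; assumes -2^30 < x < 2^30. */
static float round_ties_away_model(float x) {
    /* Fixed point with one fraction bit: doubling is exact, the cast truncates */
    int32_t q = (int32_t)(x * 2.0f);
    /* Same effect as q += q >> 31 with an arithmetic shift */
    q += (q < 0) ? -1 : 0;
    /* Rounding shift right by 1 is floor((q + 1) / 2) */
    int32_t t = q + 1;
    int32_t r = (t >= 0) ? t / 2 : -((1 - t) / 2);
    return (float)r;
}
```

Tracing x = -2.5 through the model: q = -5, then -6 after the sign adjustment, t = -5, and floor(-5 / 2) = -3, so ties go away from zero. Because the input is only doubled (which is exact) rather than offset by 0.5, an input such as `0x1.fffffep-2f` still rounds to 0.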

Godbolt link

Jake 'Alquimista' LEE