returning Z flag under ARM NEON

Question

I have a NEON function doing some comparisons:

inline bool all_ones(int32x4_t v) noexcept
{
  v = ~v;

  ::std::uint32_t r;

  auto high(vget_high_s32(int32x4_t(v)));
  auto low(vget_low_s32(int32x4_t(v)));

  asm volatile ("VSLI.I32 %0, %1, #16" : "+w"(high), "+w"(low));
  asm volatile ("VCMP.F64 %0, #0" : "=w"(high));
  asm volatile ("VMRS %0, FPSCR" : "=r"(r) : "w"(high));

  return r & (1 << 30);
}

Each of the 4 int components of v can only be all ones or all zeros. The function returns true if all 4 components are all ones, and false otherwise. The return part expands into 3 instructions, which is a lot for me. Is there a better way to return the Z flag?

EDIT: After long, hard pondering, the above could be replaced by:

inline bool all_ones(int32x4_t const v) noexcept
{
  return int32_t(-1) == int32x2_t(
    vtbl2_s8(
      int8x8x2_t{
        int8x8_t(vget_low_s32(int32x4_t(v))),
        int8x8_t(vget_high_s32(int32x4_t(v)))
      },
      int8x8_t{0, 4, 8, 12}
    )
  )[0];
}

There exists a mask extraction instruction in NEON.

What Are You Really Trying To Do™? What's the purpose of this function? What are you going to do with the result? — Stephen Canon, Apr 23 '15 at 19:14
If `v` can only be all ones or all zeroes then just compare one of the bytes against `0`? I dunno. — Lightness Races in Orbit, Apr 23 '15 at 19:24
SSE has a quick way to generate a "mask" from the high bits of every vector element, but not NEON. The compiler is probably doing 2 pairwise adds and then comparing the result to -4 (all 4 elements true = -1 each one) — BitBank, Apr 23 '15 at 20:21
@BitBank I'd paste the instructions, but I compile in QEMU and can't copy and paste. It seems to AND a register with a mask and then shift right by 30. Your pairwise idea is very good :) How would I coax the compiler into doing that (other than with asm (), of course)? BTW: Could I capture the Z flag using bit-fields? — user1095108, Apr 23 '15 at 20:28
As an alternative to `vpadd_s32` twice and comparing a lane with -4 you could also go for `vzip_u8` twice and compare a (u32)lane with 0xffffffff also in 4 instructions - this essentially becomes a duplicate of [this question](http://stackoverflow.com/questions/29167707/translating-sse-to-neon-how-to-pack-and-then-extract-32bit-result) if you look closely enough. Either way, not abusing floating-point would be nice ;) I don't think reading FPSCR is as horribly expensive as writing it, but still... — Notlikethat, Apr 23 '15 at 20:37
@Notlikethat It is not really the expense that is bothering me, but the inability of the compiler to optimize the extraction of the Z flag without shifting and ANDing. — user1095108, Apr 23 '15 at 20:44
Isn't `bool` defined to take the values 0 and 1 specifically? In which case `(r & (1 << 30)) >> 30` _is_ some kind of optimisation of the type conversion compared to testing if it's nonzero and conditionally putting 0 or 1 into the register. Having the code be `return (r >> 30) & 1` would make more sense in that particular respect. — Notlikethat, Apr 23 '15 at 20:52
Remember that modern processors have barrel shifters, so (r>>30) is a single clock instruction. — BitBank, Apr 23 '15 at 21:12
Well, that whole expression should really compile down to a single `ubfx` either way on anything sufficiently modern, but it all depends on the compiler and optimisation settings. — Notlikethat, Apr 23 '15 at 22:37
To be clear, the objective of the question is to have the inlined function produce a test result which can be used by the caller for a conditional operation; rather than performing extra work to produce a specific number which then has to be compared with zero. Is that right? — sh1, Apr 24 '15 at 15:30
@BitBank But it is possible to extract masks in NEON; check out VTBL, it is the canonical mask extractor instruction. — user1095108, May 01 '15 at 15:00
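
For reference, the pairwise-add idea from the comments above (two `vpadd_s32` steps, then compare a lane with -4) might look roughly like the sketch below. The name `all_ones_padd` and the array-based signature are assumptions made for illustration and do not come from the thread. The `#else` branch is a portable scalar model of the same reduction, so the sketch also builds on non-ARM hosts. It relies on the stated precondition that every lane is either 0 or -1.

```cpp
#include <array>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from the thread): true when all four lanes are -1,
// assuming each lane is either 0 or -1.
inline bool all_ones_padd(std::array<std::int32_t, 4> const& a) noexcept
{
#if defined(__ARM_NEON)
  int32x4_t const v = vld1q_s32(a.data());
  // First pairwise add: {a0 + a1, a2 + a3}
  int32x2_t s = vpadd_s32(vget_low_s32(v), vget_high_s32(v));
  // Second pairwise add: lane 0 now holds a0 + a1 + a2 + a3
  s = vpadd_s32(s, s);
  return vget_lane_s32(s, 0) == -4;
#else
  // Scalar model of the same reduction, using wrapping 32-bit adds.
  std::uint32_t sum = 0;
  for (std::int32_t x : a) sum += static_cast<std::uint32_t>(x);
  return sum == 0xfffffffcu; // bit pattern of -4
#endif
}
```

Because the result is produced by an ordinary integer compare on a core register, the compiler is free to fold it into the caller's branch when the function is inlined, with no FPSCR read involved.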

sh1 · Accepted Answer · 2015-04-24T02:11:33.960

1

You really don't want to mix NEON with VFP if you can avoid it.

I suggest:

bool all_ones(int32x4_t v) {
    int32x2_t l = vget_low_s32(v), h = vget_high_s32(v);
    uint32x2_t m = vpmin_u32(vreinterpret_u32_s32(l),
                             vreinterpret_u32_s32(h));
    m = vpmin_u32(m, m);
    return vget_lane_u32(m, 0) == 0xffffffff;
}

If you're really sure the only non-zero value will be 0xffffffff then you can drop the comparison. Compiled standalone it might have a couple of unnecessary operations, but when it's inlined the compiler should fix that.

edited Apr 24 '15 at 02:11

answered Apr 23 '15 at 22:04

sh1


I suggested this seven hours ago. OP said "it has to be done in parallel". Can you explain why this solution doesn't work "in parallel"? Such a statement makes no sense to me. – Lightness Races in Orbit Apr 24 '15 at 02:52
I think it's not really clear what you meant by 'one of the bytes'. You have to perform some kind of permutation to get a byte from each lane into a shared word where it can be tested. The method I posted here uses `VPMIN` for its permute. If either of an adjacent pair of words is zero, then the result is zero; and this has to be done twice to bring all four words into the same lane before moving it out to scalar for a final compare to see if any of the words were zero. – sh1 Apr 24 '15 at 03:25
The question has now been changed; originally it said that `v` could only be zero or one :) – Lightness Races in Orbit Apr 24 '15 at 09:33
@LightningRacisinObrit Yeah, I wasn't clear and edited due to your comment. – user1095108 Apr 24 '15 at 10:18
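
To make the `VPMIN` reduction and the "drop the comparison" remark from sh1's answer concrete, here is a minimal sketch. The names `min_lane`, `all_ones_cmp` and `all_ones_nocmp` and the array-based signatures are hypothetical, chosen only for this illustration. The `#else` branch is a scalar stand-in so the sketch also builds without NEON.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: unsigned minimum of the four lanes, computed with
// two pairwise-minimum steps as in the answer.
inline std::uint32_t min_lane(std::array<std::uint32_t, 4> const& a) noexcept
{
#if defined(__ARM_NEON)
  uint32x4_t const v = vld1q_u32(a.data());
  // {min(a0, a1), min(a2, a3)}
  uint32x2_t m = vpmin_u32(vget_low_u32(v), vget_high_u32(v));
  // lane 0 becomes min(a0, a1, a2, a3)
  m = vpmin_u32(m, m);
  return vget_lane_u32(m, 0);
#else
  // Scalar model of the same reduction.
  return std::min(std::min(a[0], a[1]), std::min(a[2], a[3]));
#endif
}

// Strict form: correct for arbitrary lane values.
inline bool all_ones_cmp(std::array<std::uint32_t, 4> const& a) noexcept
{
  return min_lane(a) == 0xffffffffu;
}

// Relaxed form: only valid if every lane is 0 or 0xffffffff.
inline bool all_ones_nocmp(std::array<std::uint32_t, 4> const& a) noexcept
{
  return min_lane(a) != 0;
}
```

The two forms agree whenever every lane is 0 or 0xffffffff. They differ for other inputs: lanes of {1, 0xffffffff, 0xffffffff, 0xffffffff} give false from the strict form and true from the relaxed one.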

score 0 · Answer 2 · answered Apr 23 '15 at 22:05

This seems to do the trick:

inline bool all_ones(int32x4_t v) noexcept
{
  v = ~v;

  auto high(vget_high_s32(int32x4_t(v)));
  auto low(vget_low_s32(int32x4_t(v)));

  asm volatile ("VSLI.I32 %0, %1, #16" : "+w"(high), "+w"(low));

  return !reinterpret_cast<double&>(high);
}

But the zip and pairwise-add tricks produce superior code.
