1

While learning Arm NEON instruction set, I tried to implement 256-bit numbers comparison (A <= B). Below is the implementation I ended up with, but I doubt my approach is good. Maybe there's some wiser and more optimized way to compare large numbers? Any advice and comments will be appreciated.

// v1:v0 = number A
// v11:v10 = number B
// out: x0 = 0 if A <= B

        // Check A >= B (element by element) - Result1
        cmhs v13.4s, v1.4s, v11.4s
        cmhs v12.4s, v0.4s, v10.4s

        // Check A > B (element by element) - Result2
        cmhi v11.4s, v1.4s, v11.4s
        cmhi v10.4s, v0.4s, v10.4s

        // Narrow down Result2 vector to 64 bits
        xtn v10.4h, v10.4s
        xtn2 v10.8h, v11.4s
        xtn v10.8b, v10.8h
        
        // Narrow down Result1 vector to 64 bits
        xtn v11.4h, v12.4s
        xtn2 v11.8h, v13.4s
        xtn v11.8b, v11.8h
        not v11.8b, v11.8b

        // Compare elements in Result1 and Result2 as scalars
        // if Result2 has all bits set in higher elements than Result1,
        // then A > B
        cmhi d9, d10, d11
        mov x0, v9.d[0]
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Alexander Zhak
  • 9,140
  • 4
  • 46
  • 72
  • Do you want to compare one pair of 256-bit numbers or several (which possibly would allow more efficient solutions)? – Sebastian Mar 23 '22 at 14:12
  • @Sebastian, for now one pair – Alexander Zhak Mar 23 '22 at 14:47
  • 2
    Similar question: https://stackoverflow.com/questions/29742844/a64-neon-simd-256-bit-comparison – Sebastian Mar 23 '22 at 15:15
  • 2
    Two approaches: With the 2 x 4 x 64-bit comparisons you get 8 bits. You could use those for a table lookup for the resulting boolean value. Other possibility: L contains 4 bits for less results. G contains 4 bits for greater results. The overall result is `L >= G`. Not sure, if this is the same as your approach. – Sebastian Mar 23 '22 at 15:56
  • You could use one `uzp1` instead of a pair of `xtn` and `xtn2` – Jake 'Alquimista' LEE Mar 30 '22 at 14:51

1 Answers1

2

I reversed the boolean, replaced xtn/xtn2 pairs with uzp1, and re-scheduled the insructions.

cmhi v12.4s, v10.4s, v0.4s
cmhi v13.4s, v11.4s, v1.4s

cmhi v10.4s, v0.4s, v10.4s
cmhi v11.4s, v1.4s, v11.4s

uzp1    v12.8h, v12.8h, v13.8h
uzp1    v10.8h, v10.8h, v11.8h

xtn     v12.8b, v12.8h
xtn     v10.8b, v10.8h

cmhi    d10, d10, d12
mov     x0, v10.d[0]
Jake 'Alquimista' LEE
  • 6,197
  • 2
  • 17
  • 25