Converting arm code to use NEON intrinsics

Question

I have been trying to modify the code below to use NEON intrinsics in order to speed it up. Unfortunately, nothing seems to work correctly. Does anyone have any idea what is going wrong? I have already changed the doubles to single-precision floats.

typedef         float       REAL;
typedef         REAL        VEC3[3];    

typedef struct  driehoek
{
    VEC3        norm;                   /* Face normal. */
    REAL        d;                      /* Plane equation D. */
    VEC3        *vptr;                  /* Global vertex list pointer. */
    VEC3        *nptr;                  /* Global normal list pointer. */
    INT         vindex[3];              /* Index of vertices. */
    INT         indx;                   /* Normal component max flag. */
    BOOL        norminterp;             /* Do normal interpolation? */
    BOOL        vorder;                 /* Vertex order orientation. */
}driehoek;

typedef struct element
{
    INT         index;
    struct object   *parent;            /* Ptr back to parent object.    */
    CHAR        *data;                  /* Pointer to data info.         */
    BBOX        bv;                     /* Element bounding volume.      */
}ELEMENT;

INT TriangleIntersection(RAY *pr, ELEMENT *pe, IRECORD *hit)
{
    FLOAT      Rd_dot_Pn;       /* Polygon normal dot ray direction. */
    FLOAT      Ro_dot_Pn;       /* Polygon normal dot ray origin.    */
    FLOAT      q1, q2;
    FLOAT      tval;            /* Intersection t distance value.    */
    VEC3       *v1, *v2, *v3;       /* Vertex list pointers.         */
    VEC3       e1, e2, e3;      /* Edge vectors.             */
    driehoek   *pt;         /* Ptr to triangle data.         */


    pt = (driehoek *)pe->data;

    Rd_dot_Pn = VecDot(pt->norm, pr->D);

    if (ABS(Rd_dot_Pn) < RAYEPS)        /* Ray is parallel.      */
        return (0);

        hit->b3 = e1[0] * (q2 - (*v1)[1]) - e1[1] * (q1 - (*v1)[0]);
        if (!INSIDE(hit->b3, pt->norm[2]))
            return (0);
        break;
    }

    return (1);
 }

How do you use NEON Intrinsics? You don't use any of it in your code so far. — auselen, May 15 '13 at 06:25

artless noise · Answer 1 · 2021-03-02T18:39:18.663

An array such as float vec[3] is not enough of a hint to the compiler that NEON intrinsics can be used. The issue is that each element of float vec[3] is individually addressable, so the compiler must store each one in its own floating-point register. See the gcc NEON intrinsic documentation.

Although 3 dimensions are very common in this Universe, our friends the computers like binary. So you have two data types that can be used for NEON intrinsics: float32x4_t and float32x2_t. You need to use intrinsics such as vfmaq_f32, vsubq_f32, etc. These intrinsics are different for each compiler; I guess you are using gcc. You should only use the intrinsic data types, as combining a float32x2_t with a single float can result in movement between register types, which is expensive. If your algorithm can treat each dimension separately, then you might be able to combine types. However, I don't think you will have register pressure, and the SIMD speed-up should be beneficial. I would keep everything in float32x4_t to begin with. You may be able to use the extra dimension for 3D projection when it comes to the rendering phase.
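As a rough illustration of the float32x4_t approach, here is a minimal sketch of an edge-vector subtraction and a dot product. It is not the original poster's code: the VEC4 type and the vec4_sub and vec4_dot helpers are hypothetical names, and it assumes each 3-component vector is padded with a fourth lane that is kept at zero. The scalar fallback is only there so the sketch also compiles on non-ARM hosts.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical padded vector: x, y, z plus a fourth lane kept at 0. */
typedef struct { float v[4]; } VEC4;

/* out = a - b, all four lanes at once (e.g. an edge vector v2 - v1). */
static void vec4_sub(VEC4 *out, const VEC4 *a, const VEC4 *b)
{
    assert(out != NULL && a != NULL && b != NULL);
#if HAVE_NEON
    float32x4_t va = vld1q_f32(a->v);
    float32x4_t vb = vld1q_f32(b->v);
    vst1q_f32(out->v, vsubq_f32(va, vb));
#else
    for (int i = 0; i < 4; i++)
        out->v[i] = a->v[i] - b->v[i];
#endif
}

/* Dot product; the zero fourth lane contributes nothing to the sum. */
static float vec4_dot(const VEC4 *a, const VEC4 *b)
{
    assert(a != NULL && b != NULL);
#if HAVE_NEON
    float32x4_t prod = vmulq_f32(vld1q_f32(a->v), vld1q_f32(b->v));
    /* Horizontal add: fold 4 lanes to 2, then 2 lanes to 1. */
    float32x2_t sum2 = vadd_f32(vget_low_f32(prod), vget_high_f32(prod));
    sum2 = vpadd_f32(sum2, sum2);
    return vget_lane_f32(sum2, 0);   /* moves the result to a scalar */
#else
    return a->v[0] * b->v[0] + a->v[1] * b->v[1]
         + a->v[2] * b->v[2] + a->v[3] * b->v[3];
#endif
}
```

Note that vec4_dot ends by extracting a lane into a scalar float, which is exactly the kind of movement between register types mentioned above. In a hot loop it is probably better to keep intermediate results in vector registers for as long as possible and extract a scalar only where a comparison such as the RAYEPS test needs one.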

Here is the source to a cmath library called math-neon, released under the LGPL. Instead of using intrinsics with gcc, it uses inline assembler (see also: Neon intrinsics vs assembly).

See also: armcc NEON intrinsics, if you are using the ARM compiler.

[OpenCV Neon Intrinsics](https://github.com/opencv/opencv/blob/master/modules/core/include/opencv2/core/hal/intrin_neon.hpp) has an implementation, but it is more complex to understand as it provides a common API over AVX/SSE and other SIMD instruction sets. — artless noise, Mar 03 '21 at 12:06
