2

My first experience with vectorization intrinsics was with SSE, where there is basically only one data type, __m128i. Switching to Neon, I found the data types and function prototypes to be much more specific, e.g. uint8x16_t (a vector of 16 unsigned char), uint8x8x2_t (2 vectors with 8 unsigned char each), uint32x4_t (a vector with 4 uint32_t) etc.

At first I was enthusiastic (it is much easier to find the exact function operating on the desired data type), then I saw what a mess it becomes when you want to treat the same data in different ways. Using the specific casting operators would take me forever. The problem is also addressed here. I then came up with the idea of a union encapsulated in a struct, plus some casting and assignment operators.

struct uint_128bit_t {
    union {
        uint8x16_t uint8x16;
        uint16x8_t uint16x8;
        uint32x4_t uint32x4;
        uint8x8x2_t uint8x8x2;
        uint8_t uint8_array[16] __attribute__ ((aligned (16) ));
        uint16_t uint16_array[8] __attribute__ ((aligned (16) ));
        uint32_t uint32_array[4] __attribute__ ((aligned (16) ));
    };

    operator uint8x16_t& () {return uint8x16;}
    operator uint16x8_t& () {return uint16x8;}
    operator uint32x4_t& () {return uint32x4;}
    operator uint8x8x2_t& () {return uint8x8x2;}
    uint8x16_t& operator =(const uint8x16_t& in) {uint8x16 = in; return uint8x16;}
    uint8x8x2_t& operator =(const uint8x8x2_t& in) {uint8x8x2 = in; return uint8x8x2;}

};

This approach works for me: I can use a variable of type uint_128bit_t as both the argument and the result of different Neon intrinsics, e.g. vshlq_n_u32, vuzp_u8, vget_low_u8 (in this last case just as input). And I can extend it with more data types if I need to. Note: the arrays are there to make it easy to print the content of a variable.

Is this a correct way of proceeding?
Is there any hidden flaw?
Have I reinvented the wheel?
(Is the aligned attribute necessary?)
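
For comparison, the "specific casting operators" mentioned above are the `vreinterpretq_*` intrinsics from `arm_neon.h`. The following is only a sketch: the function name `shl1_u32_lanes` is made up for illustration, and the non-NEON branch is a portable fallback added so the snippet also builds on other platforms.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: treat 16 bytes as four 32-bit lanes and shift each
// lane left by one bit.
inline void shl1_u32_lanes(const std::uint8_t* src, std::uint8_t* dst) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x16_t bytes = vld1q_u8(src);                 // load 16 bytes
    uint32x4_t words = vreinterpretq_u32_u8(bytes);   // same bits, viewed as 4 x uint32_t
    words = vshlq_n_u32(words, 1);                    // operate on the 32-bit view
    vst1q_u8(dst, vreinterpretq_u8_u32(words));       // view as bytes again and store
#else
    // Portable fallback for non-NEON builds: same lane-wise shift via memcpy.
    std::uint32_t words[4];
    std::memcpy(words, src, sizeof words);
    for (int i = 0; i < 4; ++i) words[i] <<= 1;
    std::memcpy(dst, words, sizeof words);
#endif
}
```

The reinterpret intrinsics only change the type seen by the compiler, so they typically cost no instructions, but one is needed for every pair of types, which is the verbosity the question is trying to avoid.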

Antonio

3 Answers

3

According to the C++ Standard, this data type is nearly useless (and certainly so for the purpose you intend). That's because reading from an inactive member of a union is undefined behavior.

Your compiler may promise to make this work. However, you haven't asked about any particular compiler, so it is impossible to comment further on that.
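
A minimal sketch of the issue, with plain integer arrays standing in for the NEON vector types so that it compiles anywhere. The names `Punned`, `read_word` and `demo` are made up for this illustration.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the union in the question.
union Punned {
    std::uint8_t  bytes[16];
    std::uint32_t words[4];
};

// Well defined: copy the object representation into an object of the type
// we want to read, instead of reading the inactive union member.
inline std::uint32_t read_word(const Punned& p, int i) {
    std::uint32_t w;
    std::memcpy(&w, p.bytes + 4 * i, sizeof w);
    return w;
}

inline std::uint32_t demo() {
    Punned p;
    for (int i = 0; i < 16; ++i) p.bytes[i] = 0;  // `bytes` is now the active member
    p.bytes[0] = 1;
    // return p.words[0];     // undefined behavior in C++: `words` is inactive
    return read_word(p, 0);   // fine: reads the active member through memcpy
}
```

The value returned by `demo()` depends on the byte order of the target. As far as I know, GCC documents reading a different union member than the one last written as supported when the access goes through the union type, but that is a compiler guarantee rather than something the C++ Standard provides.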

Ben Voigt
  • Thanks for pointing this out! I added the gcc tag. What does "inactive member of a union" mean? A union member that has not been "explicitly touched" yet? – Antonio Mar 23 '15 at 11:41
  • 2
    I _think_ that it states that you may only read from the member that was last written to. So if you have an union with two members: `foo` and `bar` and write something to `foo`, then you may only read `foo` - reading from `bar` would be undefined behavior. – simon Mar 23 '15 at 11:46
1

Since the initial proposed method has undefined behaviour in C++, I have implemented something like this:

#include <cstring>                  // memcpy
#include <boost/static_assert.hpp>  // BOOST_STATIC_ASSERT_MSG

template <typename T>
struct NeonVectorType {

    private:
    T data;

    public:
    // Convert to any type of the same size by copying the raw bytes.
    template <typename U>
    operator U () const {
        BOOST_STATIC_ASSERT_MSG(sizeof(U) == sizeof(T),"Trying to convert to data type of different size");
        U u;
        memcpy( &u, &data, sizeof u );
        return u;
    }

    // Assign from any type of the same size by copying the raw bytes.
    template <typename U>
    NeonVectorType<T>& operator =(const U& in) {
        BOOST_STATIC_ASSERT_MSG(sizeof(U) == sizeof(T),"Trying to copy from data type of different size");
        memcpy( &data, &in, sizeof data );
        return *this;
    }

};

Then:

typedef NeonVectorType<uint8x16_t> uint_128bit_t; //suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
typedef NeonVectorType<uint8x8_t> uint_64bit_t; //suitable for uint8x8_t, uint32x2_t, etc.

The use of memcpy is discussed here (and here); it avoids breaking the strict aliasing rule. Note that, in general, the compiler optimizes the memcpy away.
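
The same memcpy technique can also be sketched as a free function. This is an illustration only: `pun_cast`, `Bytes16` and `Words4` are made-up names, the two structs are portable stand-ins for 128-bit NEON vector types, and C++11 `static_assert` replaces the Boost macro.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-ins for two 128-bit NEON vector types.
struct Bytes16 { std::uint8_t  v[16]; };
struct Words4  { std::uint32_t v[4];  };

// Reinterpret the bytes of `from` as a `To` without violating strict aliasing.
template <typename To, typename From>
To pun_cast(const From& from) {
    static_assert(sizeof(To) == sizeof(From), "types must have the same size");
    static_assert(std::is_trivially_copyable<To>::value &&
                  std::is_trivially_copyable<From>::value,
                  "memcpy requires trivially copyable types");
    To to;
    std::memcpy(&to, &from, sizeof to);
    return to;
}

// Round trip: bytes -> words -> bytes must give back the original bytes.
inline bool round_trip_ok() {
    Bytes16 in;
    for (int i = 0; i < 16; ++i) in.v[i] = static_cast<std::uint8_t>(i * 3);
    Words4  mid = pun_cast<Words4>(in);
    Bytes16 out = pun_cast<Bytes16>(mid);
    return std::memcmp(in.v, out.v, sizeof in.v) == 0;
}
```

Because the size check happens at compile time, a mismatched conversion (for example from a 64-bit to a 128-bit type) fails to build instead of silently copying too few bytes.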

As the edit history shows, I had previously implemented a custom version with combine operators for vectors of vectors (e.g. uint8x8x2_t). The problem was mentioned here. However, since those data types are declared as arrays (see guide, section 12.2.2) and are therefore located in consecutive memory locations, the compiler is bound to treat the memcpy correctly.

Finally, to print the content of the variable one could use a function like this.
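
One possible shape for such a function, as a sketch only (this is not the function the link points to, and the name `to_hex` is hypothetical): it copies the raw bytes out with memcpy, so it works for any trivially copyable type, including the NEON vector types.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Format the raw bytes of a trivially copyable value as hex,
// lowest address first, separated by spaces.
template <typename T>
std::string to_hex(const T& value) {
    unsigned char bytes[sizeof(T)];
    std::memcpy(bytes, &value, sizeof(T));   // no aliasing issues: plain byte copy
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(bytes[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Printing is then a matter of passing the result to `std::puts(to_hex(v).c_str())` or a stream.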

Antonio
  • This is also undefined behavior -- violation of the strict aliasing rule. – Ben Voigt Mar 23 '15 at 16:31
  • 1
    You're completely wrong. Strict aliasing is not about syntax, it's about having pointers to things you shouldn't. Your casting hacks are just as bad and undefined as the OP's union hacks. – Puppy Mar 24 '15 at 10:02
  • 1
    @Antonio: All of them, practically. That kind of pointer cast is completely undefined behaviour when T and U do not meet highly specific requirements, which is basically just `char`-related. – Puppy Mar 24 '15 at 10:22
  • @Antonio: Are you having trouble finding resources explaining what type punning and the strict aliasing rule are? There are a number of questions right here on Stack Overflow, with some very detailed answers. – Ben Voigt Mar 24 '15 at 15:17
  • @Antonio: The risk you are taking is thinking that strict aliasing violations are defined as a change going unnoticed under certain circumstances, leading to use of stale values. But strict aliasing violations aren't defined that way, or any other way. They are undefined behavior. The compiler is not required to diagnose all instances of undefined behavior. – Ben Voigt Mar 24 '15 at 15:28
  • @BenVoigt Thanks a lot for your directions. I submitted the problem [here](http://stackoverflow.com/q/29253100/2436175), and got the hint about memcpy. I will clean up the discussion here. – Antonio Mar 25 '15 at 11:03
0

If you try to avoid casting through various kinds of data structure hackery, you'll end up shuffling memory / words around, which will kill any performance you're hoping to get from NEON.

You can probably cast quad registers down to double registers easily, but the other way around might not be possible.

Everything boils down to this: each instruction has only a few bits to index registers. If an instruction expects a quad register, it addresses the double registers in pairs, D(2*n) and D(2*n+1); only n is used in the encoded instruction, and (2*n+1) is implicit for the core. If at any point in your code you try to cast two double registers into a quad, you may be in a position where they are not consecutive, forcing the compiler to shuffle registers to the stack and back to get a consecutive layout.

I think it is still the same answer in different words https://stackoverflow.com/a/13734838/1163019

NEON instructions are designed to be streaming: you load from memory in big chunks, process the data, then store what you want back. The mechanics should all be very simple. If they are not, you'll lose the extra performance NEON offers, which will make people ask why you're trying to use NEON in the first place and making life harder for yourself.

Think of NEON as immutable value types and operations.
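
A minimal sketch of that load, process, store pattern. The function name `add_bias_u8` is made up, `n` is assumed to be a multiple of 16, and the scalar branch is a portable fallback added so the snippet also builds without NEON.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: add `bias` to every byte with saturation,
// 16 bytes per iteration. `n` is assumed to be a multiple of 16.
inline void add_bias_u8(const std::uint8_t* src, std::uint8_t* dst,
                        std::size_t n, std::uint8_t bias) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint8x16_t vbias = vdupq_n_u8(bias);
    for (std::size_t i = 0; i < n; i += 16) {
        uint8x16_t v = vld1q_u8(src + i);   // load one quad register
        v = vqaddq_u8(v, vbias);            // process: saturating add
        vst1q_u8(dst + i, v);               // store the result
    }
#else
    // Portable fallback with the same result, for non-NEON builds.
    for (std::size_t i = 0; i < n; ++i) {
        unsigned sum = static_cast<unsigned>(src[i]) + bias;
        dst[i] = static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
    }
#endif
}
```

Each vector is loaded once, transformed as a value, and stored once; nothing is reinterpreted or moved between register types inside the loop.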

auselen
  • In assembler NEON instructions, or in SSE intrinsics, this conversion mess is completely avoided. I am doing image processing and the kind of conversions I am doing are (for the moment) integer to integer, so I am pretty safe. In the answer I have implemented, the mixing of instruction data types is managed well (pairs of `d` registers are always coupled in a `q` register) and all memcpys are optimized away. The generated assembly looks just as it should; my remaining doubt is whether what I am doing is totally safe for my use. – Antonio Mar 26 '15 at 09:08
  • I do have a benefit in using NEON, e.g. saving 75% when allowing the compiler to autovectorize, and an extra 33% when doing manual vectorization with intrinsics (total saving about 85%). This holds even though I am missing some instructions and, towards the end, computing 16 values of which only 4 are valid and will be stored. – Antonio Mar 26 '15 at 09:10
  • @Antonio No one gives "totally safe" guarantee in SW world. So if it works for you keep doing what you do. For trivial cases things may just work, for complex cases you may need to do extra care / handling. – auselen Mar 26 '15 at 09:46
  • @auselen I would like a guarantee, apart from errors made by the compiler and from [Cosmic Rays](http://en.wikipedia.org/wiki/Cosmic_ray#Effect_on_electronics) :) :) – Antonio Mar 26 '15 at 11:13
  • I think checking the produced assembly is enough guarantee. – auselen Mar 26 '15 at 11:50