I have been able to use a 64-bit copy on equal-sized uint32_t arrays for a performance gain, and I want to do the same when loading a sequence of 16 uint32_t variables from a uint32_t[16] array. I cannot replace the variables with an array, because that causes a performance regression.
I noticed that the compiler assigns consecutive addresses to a series of declared uint32_t variables, but in reverse: the last variable declared gets the lowest address, and the addresses increase by 4 bytes up to the first declared variable. I tried taking the address of that final variable as the destination start and casting it to a uint64_t * pointer, but this did not work. The elements of the uint32_t[16] array, however, are laid out in ascending order.
Here is an example of my most recent attempt.
uint32_t x00,x01,x02,x03,x04,x05,x06,x07,x08,x09,x10,x11,x12,x13,x14,x15;
uint64_t *Bu64ptr = (uint64_t *) B;
uint64_t *x15u64ptr = (uint64_t *) &x15;
/* This is an inline function that does 64-bit eqxor on two uint32_t[16]
& stores the results in uint32_t B[16]*/
salsa8eqxorload64(B,Bx);
/* Trying to 64-bit copy here */
*x15u64ptr++ = *Bu64ptr++;
*x15u64ptr++ = *Bu64ptr++;
*x15u64ptr++ = *Bu64ptr++;
*x15u64ptr++ = *Bu64ptr++;
*x15u64ptr++ = *Bu64ptr++;
*x15u64ptr++ = *Bu64ptr++;
*x15u64ptr++ = *Bu64ptr++;
*x15u64ptr++ = *Bu64ptr++;
Am I pursuing the impossible, or is my lack of skill getting in the way again? I checked the addresses held in &x15 and x15u64ptr using the method below, and they are completely different.
printf("x15u64ptr %p\n", (void *) x15u64ptr);
printf("x15 %p\n", (void *) &x15);
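One possible explanation for the differing addresses, assuming the printf calls run after the eight copies: each post-increment moves x15u64ptr forward by 8 bytes, so by that point it no longer holds &x15 but an address 64 bytes past it. The sketch below only measures that movement on a scratch buffer; bytes_advanced is a hypothetical helper, not part of the code above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: reports how many bytes a uint64_t pointer has moved
   after `steps` post-increments (steps must be 8 or fewer). Nothing is
   dereferenced, so no aliasing question arises here. */
static size_t bytes_advanced(size_t steps)
{
    static uint64_t buf[8];          /* 64 bytes of backing storage */
    uint64_t *p = buf;

    for (size_t i = 0; i < steps; i++)
        p++;                         /* each increment moves sizeof(uint64_t) bytes */

    return (size_t)((const char *)p - (const char *)buf);
}
```

Printing the pointer before the first copy, or printing a second pointer that is never incremented, would show whether the cast itself changed the address.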
I had one idea: create an array, use the x?? variables as pointers to its individual elements, and then perform the 64-bit copy between the two arrays, which I hoped would assign the values to the uint32_t variables that way. However, the compiler rejected it with a message about an invalid lvalue in the = assignment. Maybe I am getting the syntax wrong. Using 64-bit memcpy alternatives and a custom 64-bit eqxor, I have increased the performance of the hashing function by over 10%, and I expect this change to give another 5-10% improvement if I can get it to work.
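For reference, a minimal sketch of the pointers-into-an-array idea, assuming the lvalue error came from assigning to the address of a variable (something like `&x00 = &X[0]`). Here the names are declared as pointers and initialised at declaration instead. named_elements_demo and the local X are hypothetical, and the sketch does not establish whether this performs as well as plain scalar variables.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical demo: loads B into backing storage X, then reads and writes
   individual words through named pointers instead of separate variables. */
static uint32_t named_elements_demo(const uint32_t B[16])
{
    uint32_t X[16];

    /* The pointers are set when declared; "&x00 = ..." would not compile,
       because the result of the & operator is not an lvalue. */
    uint32_t *const x00 = &X[0];
    uint32_t *const x15 = &X[15];

    memcpy(X, B, sizeof X);   /* bulk load into the backing array */
    *x00 += *x15;             /* the names are used by dereferencing */
    return *x00;
}
```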
UPDATE 13-09-2018
I ended up using a struct and then a NEON-based operation. This gives 20% better performance than the original, which used 32-bit code and memcpy. I was also able to extend the technique to the add-and-save and eqxor operations that salsa20/8 uses.
struct XX
{
uint32_t x00, x01, x02, x03, x04, x05, x06, x07, x08, x09, x10, x11, x12,x13,x14,x15;
} X;
#include <arm_neon.h>

/* dst and src must each point to 32 uint32_t (128 bytes). Only 8 operations
   are needed, one per 128-bit vector, though NEON really only does 64 bits
   at a time. */
static inline void memcpy128neon(uint32_t * __restrict dst, const uint32_t * __restrict src)
{
uint32x4_t *s1 = (uint32x4_t *) dst;
const uint32x4_t *s2 = (const uint32x4_t *) src;
*s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++;
*s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++; *s1++ = *s2++;
}
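Then invoke it like this (the destination must provide 128 bytes, so a struct with only 16 uint32_t members, as shown above, would need four copies instead of eight):
memcpy128neon(&X.x00, arr);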
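A variant of the same idea for exactly 16 words could look like the sketch below. It is not the code I benchmarked: union X16 and copy16_words are hypothetical names, and the union wrapper is an assumption added so that the vector stores go through a real uint32_t[16] member rather than past the end of x00. It uses the vld1q_u32 and vst1q_u32 intrinsics from arm_neon.h when NEON is available and falls back to memcpy otherwise.

```c
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical wrapper: the same 64 bytes viewed as named words (v) or as an array (w). */
union X16 {
    struct {
        uint32_t x00, x01, x02, x03, x04, x05, x06, x07,
                 x08, x09, x10, x11, x12, x13, x14, x15;
    } v;
    uint32_t w[16];
};

/* Guard the assumption that the named members are packed without padding. */
_Static_assert(sizeof(union X16) == 16 * sizeof(uint32_t),
               "unexpected padding in union X16");

/* Copies 16 words from src into dst: four 128-bit moves with NEON, memcpy otherwise. */
static inline void copy16_words(union X16 *dst, const uint32_t src[16])
{
#if defined(__ARM_NEON)
    for (int i = 0; i < 16; i += 4)
        vst1q_u32(dst->w + i, vld1q_u32(src + i));
#else
    memcpy(dst->w, src, sizeof dst->w);
#endif
}
```

After `copy16_words(&X, arr);` the values are read as X.v.x00 through X.v.x15. I have not measured whether this matches the speed of the pointer-cast version.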
UPDATE 16-10-2018
I found this macro, which allows union casting:
#define UNION_CAST(x, destType) \
(((union {__typeof__(x) a; destType b;})x).b)
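Here is an example of creating a pointer to a 1024-bit type, using a custom type based on Arm's NEON uint32x4_t vector with an array of 8 elements, but any data type can be used. The aim is to make the cast compliant with strict aliasing. Note that casting to a union type is a GNU C extension, and the macro only converts the pointer value, so accesses made through the resulting pointer are still subject to the aliasing rules.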
uint32x4x8_t *pointer = (uint32x4x8_t *) UNION_CAST(originalpointer, uint32x4x8_t *);
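If strict aliasing is the concern, one alternative that stays within standard C is to move each 64-bit chunk with memcpy instead of dereferencing a cast pointer. The sketch below assumes 16-word buffers that do not overlap. copy16_as_u64 is a hypothetical name, and although optimizing compilers commonly turn small fixed-size memcpy calls into plain loads and stores, that is worth checking in the generated assembly rather than taking for granted.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies 16 uint32_t words as eight 64-bit moves.
   Each chunk goes through a uint64_t temporary via memcpy, so no object
   is ever accessed through an lvalue of an incompatible type. */
static inline void copy16_as_u64(uint32_t *restrict dst, const uint32_t *restrict src)
{
    for (int i = 0; i < 16; i += 2) {
        uint64_t t;
        memcpy(&t, src + i, sizeof t);   /* 64-bit load from two adjacent words */
        memcpy(dst + i, &t, sizeof t);   /* 64-bit store to two adjacent words */
    }
}
```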