Odd behaviour using float pointer vs. creating a float variable on ARM

Question

I have created an uint8_t array which has data packed into a dense structure which the microcontroller parses to extract variables. I have found a workaround for now, but I am curious as to why my previous code might have failed.

The following function is what I was trying to use:

/*
 * Input Configuration Buffer:
 *      Byte 0:
 *          Bit 0: Input Joystick Left (0) or Input Joystick Right (1)
 *          Bit 1: Invert X-Axis (1)
 *          Bit 2: Invert Y-Axis (1)
 *          Bit 3: Output Joystick Left (0) or Output Joystick Right (1)
 *          Bits 4-7: Don't Care
 *      Byte 1: X-Deadzone 4th-Byte (float)
 *      Byte 2: X-Deadzone 3rd-Byte (float)
 *      Byte 3: X-Deadzone 2nd-Byte (float)
 *      Byte 4: X-Deadzone 1st-Byte (float)
 *      Byte 5: Y-Deadzone 4th-Byte (float)
 *      Byte 6: Y-Deadzone 3rd-Byte (float)
 *      Byte 7: Y-Deadzone 2nd-Byte (float)
 *      Byte 8: Y-Deadzone 1st-Byte (float)
 */
void Controller_Config_MapInputJoystickAsJoystick(Controller_HandleTypeDef *c, uint8_t *ic_buffer){
    uint8_t js_in = GET_BIT(ic_buffer[0], 0);
    uint8_t invert_x = GET_BIT(ic_buffer[0], 1);
    uint8_t invert_y = GET_BIT(ic_buffer[0], 2);
    uint8_t js_out = GET_BIT(ic_buffer[0], 3);
    float *deadzone_x = (float *)(&ic_buffer[1]);
    float *deadzone_y = (float *)(&ic_buffer[5]);
    float val_x = invert_x ? -joysticks[js_in].x.val : joysticks[js_in].x.val;
    float val_y = invert_y ? -joysticks[js_in].y.val : joysticks[js_in].y.val;
    if((val_x > *deadzone_x) || (val_x < -*deadzone_x)){
        c->joysticks._bits[js_out*2 + 0] += (int16_t)(val_x * -INT16_MIN);
    }
    if((val_y > *deadzone_y) || (val_y < -*deadzone_y)){
        c->joysticks._bits[js_out*2 + 1] += (int16_t)(val_y * -INT16_MIN);
    }
}

When this code runs, it seems to run fine until it tries to compare (val_x > *deadzone_x) || (val_x < -*deadzone_x). At this point, the microcontroller enters a HardFault condition.

This code is compiled as (asm):

ldr     r3, [r7, #24]
vldr    s15, [r3]
vldr    s14, [r7, #16]
vcmpe.f32       s14, s15
vmrs    APSR_nzcv, fpscr
bgt.n   0x80022c6 <Controller_Config_MapInputJoystickAsJoystick+230>
ldr     r3, [r7, #24]
vldr    s15, [r3]
vneg.f32        s15, s15
vldr    s14, [r7, #16]
vcmpe.f32       s14, s15
vmrs    APSR_nzcv, fpscr
bpl.n   0x8002302 <Controller_Config_MapInputJoystickAsJoystick+290>

In the assembly above, the microcontroller HardFaults on the instr, vldr s15, [r3] at which point, the s15 register is 1.00000238 and the r3 register is 0x802007d (This is the address of the float *deadzone_x).

I am concerned as to why this might be failing? I would love to use this function using a float * without casting a new float variable of a pointer. The instruction difference is considerable and would save a lot of processing time.

I don't know whether this is a general C programming misconception or perhaps something to do with the compiled ARM assembly? Any help would be appreciated.

I can provide you with any data, views or whatever might help. Thanks.

Edit 1:

This is workaround I am using, not much difference. I just dereference the pointer. You hate to C it.

/*
 * Input Configuration Buffer:
 *      Byte 0:
 *          Bit 0: Input Joystick Left (0) or Input Joystick Right (1)
 *          Bit 1: Invert X-Axis (1)
 *          Bit 2: Invert Y-Axis (1)
 *          Bit 3: Output Joystick Left (0) or Output Joystick Right (1)
 *          Bits 4-7: Don't Care
 *      Byte 1: X-Deadzone 4th-Byte (float)
 *      Byte 2: X-Deadzone 3rd-Byte (float)
 *      Byte 3: X-Deadzone 2nd-Byte (float)
 *      Byte 4: X-Deadzone 1st-Byte (float)
 *      Byte 5: Y-Deadzone 4th-Byte (float)
 *      Byte 6: Y-Deadzone 3rd-Byte (float)
 *      Byte 7: Y-Deadzone 2nd-Byte (float)
 *      Byte 8: Y-Deadzone 1st-Byte (float)
 */
void Controller_Config_MapInputJoystickAsJoystick(Controller_HandleTypeDef *c, uint8_t *ic_buffer){
    uint8_t js_in = GET_BIT(ic_buffer[0], 0);
    uint8_t invert_x = GET_BIT(ic_buffer[0], 1);
    uint8_t invert_y = GET_BIT(ic_buffer[0], 2);
    uint8_t js_out = GET_BIT(ic_buffer[0], 3);
    float deadzone_x = (float)(*(float *)(&ic_buffer[1]));
    float deadzone_y = (float)(*(float *)(&ic_buffer[5]));
    float val_x = invert_x ? -joysticks[js_in].x.val : joysticks[js_in].x.val;
    float val_y = invert_y ? -joysticks[js_in].y.val : joysticks[js_in].y.val;
    if((val_x > deadzone_x) || (val_x < -deadzone_x)){
        c->joysticks._bits[js_out*2 + 0] += (int16_t)(val_x * -INT16_MIN);
    }
    if((val_y > deadzone_y) || (val_y < -deadzone_y)){
        c->joysticks._bits[js_out*2 + 1] += (int16_t)(val_y * -INT16_MIN);
    }
}

Debugger Output of Working Code

Debugger Output of Non-Working Code

Edit 2:

Ill-Alignment

Your pointers are likely ill-aligned: `float *deadzone_x = (float *)(&ic_buffer[1]);` Depending on CPU architecture, different types of data must be located at addresses that apply some alignment, e.g. 32bit values need to be located on multiple of 2 or 4 bytes. That is why structures can contain padding and why such a cast is not portable. You might need to extract the content of your buffer byte wise. — Gerhardh, Feb 16 '22 at 16:49
I was thinking this as well, but during debug, the float pointer is properly interpreting the data. In this instance, I wanted `*deadzone_x` to equal `0.05f` and the code found it. The data structure is compiled on my PC and then sent via USB to the microcontroller. I made sure that zero padding was enable and little-endian. I will edit the post to show the workaround. — Tanner Hollis, Feb 16 '22 at 17:07
The debugger is able to read bytewise. Your CPU will issue a 32 bit access instead. I am not sure what you mean by "zero padding" but as you have a `float` starting at offset 1, these are obviously not aligned. Do you mean "no padding"? Anyway I am surprised that assigning the content to a `float` variable makes a difference. The source for this assignment is also ill-aligned in the same way. — Gerhardh, Feb 16 '22 at 17:16
I think you're onto something, in the non-working code the instr `vldr` calls a load of the float located at `0x802007d` (ill-aligned). This fails. When I ran the working code, the float is then aligned in RAM at address `0x2001ff30`. This passes. Is this what you're referring to? (See edit) — Tanner Hollis, Feb 16 '22 at 17:46
Not really, the `float` variable itself is properly aligned, of course. But that is not the issue. Your buffer is still at `0x802007c` which will cause `loat deadzone_x = (float)(*(float *)(&ic_buffer[1]));` to read from `0x802007d`. BTW: The first cast is not required at all. Dereferencing a `float*` already produces a `float`. — Gerhardh, Feb 16 '22 at 18:13

John Bollinger · Answer 1 · 2022-02-16T19:57:54.303

I am concerned as to why this might be failing?

Very likely, @Gerhardh has it right in comments: your particular conversions of uint8_t * to type float * result in pointers that are not correctly aligned for type float (alignment requirements are type- and architecture-specific). C has this to say about that:

A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined.

(C17, 6.3.2.3/7)

Your debugging demonstrates that the address in question is odd, whereas it is entirely plausible that your hardware requires float accesses to be aligned on 4-byte boundaries.

Moreover, accessing a slice of your byte array as if it were a float also violates C's strict aliasing rule (C17 6.5/7):

An object shall have its stored value accessed only by an lvalue expression that has one of the following types:

a type compatible with the effective type of the object,

a qualified version of a type compatible with the effective type of the object,

a type that is the signed or unsigned type corresponding to the effective type of the object,

a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,

an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or

a character type.

Undefined behavior results on account of this violation as well. In a sense, this is the more fundamental issue, because if you were not violating the SAR then there would be no opportunity for a pointer conversion with a misaligned result.

I would love to use this function using a float * without casting a new float variable of a pointer. The instruction difference is considerable and would save a lot of processing time.

The difference between the code that breaks and the one that doesn't is not casting, but the fact that you immediately dereference the converted pointer, as opposed to storing it in a variable and dereferencing it later. The (float) casts are redundant. The immediate dereferencing does not make your program's behavior defined, but apparently it does induce the compiler to produce code that works correctly.

The difference in instruction count between machine code that does not work and machine code that does work is absolutely irrelevant. You do not save processing time in any useful sense by executing code that causes a hard fault. Furthermore, greater instruction count does not necessarily mean slower code.

In any case, the correct thing to do here as far as the C language is concerned is to copy the data out to your float byte by byte. If you have the C library available then memcpy() would be the conventional way to do this, and many compilers will recognize that usage and optimize the function call away:

    float deadzone_x;
    memcpy(&deadzone_x, ic_buffer + 1, sizeof(deadzone_x));

If you are building for a freestanding implementation that does not provide memcpy(), then I would write my own for this purpose:

inline void my_memcpy(void *dest, const void *src, size_t size) {
    char *d = dest;
    const char *s = src;
    for (size_t i = 0; i < size; i++) {
        *d++ = *s++;
    }
}

Probably the compiler will indeed inline that for you (not guaranteed by the inline keyword), but if not then you could inline it manually.

Of course, if you were writing to the packed buffer then you would need to perform a similar copy in the other direction.

On a somewhat larger scale, it is possible that you would benefit from unpacking the array into a standard C structure, once, and then passing around and working with that instead of working directly on the packed buffer. This is of course very situation dependent.

Thank you for this reply, I was not aware of any of these constructs about the C language. I didn't even notice I was using the float cast for a float declaration, I was neck deep in C pointers and dereferencing already. I do wonder what the compiler actually did to make this work. To your point about copying byte by byte to the float. I was always under the impression that a for loop would generate more variables and increase the instruction count. Or perhaps this is what he `inline` keyword is for? — Tanner Hollis, Feb 16 '22 at 18:36
And to your structure comment, there are 28 different configurations that are executed on the fly. Without creating 28 different structure for each configuration, I will take the hit on execution time by extracting variables on the fly. — Tanner Hollis, Feb 16 '22 at 18:45
@TannerHollis, the immediate dereference in the second version of your program probably causes the compiler to generate code for a bytewise copy similar to what the function I presented performs. This would be either for optimization purposes or for working-properly purposes, which brings us back around to the points that (i) more instructions does not necessarily imply worse performance, and (ii) short, broken code is categorically worse than longer, working code. — John Bollinger, Feb 16 '22 at 19:54
@TannerHollis: If you can't avoid unaligned `float` data, you *could* `memcpy(&myfloat, p, sizeof(float))` to force the compiler to do whatever is necessary for an unaligned load. (This may not be efficient if targeting HW that doesn't do it for free! Another thing you can do is GNU C `typedef float unaligned_float __attribute__((aligned(1))` and use `unaligned_float *`; see the later section of [Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?](https://stackoverflow.com/q/47510783) - that may be less inefficient than a non-inline memcpy.) — Peter Cordes, Feb 16 '22 at 19:57

Odd behaviour using float pointer vs. creating a float variable on ARM

1 Answers1