
I'm trying to perform a memory optimization that should be theoretically possible but that I'm starting to doubt is within arm-elf-gcc's capability. Please show me that I'm wrong.

I have an embedded system with a very small amount of main memory, and an even smaller amount of battery-backed nvram. I am storing checksummed configuration data in the nvram so that on boot I can validate the checksum and continue a previous run or start a new run if the checksum is invalid. During the run, I update various fields of various sizes in this configuration data (and it's okay that this invalidates the checksum until it is later recalculated).

All of this runs in physical address space - the normal sram is mapped at one location and the nvram is mapped at another. Here's the rub - all access to the nvram must be done in 32-bit words; no byte or halfword access is allowed (although it's obviously fine in main memory).
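As a concrete illustration of the word-only constraint, a boot-time checksum over the NVRAM region can be computed with nothing but 32-bit reads. (The question doesn't specify the checksum algorithm; a plain word sum is assumed here as a sketch.)

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical word-wise checksum: touches the NVRAM region only with
   aligned 32-bit loads, which is all the hardware allows. */
static uint32_t nvram_checksum(const volatile uint32_t *words, size_t n_words)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n_words; i++)
        sum += words[i];   /* each iteration is one word load */
    return sum;
}
```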

So I can either a) store a working copy of all of my configuration data in main memory, and memcpy it out to the nvram when I recalculate the checksum or b) work with it directly in nvram but somehow convince the compiler that all structs are packed and all accesses must not only be 32-bit aligned, but also 32-bit wide.

Option a) wastes precious main memory, and I would much rather make the runtime tradeoff to save it (although not if the code size ends up wasting more than I save on data size) via option b).

I was hoping that __attribute__ ((packed, aligned(4))) or some variation thereof could help here, but all of the reading and experimenting I have done so far has let me down.

Here's a toy example of the sort of configuration data I'm dealing with:

#define __packed __attribute__ ((packed))
struct __packed Foo
{
    uint64_t foo;
    struct FooFoo foofoo;
};

struct __packed Bar
{
    uint32_t something;
    uint16_t somethingSmaller;
    uint8_t evenSmaller;
};

struct __packed PersistentData
{
    struct Foo foo;
    struct Bar bar;
    /* ... */
    struct Baz baz;
    uint32_t checksum;
};

You can imagine different threads (one each to perform functions Foo, Bar, and Baz) updating their own structures as appropriate, and synchronizing at some point to declare it time to recalculate the checksum and go to sleep.

Eric Angell
    If you do go the memcpy route, do *not* rely on memcpy using 4-byte accesses to do the copy. It doesn't make any such guarantee and I had to track down a really hairy bug once related to a similar embedded system where access alignment mattered. Write your own loop to do the copy if you care about alignment. – Ben Jackson Nov 02 '10 at 00:23
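The warning in the comment above can be sketched as an explicit word-wise copy loop (assuming both buffers are 4-byte aligned and the size is a multiple of 4, which holds for a word-padded struct):

```c
#include <stdint.h>
#include <stddef.h>

/* Copy between main memory and NVRAM using only 32-bit accesses, rather
   than trusting memcpy's (unspecified) access widths. Assumes both
   pointers are 4-byte aligned and n_bytes is a multiple of 4. */
static void copy_words(volatile uint32_t *dst,
                       const volatile uint32_t *src,
                       size_t n_bytes)
{
    for (size_t i = 0; i < n_bytes / 4; i++)
        dst[i] = src[i];
}
```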

4 Answers


Avoid bitfields. They are well known to be a problem with the C language: unreliable, non-portable, and subject to change in implementation at any time. And they won't help you with this problem anyway.

Unions come to mind as well, but I have been corrected enough times on SO: according to the C standard you cannot use unions to change types. Although, like the other poster, I have not yet seen a case where using a union to change types has not worked. Broken bitfields, constantly; broken union memory sharing, so far no pain. In any case, unions won't save you any RAM, so they don't really work here.

Why are you trying to make the compiler do the work? You would need some sort of linker-script-like mechanism at compile time that instructs the compiler to do 32-bit accesses with masks, shifts, and read-modify-writes for some address spaces, and the more natural word, halfword, and byte accesses for others. I have not heard of gcc or the C language having such controls, be it in the syntax, a compiler script, or a definition file of some sort. And if it does exist, it is not used widely enough to be reliable; I would expect compiler bugs and avoid it. I just don't see the compiler doing it, certainly not in a struct kind of manner.

For reads you might get lucky, depends heavily on the hardware folks. Where is this nvram memory interface, inside the chip made by your company, by some other company, on the edge of the chip, etc? A limitation like the one you describe in part may mean the control signals that distinguish access size or byte lanes may be ignored. So an ldrb might look to the nvram as a 32 bit read and the arm will grab the correct byte lane because it thinks it is an 8 bit read. I would do some experiments to verify this, there is more than one arm memory bus and each has many different types of transfers. Perhaps talk to the hardware folks or do some hdl simulations if you have that available to see what the arm is really doing. If you cannot take this shortcut, a read is going to be a ldr with a possible mask and shift no matter how you get the compiler to do it.

Writes other than word sized have to be read-modify-write. ldr, bic, shift, or, str. No matter who does it, you or the compiler.
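The ldr/bic/shift/orr/str sequence described above looks like this in C (a sketch; `lane` selects which byte of the aligned word to replace):

```c
#include <stdint.h>

/* Read-modify-write of one byte within an aligned 32-bit word, using
   only word-sized loads and stores. */
static void rmw_byte(volatile uint32_t *word, unsigned lane, uint8_t val)
{
    uint32_t tmp = *word;                      /* ldr */
    tmp &= ~(0xffu << (lane * 8));             /* bic: clear target byte */
    tmp |= (uint32_t)val << (lane * 8);        /* orr: insert new byte   */
    *word = tmp;                               /* str */
}
```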

Just do it yourself; I cannot see how the compiler will do it for you. Compilers, including gcc, have a hard enough time performing the specific access you think you are telling them to make:

*(volatile unsigned int *)(SOME_ALIGNED_ADDRESS)=some_value;

My syntax is probably wrong because I gave this up years ago, but it does not always produce an unsigned-int-sized store, and when the compiler doesn't want to, it won't. If it cannot do that reliably, how can you expect it to create one flavor of loads and stores for this variable or struct and another flavor for that variable or struct?

So if you have specific instructions you need the compiler to produce, you will fail, you have to use assembler, period. In particular, ldm, ldrd, ldr, ldrh, ldrb, strd, str, strh, strb and stm.

I don't know how much nvram you have, but it seems to me the solution to your problem is to make everything in nvram 32 bits in size. You burn a few extra cycles performing the checksum, but your code space and (volatile) ram usage are at a minimum. Very, very little assembly required (or none if you are comfortable with that).
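That suggestion, applied to the question's toy `Bar` struct, might look like this (a sketch; the widened fields simply waste their upper bits):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical all-word layout for the question's Bar: every field is
   widened to 32 bits, so the compiler can only ever emit word loads
   and stores, and no packing attributes are needed. */
struct BarNvram {
    uint32_t something;
    uint32_t somethingSmaller;   /* only the low 16 bits are used */
    uint32_t evenSmaller;        /* only the low 8 bits are used  */
};
```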

I also recommend trying other compilers if you are worried about that much optimization. At a minimum try gcc 3.x, gcc 4.x, llvm, and rvct, a version of which I think comes with Keil (but I don't know how it compares to the real rvct compiler).

I don't have a feel for how small your binary has to be. If you have to pack stuff into nvram and cannot make it all 32-bit entries, I would recommend several assembler helper functions: one flavor of get32 and put32, two flavors of get16 and put16, and four flavors of get8 and put8. You will know as you are writing the code where things are packed, so you can select directly, or through macros/defines, which flavor of get16 or put8 to call. These functions only need a single parameter, so there is essentially zero code-space cost in using them; the performance cost is a pipe flush on the branch, depending on your flavor of core. What I don't know is: are these 50 or 100 instructions of put and get functions going to break your code-size budget? If so, I wonder if you should be using C at all, in particular gcc.
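The "two flavors of get16 and put16" idea can be sketched in C (hypothetical helper names; the caller knows at build time which halfword of the 32-bit word the value occupies, so no runtime alignment math is needed):

```c
#include <stdint.h>

/* All four helpers touch memory only with aligned 32-bit accesses. */
static uint16_t get16_lo(const volatile uint32_t *w)
{
    return (uint16_t)(*w & 0xffffu);
}

static uint16_t get16_hi(const volatile uint32_t *w)
{
    return (uint16_t)(*w >> 16);
}

static void put16_lo(volatile uint32_t *w, uint16_t v)
{
    *w = (*w & 0xffff0000u) | v;              /* read-modify-write */
}

static void put16_hi(volatile uint32_t *w, uint16_t v)
{
    *w = (*w & 0x0000ffffu) | ((uint32_t)v << 16);
}
```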

And you probably want to use thumb instead of arm if size is that critical, thumb2 if you have it.

I don't see how you would get the compiler to do it for you; it would need to be some compiler-specific pragma, which is likely to be rarely used and buggy if it exists.

What core are you using? I have been working with something in the arm 11 family with an axi bus recently and arm does a really good job of turning sequences of ldrs, ldrbs, ldrhs, etc into individual 32 or 64 bit reads (yes a few separate instructions may turn into a single memory cycle). You might just get away with tailoring your code to the features of the core, depending on the core and where this arm to nvram memory interface lies. Would have to do lots of sims for this though, I only know this by looking at the bus not from any arm documentation.

old_timer
  • Arm 7, so no thumb2, but already using thumb and -Os. Main memory is a few hundred KB, nvram is 1 KB, so using C does actually make sense. I do now believe that all proper solutions here end up as "make explicit getters and setters". Interesting that you say unions are officially bad - I thought they were actually the only way to officially get away with type punning (not using gcc 4.x now, but have used it in the past and had to deal with strict aliasing rules). – Eric Angell Nov 02 '10 at 17:02
  • I have never had problems with the union thing but got blasted on an SO response I gave. I can't find the language now, but the standard says something along the lines that the items in the union must share the same memory space, but also that you can only use the variable last written. Which means you cannot write the double float and read back two unsigned ints to see what the bits are, per the spec; in reality I have not seen a problem. I have seen problems pointing an unsigned int pointer at a double, for example, which is why I use unions. – old_timer Nov 02 '10 at 18:31
  • gcc 4 generates better thumb code than gcc 3.x, I seem to remember; you need 4.x for thumb2, which doesn't apply to you. The 4.x arm code isn't necessarily better than 3.x arm code; it varies a bit. As of around release 27, llvm's output matched gcc 4.x for a few arm tests I was running. llvm out of the box is a cross compiler with many more tuning knobs, including whole-program optimization instead of the per-file/function optimization of gcc. Benchmarks are subjective so ymmv. I have not made the switch wholesale but am trying to. – old_timer Nov 02 '10 at 18:35
  • note, at least with release 26, the llvm gcc frontend thing was not good for cross compiling to a 32 bit (arm) target on a 64 bit computer, it made all of the ints 64 bits causing the arm to use libraries for all of its math. The -m32 switch did not work. I switched to clang and have been pleased over all. Still lots of work to go with llvm, but at least there is an alternative. – old_timer Nov 02 '10 at 18:37
  • Things like bitfields and unions and other things people assume are how the C language works which are in fact implementation defined by gcc, and/or are gccisms, are going to cause big surprises if llvm or anything else gets wider use, assuming from time to time its implementation defined results are different from gccs. – old_timer Nov 02 '10 at 18:39
  • for fun try -O2 instead of -Os. Sometimes just overall optimization makes it smaller than optimize for size. I saw this on green hills in particular, optimize for speed consistently produced smaller binaries than optimize for size. this was a number of years ago now, have not used their product since. – old_timer Nov 02 '10 at 18:40
  • oh, also I don't think llvm's thumb code is that strong yet; the performance tests I tried were arm instructions. ymmv. They do respond much much better to bugs being filed than gcc. With gcc the bug took months even though there were websites describing the problem; with llvm, a bug I filed in the evening against the thumb backend was fixed by noon the next day. – old_timer Nov 02 '10 at 19:38
  • Already been down the -O2 road and -Os is better with the combination of things I'm using. Good to hear about llvm's responsiveness. – Eric Angell Nov 02 '10 at 22:06

The simplest thing to do would be to use a union.

typedef union something {
    struct useful {
        uint8_t one;
        uint8_t two;
    } useful;
    struct useless {
        uint32_t size_force[1];
    } useless;
} something;
void something_memcpy(something* main_memory, something* memory_in_nvram) {
    for(size_t i = 0; i < sizeof(main_memory->useless.size_force) / sizeof(uint32_t); i++) {
        memory_in_nvram->useless.size_force[i] = main_memory->useless.size_force[i];
    }
}

The `1` is just an example - you could probably write some arithmetic to be done at compile time to automatically determine the size. Read and write from NVRam in terms of the useless member, but always access it in main memory in terms of the "real" useful member. This should force the compiler to read and write 32 bits at once (each 32-bit element of the array in the useless struct), but still allow you to easily and type-safely access the real data members.
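The "arithmetic done at compile-time" could look like this (a sketch; the word array is sized from the real struct, rounded up to whole 32-bit words, so both views always cover the same bytes):

```c
#include <stdint.h>

/* The "real" data view. */
struct useful_s {
    uint8_t one;
    uint8_t two;
};

/* The word-forcing view, sized automatically at compile time. */
typedef union something {
    struct useful_s useful;
    uint32_t size_force[(sizeof(struct useful_s) + 3) / 4];
} something;
```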

Puppy
  • I can read and write between main memory and nvram easily by simply allocating a copy of struct PersistentData on the stack or heap, and then "serializing" via `memcpy()` (which will do everything in multiples of 4 bytes at a time). That's option a) above. – Eric Angell Nov 02 '10 at 00:05
  • What I want to get is to never store the data in main memory so the "real" useful access always goes to nvram. If fetching a uint8, that would require an LDR of a 32-bit quantity into a register followed by masking off the top (assuming this particular uint8 is word aligned) 24 bits (LSL 24, ASR 24). If writing a uint8, it would have to be a read-modify-write. – Eric Angell Nov 02 '10 at 00:07
  • Eric: You don't have to keep a working copy of all of it around - you could also work with, say, just the first int of the array, or decay the array into a pointer and access that directly. The point is that while you access through something::useless, the compiler will read and write in 32bits, and you can use a union to do the conversion back to whatever real structure holds your data. – Puppy Nov 02 '10 at 00:12
  • Ah, I see. Essentially you're just describing a way to simplify an interface of "getters and setters" without actually creating such functions. If getters and setters were used for all data access then I wouldn't be storing an entire copy of the persistent data on the heap, I would only have a copy of the currently interesting portion on the stack. That may be the best option available. Unfortunately I already have a bunch of existing code that does the equivalent of option a) (due to a previous hardware design with different hardware restrictions on nvram). – Eric Angell Nov 02 '10 at 00:21
  • I was hoping to convert the existing code simply by moving the global copy of the struct from the heap to live directly in nvram and change the attributes on the struct, but I may have to actually rewrite all the code that touches its data to use getters and setters instead. – Eric Angell Nov 02 '10 at 00:22

Since it's difficult to know what a compiler might do with a bitfield (and sometimes even a union), for safety I'd create some functions that get/set specific-sized data from arbitrary offsets using only aligned read/writes.

Something like the following (untested - not even compiled) code:

uint8_t nvram_get_u8( uint8_t const* p)
{
    uint32_t const* p32 = (uint32_t const*) (((uintptr_t) p) & ~(uintptr_t) 0x03);  // get a 32-bit aligned pointer
    int bit_offset = (((uintptr_t) p) & 0x03) * 8;      // get the bit offset of the byte
                                                        //      we're interested in

    uint8_t val = ((*p32) >> bit_offset) & 0xff;

    return val;
}


void nvram_set_u8( uint8_t* p, uint8_t val)
{
    uint32_t* p32 = (uint32_t*) (((uintptr_t) p) & ~(uintptr_t) 0x03);  // get a 32-bit aligned pointer
    int bit_offset = (((uintptr_t) p) & 0x03) * 8;      // get the bit offset of the byte
                                                        //      we're interested in

    uint32_t tmp = *p32;

    tmp &= ~(((uint32_t) 0xff) << bit_offset);  // clear the byte we're writing
    tmp |= ((uint32_t) val) << bit_offset;      // and 'or' in the new data

    *p32 = tmp;
}

Now you can read/write something like myBar.evenSmaller (assuming that myBar has been laid out by the linker/loader such that it's in the NVRAM address space) like so:

uint8_t evenSmaller = nvram_get_u8( &myBar.evenSmaller);

nvram_set_u8( &myBar.evenSmaller, 0x5a);

Of course, the functions that deal with larger data types might be more complex since they could straddle 32-bit boundaries (if you're packing the structs to avoid unused space taken up by padding). If you're not interested in speed, they can build on the above functions that read/write single bytes at a time to help keep those functions simple.
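A 16-bit getter built on the byte getter, as suggested, might look like this (a sketch; it assumes a little-endian target, as ARM systems here typically are):

```c
#include <stdint.h>

/* Byte reader: one aligned 32-bit load plus a shift and mask. */
static uint8_t nvram_get_u8(uint8_t const* p)
{
    uint32_t const* p32 = (uint32_t const*)(((uintptr_t) p) & ~(uintptr_t) 0x03);
    int bit_offset = (((uintptr_t) p) & 0x03) * 8;
    return (uint8_t)((*p32 >> bit_offset) & 0xff);
}

/* 16-bit reader built on the byte reader: simple, and it still works
   when the halfword straddles a 32-bit boundary, at the cost of up to
   two word loads. */
static uint16_t nvram_get_u16(uint8_t const* p)
{
    return (uint16_t)(nvram_get_u8(p) | ((uint16_t) nvram_get_u8(p + 1) << 8));
}
```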

In any case, if you have multiple threads/tasks reading and writing the NVRAM concurrently, you'll need to synchronize the accesses to keep the non-atomic writes from getting corrupted or causing corrupted reads.

Michael Burr

You can probably do it if you make everything a bitfield:

struct __packed Bar
{
    uint32_t something;
    uint32_t somethingSmaller:16;
    uint32_t evenSmaller:8;
    uint32_t pad:8;  // not strictly necessary but will help with your sanity
};

However you might be outsmarted by your compiler. You'd have to check the resulting assembly.

Ben Jackson
  • I thought that bitfields were only supported by C++? – Puppy Nov 01 '10 at 23:59
  • Nope, bitfields have been around in C for a long time. – Ben Jackson Nov 02 '10 at 00:02
  • Intriguing, but outsmarted by the compiler - it still ends up using STRB instructions. – Eric Angell Nov 02 '10 at 00:16
  • Well, you could try ensuring that the bitvectors *don't* align to bytes. So, throw in a `hack:1` at the top and reduce pad to `pad:7`. Then double check what is happening to the smallest access ("evenSmaller" to see if it does a 16-bit access). – Ben Jackson Nov 02 '10 at 00:21