Initialising global variables in C in Harvard CPU

Question

I built a 32-bit RISC-V CPU with a Harvard architecture and I want to write programs for it in C. I have a RISC-V compiler toolchain (https://xpack.github.io/riscv-none-embed-gcc/) that can do just that and works fine for most things. The problem starts when I want to work with initialised global variables, global arrays, etc., because the start script normally copies these to RAM on boot/reset.

Here is a block diagram of my CPU: (This will be important later. Just note that the Instruction memory = FLASH and Data memory = RAM)

(If you are interested in my CPU, I made a video about it: https://www.youtube.com/watch?v=KzSaFFpBPDM)

Example:

A typical program will look something like this:

#include <stdint.h>

int static_var_1 = 2;
int static_var_2 = 4;

int main(void)
{
    int var = static_var_1 + static_var_2;
}

And its objdump output looks something like this:

/opt/xpack-riscv-none-embed-gcc-10.1.0-1.1/riscv-none-embed/bin/objdump build/APP.elf -D

build/APP.elf:     file format elf32-littleriscv


Disassembly of section .text:

00000000 <_start>:
   0:   00080137            lui sp,0x80
   4:   ffc10113            addi    sp,sp,-4 # 7fffc <_estack>
   8:   00c000ef            jal ra,14 <main>
   c:   0040006f            j   10 <_exit>

00000010 <_exit>:
  10:   0000006f            j   10 <_exit>

00000014 <main>:
  14:   fe010113            addi    sp,sp,-32
  18:   00812e23            sw  s0,28(sp)
  1c:   02010413            addi    s0,sp,32
  20:   00002703            lw  a4,0(zero) # 0 <_start>
  24:   00402783            lw  a5,4(zero) # 4 <static_var_2>
  28:   00f707b3            add a5,a4,a5
  2c:   fef42623            sw  a5,-20(s0)
  30:   00000793            li  a5,0
  34:   00078513            mv  a0,a5
  38:   01c12403            lw  s0,28(sp)
  3c:   02010113            addi    sp,sp,32
  40:   00008067            ret

Disassembly of section .data:

00000000 <static_var_1>:
   0:   0002                    c.slli64    zero
    ...

00000004 <static_var_2>:
   4:   0004                    0x4
    ...

Disassembly of section ._user_heap_stack:

00000008 <._user_heap_stack>:
    ...

(<_start> is the part of my start script that initialises the stack pointer.)

The catch:

These are the two instructions that try to load the global variables:

  20:   00002703            lw  a4,0(zero) # 0 <_start>
  24:   00402783            lw  a5,4(zero) # 4 <static_var_2>

But there is a problem - they were never put into RAM, so the CPU will most likely end up with some garbage data, which is unacceptable.

The solution?

Somebody suggested linker relaxation in response to my previous question (RISC-V: Global variables). Again, that doesn't seem to be the issue here, but I could still be wrong!

From my research, most "classic" CPUs use a start script in which the copying takes place. But as this is not a von Neumann architecture, the FLASH memory is not mapped into data memory and therefore cannot be read by the program (see the block diagram above). The output program must contain the variables already encoded as executable instructions. For example, if we want the value 0x4 in RAM at address 0x0, it can be encoded as:

addi t0, zero, 0x4
sw t0, 0(zero)
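
For values that do not fit in the 12-bit signed immediate of addi, the same idea needs a lui/addi pair, and the split is slightly subtle because addi sign-extends its immediate. The following is a minimal sketch of that split in C, meant to run on the build host rather than on the CPU. The names `imm_parts`, `split_imm32` and `rebuild_imm32` are made up for illustration and are not part of any toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* The two immediates needed to build one 32-bit constant on RV32. */
struct imm_parts {
    uint32_t hi20;  /* operand for lui (upper 20 bits, after rounding) */
    int32_t  lo12;  /* operand for addi, in the range -2048..2047 */
};

/* Value a register holds after "lui rd, hi20; addi rd, rd, lo12". */
uint32_t rebuild_imm32(struct imm_parts p)
{
    return (p.hi20 << 12) + (uint32_t)p.lo12;
}

/* Split a constant so that lui followed by addi reproduces it exactly. */
struct imm_parts split_imm32(uint32_t value)
{
    struct imm_parts p;
    uint32_t low = value & 0xFFFu;

    /* Sign-extend the low 12 bits the same way addi does. */
    p.lo12 = (low >= 0x800u) ? (int32_t)low - 4096 : (int32_t)low;
    /* Adding 0x800 before the shift compensates for a negative lo12. */
    p.hi20 = ((value + 0x800u) >> 12) & 0xFFFFFu;

    /* Self-check: the pair must rebuild the original value. */
    assert(rebuild_imm32(p) == value);
    return p;
}
```

For the value 0x4 used above, `hi20` comes out as 0, so the lui can be dropped and only the addi/sw pair remains.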

Rebuilding my CPU as a von Neumann machine would require many more gates and ICs, and this is a discrete build where every IC counts.

As I stated above, doing it in hardware is the worst solution for me, so if it can be done in software I'm all for it, and it can! There is an obvious solution, but it is by far the ugliest: compile the code, extract the data (with Python), generate a new startup script in which the Python script has encoded these variables as instructions, and compile it again.
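
A rough sketch of the generator step in that route, written in C rather than Python. It assumes the contents of .data are already available as an array of 32-bit words (for instance from a raw dump made with `objcopy -O binary -j .data`). The function name `emit_data_init`, the helper `append_line` and the choice of t0/t1 are assumptions for illustration. It relies on the assembler's `li` pseudo-instruction to choose lui/addi as needed, and it does not remove the second assemble-and-link pass.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Append one formatted line to buf; returns 0 on success, -1 if it does not fit. */
static int append_line(char *buf, size_t cap, size_t *len, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(buf + *len, cap - *len, fmt, ap);
    va_end(ap);
    if (n < 0 || (size_t)n >= cap - *len)
        return -1;
    *len += (size_t)n;
    return 0;
}

/*
 * Write RISC-V assembly into buf that stores words[0..count-1] to RAM
 * starting at ram_base. Returns the number of assembly lines written,
 * or -1 if buf is too small.
 */
int emit_data_init(char *buf, size_t cap, const uint32_t *words,
                   size_t count, uint32_t ram_base)
{
    size_t len = 0;
    int lines = 0;
    uint32_t base = ram_base;  /* address currently held in t1 */
    uint32_t off = 0;          /* byte offset from that address */

    assert(buf != NULL && cap > 0);
    assert((ram_base & 3u) == 0);  /* sw needs a word-aligned address */

    if (append_line(buf, cap, &len, "    li t1, 0x%lx\n", (unsigned long)base))
        return -1;
    lines++;

    for (size_t i = 0; i < count; i++) {
        if (off > 2044u) {  /* sw offsets are 12-bit signed: reload the base */
            base += off;
            off = 0;
            if (append_line(buf, cap, &len, "    li t1, 0x%lx\n", (unsigned long)base))
                return -1;
            lines++;
        }
        if (words[i] == 0) {
            /* Zero needs no constant: store the zero register directly. */
            if (append_line(buf, cap, &len, "    sw zero, %lu(t1)\n", (unsigned long)off))
                return -1;
            lines++;
        } else {
            if (append_line(buf, cap, &len, "    li t0, 0x%lx\n", (unsigned long)words[i]) ||
                append_line(buf, cap, &len, "    sw t0, %lu(t1)\n", (unsigned long)off))
                return -1;
            lines += 2;
        }
        off += 4u;
    }
    return lines;
}
```

For the two variables in the example (values 2 and 4 at RAM address 0), this writes five lines: one `li t1` for the base address, then a `li t0` / `sw t0` pair per variable.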

I really don't want to go that route, so if it can be done by modifying the startup script, linker, etc., that would be really, really great.

AVR ICs are basically Harvard architecture (though modified), so do they do something differently that we can learn from?
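
For comparison, AVR parts have an lpm instruction that loads a byte from program memory, which is the "load-program-memory" approach raised in the comments. Below is a host-side sketch of the kind of copy loop such an instruction allows. `flash_load_word` is a hypothetical stand-in for an instruction that this CPU does not currently have, and `sim_flash` merely simulates FLASH contents so the code can run on a PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated program memory holding the initial values of .data. */
static const uint32_t sim_flash[] = { 2u, 4u };

/*
 * Hypothetical stand-in for a "load word from program memory" instruction.
 * On real hardware this would be a single opcode, not a function.
 */
uint32_t flash_load_word(uint32_t byte_addr)
{
    assert((byte_addr & 3u) == 0);
    assert(byte_addr / 4u < sizeof sim_flash / sizeof sim_flash[0]);
    return sim_flash[byte_addr / 4u];
}

/* Startup-style copy: the code is fixed, only the arguments change. */
void copy_data_from_flash(uint32_t load_addr, uint32_t *ram_start, uint32_t *ram_end)
{
    for (uint32_t *dst = ram_start; dst < ram_end; dst++, load_addr += 4u)
        *dst = flash_load_word(load_addr);
}
```

The cost is one extra data path from FLASH into a register, but the startup code stays the same size no matter how much initialised data the program has.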

I wonder if you can write a program to extract the .data section from the file, and convert it to assembly code. I think most real-world Harvard architectures are actually "modified Harvard" where they still have an instruction to read program memory as data. — user253751, Jul 15 '21 at 12:28
" they were never put into RAM" --> so why is that a problem with the posted program? `var` value is not used. Better to post code that does something with `var` that demos a real problem. — chux - Reinstate Monica, Jul 15 '21 at 13:08
@user253751 Yep, I briefly touched on this in the last chapter, but if I do so, I need to compile the code again, which is really not elegant — Filip, Jul 15 '21 at 13:33
@chux-ReinstateMonica I understand, but "what should it do"? the point of this example is that the value of `var` will be wrong, so any program where I need to work with this kind of variable will be wrong. For example if I define PI, ASCII table, etc, all of those will be unusable — Filip, Jul 15 '21 at 13:36
@Filip Post code that truly demos the problem. Per the C spec, this oversimplification does not certainly extrapolate to "this kind of variable will be wrong.". — chux - Reinstate Monica, Jul 15 '21 at 13:40
In your sample code there is no visible output so the compiler might have optimized out some variables. It should at least print the result to avoid those kind of optimizations. Even better, make the result dependent on a command line parameter. — dbush, Jul 15 '21 at 13:44
@dbush The compiler is currently set to -O0; to not optimize. I know it would be better, but my CPU does not really offer a print command. I can tell it to put the value to output port, but this is practically the same as leaving it in the variable by itself - I think this is a good & bare minimum example of what is going on, but thank you for the suggestion! — Filip, Jul 15 '21 at 13:51
@dbush We can see the .data section in objdump. We can see the compiler didn't delete these variables. The fact that it could have deleted these variables is irrelevant. — user253751, Jul 15 '21 at 14:19
@user253751 To your first comment - I was hoping that something could be done right in the linker/start scripts, where I would maybe create some loop, but instead of copying the data from FLASH to RAM, I would enter the data right from the linker, but it seems I cannot access this data at that stage of compilation — Filip, Jul 15 '21 at 15:40
@Filip I suspect you may have to modify ld to do this. I don't think it can be done with a linker script. I would try writing a program to do the conversion in the elf file instead, directly in machine code (since assembly code would mean running the assembler and linking an extra time). You can use objcopy to extract specific sections and insert specific sections or you can parse the ELF file headers. — user253751, Jul 15 '21 at 15:47
@user253751 Maybe this is the only way to go. But if I modify the ELF afterwards, there is a possibility that some instructions using absolute addressing would have to be recalculated. — Filip, Jul 15 '21 at 16:09
AFAIK, the normal way for Harvard machines to do this is via a load-program-memory instruction to allow writing a function that copies from Flash to RAM. If you can add that to your RISC-V, that would make things much more efficient, and just require one extra helper function to be called early in CRT startup with variable start/end args but fixed code. (i.e. you don't have to convert your data into blocks of `lui/addi` / `sw`) — Peter Cordes, Jul 15 '21 at 18:24
@PeterCordes Thank you for the clarification! I have thought about it and adding hardware to support loading from FLASH will totally break the flow of my CPU, with at least two dozen of additional ICs (and I'm being very generous here) and some 32b BUSes, which will drastically increase the complexity for just one instruction. The best solution here would be to make it in software, though not as elegantly as I would hope, but doable. — Filip, Jul 15 '21 at 20:31
Remember, a load-program-memory instruction doesn't have to be *efficient*, since it only has to happen once in the copy loop. It could be super slow, like serializing the pipeline, if that helps at all. Or could you add an external DMA chip (or logic block inside the CPU) that can take over both busses and do a copy given start/end ranges? Then you don't need logic paths to get flash data into CPU registers. Obviously pure software with immediates is possible, but costs 12 bytes of machine code (lui/addi/sw) for every 4 bytes of data. — Peter Cordes, Jul 15 '21 at 20:38
