
I'm developing a general-purpose image-processing core for FPGAs and ASICs. The idea is to interface a standard processor with it. One of the problems I have is how to "program" it. Let me explain: the core has an instruction decoder for my "custom" extensions. For instance:

vector_addition $vector[0], $vector[1], $vector[2]    // (i.e. v2 = v0+v1) 

and many more like that. This operation is sent by the processor through the bus to the core, using the processor for loops, non-vector operations, etc., like this:

for (i = 0; i < 15; i++)      // to be executed in the processor
     vector_add(v0, v1, v2);  // to be executed in my custom core

The program is written in C/C++. The core only needs the instruction itself, in machine code:

  1. opcode = vector_add = 0x12
  2. register_src_1 = v0 = 0x00
  3. register_src_2 = v1 = 0x01
  4. register_dst = v2 = 0x02

    machine code = opcode | v0 | v1 | v2 = 0x7606E600

(or whatever; the point is just a concatenation of different fields to build the instruction in binary)
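To make the concatenation concrete, here is a minimal sketch of such an encoder in C. The field widths and positions (one byte each, opcode in the top byte) are assumptions for illustration; adjust them to the real instruction format:

```c
#include <stdint.h>

/* Hypothetical field layout: 8-bit opcode in the top byte,
 * then three 8-bit register fields (src1, src2, dst).
 * Adjust widths/positions to match the real core. */
static uint32_t encode_insn(uint8_t opcode, uint8_t src1,
                            uint8_t src2, uint8_t dst)
{
    return ((uint32_t)opcode << 24) |
           ((uint32_t)src1   << 16) |
           ((uint32_t)src2   <<  8) |
            (uint32_t)dst;
}
```

With this layout, `encode_insn(0x12, 0x00, 0x01, 0x02)` yields `0x12000102`, which can then be written over the bus.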

Once it is sent through the bus, the core is able to request all the data from memory over dedicated buses and handle everything without using the processor. The big question is: how can I translate the previous instruction to its hexadecimal representation? (Sending it through the bus is not a problem.) Some options that come to mind are:

  • Run interpreted code (translate to machine code at runtime in the processor) --> very slow, even using some kind of inline macro
  • Compile the custom sections with an external custom compiler, load the binary from external memory and move it to the core with some dedicated instruction --> hard to read/understand the source code, poor SDK integration, too many sections if the code is very fragmented
  • JIT compilation --> too complex just for this?
  • Extending the compiler --> a nightmare!
  • A custom processor connected to the custom core to handle everything: loops, pointers, memory allocation, variables... --> too much work

The problem is about software/compilers, but for those with deep knowledge of this topic: this is a SoC in an FPGA, the main processor is a MicroBlaze, and the IP core employs AXI4 buses.

I hope I explained it correctly... Thanks in advance!

amnl
  • Maybe I should shorten my question... How do I add new instructions to the [Code Generation](http://en.wikipedia.org/wiki/Code_generation_%28compiler%29) stage of the compiler I'm using (gcc/g++)? – amnl Jan 13 '12 at 13:36

3 Answers


I'm not sure I entirely understand, but I think I've been faced with something similar before. Based on the comment to rodrigo's response it sounds like you have small instruction pieces scattered through your code. You also mention an external compiler is possible, just a pain. If you combine the external compiler with a C macro you can get something decent.

Consider this code:

for (i = 0; i < 15; i++)
     CORE_EXEC(vector_add(v0, v1, v2), ref1);

The CORE_EXEC macro will serve two purposes:

  1. You can use an external tool to scan your source files for these entries and compile the core code. The tool just produces a C file with the binary bits, which is linked in, using the "ref1" name as part of a variable name.
  2. In C you'll define the CORE_EXEC macro to pass the "ref1" string to the core for processing.

So stage 1 will produce a file of compiled binary core instructions; for example, the above might produce an array like this:

const unsigned char cx_ref1[] = { 0x12, 0x00, 0x01, 0x02 };

And you might define CORE_EXEC like this:

#define CORE_EXEC( code, name ) send_core_exec( cx_##name )

Obviously you can choose the prefixes however you want, though in C++ you might wish to use a namespace instead.

In terms of toolchain you could produce one file for all your bits, or one file per C++ file -- which might make dirty detection easier. Then you can simply include the generated files in your source code.

edA-qa mort-ora-y
  • I like the idea. The "external compiler" can be just an assembler-to-machine-code translator, concatenating each field of the instruction as I showed in the example in the question. But as I replied to @rodrigo, some of the parameters are calculated at runtime. How can I face this 'special' case? – amnl Jan 13 '12 at 20:14
  • It depends on how complex the parameters are. You could modify this approach to create a variadic (`__VA_ARGS__`) macro which takes the reference, function name, and parameters. Then instead of a fixed string you could have an inline function which quickly maps certain parameters to their byte equivalents. Or you could generate a unique macro for each code site, and generate the full macro as well, for example `CORE_EXEC_REF1`. – edA-qa mort-ora-y Jan 15 '12 at 13:46
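One way to handle the runtime-parameter case raised in these comments is to let the external tool emit a template with placeholder operand bytes, and patch the runtime value in before sending. A sketch, with a hypothetical layout (opcode, src1, src2, immediate, one byte each):

```c
#include <stdint.h>

/* Precompiled template for "vector_add v0, v1, #imm", generated
 * with the immediate field left as a placeholder (last byte). */
static const uint8_t cx_ref2_tmpl[] = { 0x12, 0x00, 0x01, 0x00 };

/* Copy the template and patch the runtime value into the
 * immediate slot before handing it to the core. */
static void build_ref2(uint8_t out[4], uint8_t imm)
{
    for (int i = 0; i < 4; i++)
        out[i] = cx_ref2_tmpl[i];
    out[3] = imm;  /* runtime-computed operand */
}
```

Only the fields known at compile time are baked into the template, so the per-call cost is a copy and a byte store, not a full encode.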

Couldn't you translate all your sections of code to machine code at the start of the program (just once), save them in binary format in blocks of memory, and then use those binaries when needed?

That's basically how the OpenGL shaders work, and I find that quite easy to manage.

The main drawback is memory consumption, as you have both the text and the binary representation of the same scripts in memory. I don't know if this is a problem for you. If it is, there are partial solutions, such as unloading the source text once it is compiled.
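The compile-once-at-startup idea can be sketched as a small registry, in the spirit of OpenGL shader objects. All names here (`register_script`, `lookup`, the table sizes) are illustrative, and the actual translation step is left out:

```c
#include <stdint.h>
#include <string.h>

#define MAX_SCRIPTS 16
#define MAX_WORDS   32

struct script {
    const char *name;            /* source identifier */
    uint32_t    code[MAX_WORDS]; /* translated machine words */
    int         len;             /* number of valid words */
};

static struct script table[MAX_SCRIPTS];
static int n_scripts;

/* Called once at startup, after translating the source text. */
static int register_script(const char *name, const uint32_t *words, int len)
{
    struct script *s = &table[n_scripts];
    s->name = name;
    memcpy(s->code, words, (size_t)len * sizeof *words);
    s->len = len;
    return n_scripts++;  /* handle to use at run time */
}

/* Cheap lookup at run time; the source text is no longer needed. */
static const struct script *lookup(const char *name)
{
    for (int i = 0; i < n_scripts; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}
```

Once everything is registered, the source text can be discarded, which addresses the memory-consumption drawback mentioned above.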

rodrigo
  • Yeah, it is a good idea, thanks! I had thought of something similar. The problem is that the integration is very poor, because all the code has to be in the same part of the program when in fact it is dispersed throughout the program. Think of SSE instructions; it is exactly the same problem I have. – amnl Jan 13 '12 at 13:42
  • You can create a singleton "script factory" somehow, and identify the sources by name, or something like that. Or register them on first use. There are a lot of ways to improve the integration based on it. – rodrigo Jan 13 '12 at 14:57
  • I have sparse instructions all along the code, in the middle of loops, if-elses, etc., and a lot of them. The problem with "precompiling" them is that some arguments of these functions are calculated in the processor and then transferred inside the instruction. If the argument changes value at runtime (i.e. the index of a loop), it won't work... e.g. vector_add(v0, v1, #constant) – amnl Jan 13 '12 at 20:07

Let's say I was going to modify an ARM core to add some custom instructions, and the operations I wanted to run were known at compile time (I'll get to runtime in a second).

I would use assembly, for example:

.globl vecabc
vecabc:
   .word 0x7606E600 ;@ special instruction
   bx lr

or inline it with whatever the inline-assembly syntax is for your compiler. That gets harder if you need to use processor registers, for example where the C compiler fills in the registers in the inline assembly and the assembler then assembles those instructions. I find it easier to write actual asm and just inject the words into the instruction stream as above: the compiler only distinguishes some bytes as data and some as instructions, and the core will see them in the order written.

If you need to do things at runtime you can use self-modifying code; again, I like to use asm to trampoline. Build the instructions you want to run somewhere in RAM, say at address 0x20000000, then have a trampoline call it:

.globl tramp
tramp:
    bx r0 ;@ assuming you encoded a return in your instructions

call it with

tramp(0x20000000);
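The "build the instructions in RAM" step might look like the sketch below. The encoding layout is the same hypothetical one as in the question (opcode/src1/src2/dst, one byte each), the trailing word reuses the ARM `bx lr` encoding as the return old_timer mentions, and on real hardware you would also need to flush/invalidate the caches before jumping to the buffer:

```c
#include <stdint.h>

/* Build a short runtime instruction sequence in a RAM buffer.
 * The custom-instruction layout is an assumption for illustration. */
static int build_sequence(uint32_t *buf, uint8_t s1, uint8_t s2, uint8_t d)
{
    int n = 0;
    /* custom vector_add: opcode 0x12 | src1 | src2 | dst */
    buf[n++] = 0x12000000u | ((uint32_t)s1 << 16)
                           | ((uint32_t)s2 << 8)
                           | (uint32_t)d;
    /* encoded return so the trampoline comes back: ARM "bx lr" */
    buf[n++] = 0xE12FFF1Eu;
    return n;  /* number of words written */
}
```

After building (and cache maintenance), `tramp()` would branch to the buffer's address and the core would execute the words as written.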

Another path, related to the one above, is to modify the assembler to add the new instructions, creating a syntax for them. Then you can use straight assembly language or inline assembly at will. You won't get the compiler to emit them without modifying the compiler too, which is another path to take after the assembler has been modified.

old_timer
  • Mmm... let me think about that and review asm, I'm not sure I've caught it all. Thanks ;) – amnl Jan 13 '12 at 20:20