Intro
I am going to write my own FORTH "engine" in GNU assembler (GAS) for Linux x86-64 (specifically for the AMD Ryzen 9 3900X that is sitting on my desk).
(If it is a success, I may use a similar idea to make firmware for a retro 6502 and similar home-brewed computers.)
I want to add some interesting debugging features, such as saving comments with the compiled code in the form of "NOP words" with attached strings. These would do nothing at runtime, but when disassembling/printing already defined words, the comments would be printed too, so I would not lose all the headers ( a b -- c ) and comments like ( here goes this particular little trick ). I would be able to define new words with documentation, later print all definitions in some nice way, and make a new library from those, which I consider good. (And there would be a switch to simply ignore comments for a "production release".)
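To make the "NOP word" idea concrete, here is a rough sketch in C rather than assembler, just for brevity. Everything in it is hypothetical and made up for illustration: the opcodes, the `comments` table, and the `run`/`see` names are assumptions, not part of any existing engine. A comment opcode carries an index into a separate string table; the interpreter skips it, and the decompiler prints it unless told not to.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical opcodes of a toy bytecode, for illustration only. */
enum { OP_LIT, OP_ADD, OP_COMMENT, OP_EXIT };

/* Comment strings live in their own table, away from the hot bytecode. */
static const char *comments[] = {
    "( a b -- c )",
    "( here goes this particular little trick )",
};

/* Execute a word. OP_COMMENT only skips its operand. Returns top of stack. */
static long run(const long *ip) {
    long stack[16];
    int sp = 0;
    for (;;) {
        switch (*ip++) {
        case OP_LIT:     stack[sp++] = *ip++;              break;
        case OP_ADD:     sp--; stack[sp - 1] += stack[sp]; break;
        case OP_COMMENT: ip++;                             break;
        case OP_EXIT:    return stack[sp - 1];
        default:         return 0; /* unknown opcode: give up */
        }
    }
}

/* Decompile a word into buf. With keep_comments == 0 the comments are
   dropped, which is the "production release" switch. */
static void see(const long *ip, int keep_comments, char *buf, size_t cap) {
    size_t len = 0;
    buf[0] = '\0';
    while (len < cap) {
        long op = *ip++;
        if (op == OP_EXIT) {
            snprintf(buf + len, cap - len, ";");
            return;
        } else if (op == OP_LIT) {
            len += snprintf(buf + len, cap - len, "%ld ", *ip++);
        } else if (op == OP_ADD) {
            len += snprintf(buf + len, cap - len, "+ ");
        } else if (op == OP_COMMENT) {
            const char *text = comments[*ip++];
            if (keep_comments)
                len += snprintf(buf + len, cap - len, "%s ", text);
        } else {
            return; /* unknown opcode: stop decompiling */
        }
    }
}
```

With a word compiled as `OP_COMMENT 0, OP_LIT 2, OP_LIT 3, OP_ADD, OP_EXIT`, `run` ignores the comment entirely, while `see` reproduces the stack-effect header in front of the body. Storing only an index in the bytecode keeps the comment text itself out of the code that runs all the time.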
I have read a lot about optimization here and I am not able to absorb all of it in a few weeks, so I will put off micro-optimization until there are actual performance problems, and then I will start with profiling.
But I want to start with at least decent architectural decisions.
What I have understood so far:
- it would be nice if the program ran mainly from the CPU cache, not from main memory
- the cache is filled somehow "automagically", but keeping related data/code compact and as close together as possible may help a lot
- I identified some areas that would be good candidates for caching and some that are not so good - I sorted them in order of importance:
  - assembler code - the engine and basic words like "+" - used all the time (fixed size, .text section)
  - both stacks - also used all the time (dynamic; I will probably use rsp for the data stack and implement the return stack independently - not sure yet which will be "native" and which "emulated")
  - forth bytecode - the defined and compiled words - used at runtime, when speed matters (size still growing)
  - variables, constants, strings, other memory allocations (used at runtime)
  - names of words ("DUP", "DROP" - used only when defining new words in the compilation phase)
  - comments (used once a day or so)
Question:
As there are a lot of "heaps" that grow up (well, "free" is never used, so each of them may as well be a stack growing up) and two stacks that grow down, I am unsure how to implement this so that the CPU cache covers it reasonably well.
My idea is to use one "big heap" (and increase it with brk() when needed), then allocate big chunks of aligned memory on it, implementing a "smaller heap" in each chunk and extending it into another big chunk when the old one fills up.
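A minimal sketch of this idea, again in C instead of assembler. The region names, `CHUNK_SIZE`, `CHUNK_ALIGN`, and the `region_alloc`/`grab_chunk` functions are all hypothetical, and `aligned_alloc` is used here only as a stand-in for carving an aligned chunk off the brk()-grown heap.

```c
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE  (64 * 1024) /* assumed chunk size, a multiple of the page size */
#define CHUNK_ALIGN 4096        /* assumed chunk alignment (one page) */

/* One "smaller heap" per kind of data, hottest first. */
enum region { R_CODE, R_DATA, R_NAMES, R_COMMENTS, R_COUNT };

struct heap {
    char  *chunk;  /* current chunk, NULL until the first allocation */
    size_t used;   /* bytes already handed out from the current chunk */
    size_t chunks; /* how many chunks this heap has grabbed so far */
};

static struct heap heaps[R_COUNT];

/* Stand-in for taking the next aligned chunk from the big heap. */
static char *grab_chunk(void) {
    return aligned_alloc(CHUNK_ALIGN, CHUNK_SIZE);
}

/* Bump-allocate n bytes from region r. Nothing is ever freed.
   Returns NULL if n does not fit in a chunk or no chunk is available. */
static void *region_alloc(enum region r, size_t n) {
    struct heap *h = &heaps[r];
    n = (n + 7) & ~(size_t)7;            /* keep 8-byte cell alignment */
    if (n > CHUNK_SIZE)
        return NULL;
    if (h->chunk == NULL || h->used + n > CHUNK_SIZE) {
        char *fresh = grab_chunk();      /* old chunk's tail is abandoned */
        if (fresh == NULL)
            return NULL;
        h->chunk = fresh;
        h->used = 0;
        h->chunks++;
    }
    void *p = h->chunk + h->used;
    h->used += n;
    return p;
}
```

Two allocations from `R_CODE` land next to each other in the same chunk, while an allocation from `R_NAMES` goes to a different chunk, so word names and comments never share cache lines with compiled code.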
I hope that the cache would automagically pick up the most-used blocks first and keep them most of the time, while the less-used blocks would be mostly ignored by the cache (that is, they would occupy only a small part of it and be read and evicted all the time), but maybe I did not understand it correctly.
Or is there some better strategy for this?