Dynamically generating code is pretty well-known technique, for example to speed up interpreted languages, domain-specific languages and so on. Whether you want to work low-level (close to 1:1 with assembly), or high-level you can find libraries you help you out.
Note the distinction between self-modifying code and dynamically-generated code. The former means that some code that has executed will be modified in part and then executed again. The latter means that some code, that doesn't exist statically in the process binary on disk, is written to memory and then executed (but will not necessarily ever be modified). The distinction might be important below or simply because people treat self-modifying code as a smell, but dynamically generated code as a great performance trick.
The usual use-case is that the generated code will be executed many times. This means the focus is usually on the efficiency of the generated code, and to a lesser extent the compilation time, and least of all the mechanics of actually writing the code, making it executable and starting execution.
Imagine however, that your use case was generating code that will execute exactly once and that this is straight-line code without loops. The "compilation" process that generates the code is very fast (close to memcpy
speed). In this case, the actual mechanics of writing to the code to memory and executing it once become important for performance.
For example, the total amount of code executed may be 10s of GBs or more. Clearly you don't want to just write all out to a giant buffer without any re-use: this would imply writing 10GB to memory and perhaps also reading 10GB (depending on how generation and execution was interleaved). Instead you'd probably want to use some reasonably sized buffer (say to fit in the L1 or L2 cache): write out a buffer's worth of code, execute it, then overwrite the buffer with the next chunk of code and so on.
The problem is that this seems to raise the spectre of self-modifying code. Although the "overwrite" is complete, you are still overwriting memory that was at one point already executed as instructions. The newly written code has to somehow make its way from the L1D to the L1I, and the associated performance hit is not clear. In particular, there have been reports that simply writing to the code area that has already been executed may suffer penalties of 100s of cycles and that the number of writes may be important.
What's the best way of generating a large about of dynamically generated straight-line code on x86 and executing it?