4

Theoretically, is there any way to perform computations within the RAM, using memory-related instructions such as mov, clflush or whatever (for example, an xor between two adjacent rows)?

With my limited knowledge about RAM and assembly, I can't think of any such possibilities.

Peter Cordes
agemO
  • 1
    While there are a few x86 instructions that can work memory to memory (like [MOVS](https://c9x.me/x86/html/file_module_x86_id_203.html)), I don't think you are going to find any that will allow you to xor one memory location against another. They are all going to require that at least 1 of the operands be a register. Why do you think you need to do this? – David Wohlferd Jun 28 '17 at 06:48
  • I don't need it, it is just a theoretical question. I don't care if the whole computation is fully memory to memory. – agemO Jun 28 '17 at 07:01
  • 1
    "I don't care if the whole computation is fully memory to memory" - Umm, then why not read the first int from the first row into a register, and `xor` it with the memory address of the first int in the 'other' row. Repeat until the row is complete. I must be missing something here. – David Wohlferd Jun 28 '17 at 07:07
  • If a few cpu operations are needed for whole rows to be computed for example – agemO Jun 28 '17 at 07:09
  • 4
    @agemO: Why don't you tell us what you want to achieve? This question doesn't make much sense, as it is. Edit it and tell what you want to do (I assume something is too slow?) and what the real problem is. – Rudy Velthuis Jun 28 '17 at 07:14
  • Nothing is too slow, as I said it is a theoretical question, for example I could never have thought about row hammer. So my question is just: is there a way to perform some kind of computations (even useless ones) using mainly the RAM ? – agemO Jun 28 '17 at 07:23
  • 1
    See: [mov is Turing-complete](https://www.cl.cam.ac.uk/~sd601/papers/mov.pdf) – Paul R Jun 28 '17 at 07:48
  • 2
    In this context, what is a computation? – Margaret Bloom Jun 28 '17 at 07:57
  • The question is hard to formulate because I don't know the answer, maybe I should reduce to applying single gates on whole rows ? – agemO Jun 28 '17 at 08:11
  • 1
    What do you mean with "applying single gates on whole rows", or with "row hammer"? – Rudy Velthuis Jun 28 '17 at 08:51
  • @RudyVelthuis 'hammer' probably means [this](https://en.wikipedia.org/wiki/Row_hammer). 'Gates' is probably intended to be some other hardware operation relating to RAM chips. Whatever he is trying to ask, it sounds like he's approaching it from a low-level hardware background. – David Wohlferd Jun 28 '17 at 09:18
  • @DavidWohlferd: I saw the (deleted) answer that explains this, but thanks for the link. I also assumed he means hardware gates, but it still doesn't clarify what he is asking. – Rudy Velthuis Jun 28 '17 at 09:30
  • By gates I mean `xor`, `and` or whatever logical operation. I just wanted to know whether you can perform some calculation using mainly the RAM (I know you can add processing units into RAM (https://en.wikipedia.org/wiki/Computational_RAM) to achieve this), so I was wondering whether you could achieve anything with standard RAM. – agemO Jun 28 '17 at 09:55
  • 1
    oh, you mean like without CPU? Like the operation done inside the RAM chip itself? interesting... With some particular memory content setup you certainly can do many calculations, that's how LUT (look-up-table) works, you use *x* as offset into LUT, and the memory content there is *f(x)*, i.e. *f(x)* has been calculated. – Ped7g Jun 28 '17 at 10:18
  • So I think you gave the answer @PeterCordes, the answer seems to be a clear "no" – agemO Jun 29 '17 at 07:21
  • @agemO: turned my comments into an answer, since you've confirmed that was the kind of answer you were looking for. – Peter Cordes Jun 29 '17 at 07:41

2 Answers

6

No, any computation is done in the CPU (or GPU, or other system devices that can load/store to RAM). Even the Turing-complete mov stuff that @PaulR linked in a comment is just using the CPU's address-generation hardware with data in registers to do calculations.
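
For example, even the lookup-table idea from the comments (store f(x) at table[x]) only "computes" because the CPU's address-generation unit adds base and index; the DRAM just services plain loads and stores. A minimal sketch with hypothetical labels and pointers (`table` = a 256-byte LUT, rdi = source, rsi = destination):

```asm
movzx eax, byte [rdi]      ; ordinary load of the input byte x
mov   al, [table + rax]    ; the AGU computes table + rax; DRAM only sees a load
mov   [rsi], al            ; ordinary store of f(x)
```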

The memory still just sees 64B burst-loads and 64B burst-stores when the CPU has cache misses.

See also What Every Programmer Should Know About Memory for some background on how the DDR protocol works (send the address, then transfer a burst of data to/from the RAM).


Related: is num++ atomic in C, or with x86 inc [mem]?

lock inc [mem] is actually implemented inside the CPU with a load/modify/store that the CPU makes look atomic to all possible other observers in the system (e.g. other CPU cores, and PCIe devices). But not including stuff like hooking up a logic analyzer to the memory bus, which doesn't respect the cache-coherency protocol that a CPU core uses to hold exclusive rights to a cache line while it's doing the atomic read-modify-write.
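
Roughly, the sequence below is what the core does internally for that instruction; written out as separate instructions it would *not* be atomic, which is the whole point of the lock prefix (rdi is just an assumed pointer to the counter):

```asm
lock inc dword [rdi]    ; atomic: the core keeps the cache line exclusively owned

; approximately the internal load/modify/store, NOT atomic when written like this:
mov  eax, [rdi]         ; load
inc  eax                ; the add happens in the core's ALU, not in the DRAM
mov  [rdi], eax         ; store back
```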

Some people thought that the add is done "inside" the memory chips, but they are mistaken. There is no adder, or even Boolean AND/OR/XOR hardware, in a DRAM chip (or in the interface chips that connect it to a DDR4 bus); all it can do is load from or store to a given address. Any chip that can do more than that is not just DRAM.

Well obviously there's logic in the memory interface chips, but it isn't hooked up to operate on the data.

If it did have that, it would be a type of Computational RAM. (Thanks for linking that in the comments, BTW. Interesting computer-architecture idea. AFAIK, no mainstream CPUs or GPUs use C-RAM, though.)

You can't even ask DDR4 DRAM to zero a page for you. The CPU has to do that through the memory controllers.
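
Zeroing a page is just an ordinary store loop (or rep stos, or vector stores) issued by a core. A minimal sketch, assuming rdi already holds a pointer to the 4 KiB page:

```asm
xor  eax, eax        ; the zeros originate in a CPU register
mov  rcx, 4096       ; one 4 KiB page
rep  stosb           ; the core streams the stores; the DRAM only ever sees writes
```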

Peter Cordes
  • BTW, can you at least set a whole row to zero or one with a constant number of cpu cycles ? – agemO Jun 29 '17 at 08:19
  • @agemO: CPU core clock cycles, no. The memory clock is separate from the CPU core clock. But usually you don't need to zero the actual DRAM, just store zeros and let caches + out-of-order execution take care of things. AMD CPUs even support [a `CLZERO` instruction that zeros a whole cache line](https://www.phoronix.com/scan.php?page=news_item&px=AMD-Zen-CLZERO-Kernel) . (Intel CPUs with AVX512 can also do this with an ordinary vector store, since AVX512 vectors are 64 bytes long = 1 cache line). Anyway, *if* you don't have any cache misses, zeroing a block of memory is fairly predictable. – Peter Cordes Jun 29 '17 at 08:31
  • @agemO: If the question was: a constant number of memory-bus cycles: then yes, I think so for identical starting conditions and a fixed set of memory timings ([CAS latency and so on](http://www.crucial.com/usa/en/memory-performance-speed-latency); those numbers like 12-14-14-28; see also http://www.itrw-pc.com/featured/ddr4-overclocking-basics/). If the page of DRAM isn't already open, then it will take extra time before the memory controller can access it. (DRAM pages aren't the same thing or the same size as virtual memory pages). (This is what I meant by starting conditions). – Peter Cordes Jun 29 '17 at 08:35
  • I'm probably getting some of this wrong, I'm not really a DRAM expert. I know a lot of the terms, but I forget exactly what they all mean :P I mostly optimize CPU-bound stuff, or at least cache-bound, not DRAM. I do know that DRAM itself has some locality benefits (within an already-open page or row or something). And of course locality also helps cache hit-rates. – Peter Cordes Jun 29 '17 at 08:38
  • http://www.eng.utah.edu/~cs7810/pres/11-7810-12.pdf also says some stuff about banks and ranks offering memory parallelism. Anyway, in general no, writing a fixed amount of data to memory will take variable time depending on the current state of things, and the memory latency timings (separate from the memory clock speed, usually auto-configured by the mobo after reading the SPD or XMP EEPROMs on the installed DIMMs). A lot of stuff about DRAM timings is also in Ulrich Drepper's What Every Programmer Should Know About Memory, IIRC, but it's been a while since I read it. – Peter Cordes Jun 29 '17 at 08:42
  • 2
    @agemO If you want to write a row in a constant number of CPU cycles you can just disable the cache and then execute whatever constant number of memory writes it takes to zero an entire row. However, I'm not sure if that is what you meant to ask. If you mean to ask whether you can zero an entire row in one single operation or a few cycles, then the answer is no. – Ross Ridge Jun 29 '17 at 08:51
  • @RossRidge: Only if you also disable turbo and other CPU frequency-scaling things. And even then, it's not really constant time because it depends on initial conditions (whether the row was already open, or something like that), right? And disabling caching doesn't disable the store buffer, so the time would also depend on whether some other stores are still pending. – Peter Cordes Jun 29 '17 at 08:54
  • @PeterCordes Yah, it's only mostly constant. I wanted to keep it simple because I didn't think that was actually the intended question. – Ross Ridge Jun 29 '17 at 09:08
  • @RossRidge: yeah, while getting a snack just now I was thinking about our comments and realizing that maybe you were right, and the OP was really wondering if the cycle count was *predictably small*, rather than constant. `movnt` stores would be another way to bypass cache and store to DRAM (except on CPUs with an eDRAM memory-side cache, which I assume `movnt` doesn't bypass). But anyway, there's no special-case for all-zeros or all-ones (other than AMD's `clzero`). @ agemO: It is cheaper than most to generate those bit-patterns in a register to set up for a vector loop or `rep stos`... – Peter Cordes Jun 29 '17 at 09:13
  • 1
    There's at least one case on PCs where computation outside of the CPU can happen as a result of memory accesses to device RAM. VGA cards allow some simple operations to be performed on data as it's being written by the CPU to video RAM. These operations however aren't performed by the video RAM itself. http://wiki.osdev.org/VGA_Hardware#Read.2FWrite_logic – Ross Ridge Jun 29 '17 at 09:26
  • 1
    Indeed I wanted to say "small", since the number of cells in a row is itself constant. – agemO Jun 29 '17 at 11:52
4

Amazingly, yes, as per Paul R's comment: mov is Turing complete.

Such a mov-only computer would be (highly) impractical, however, not to mention that it would be fiendishly hard to write a compiler for it. There is a C compiler that translates general-purpose C programs into x86 mov instructions, and amazingly it even allows floating-point calculations.
Because it is based on a Turing machine and not a von Neumann architecture, it is horribly slow (but it's a great way to obfuscate your code :-) ).
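
The core trick is that mov's addressing modes can do a branchless select, so control flow gets encoded as data. A hedged sketch with hypothetical labels (cond holds 0 or 1, sel_table holds two pointers):

```asm
movzx rax, byte [cond]            ; condition as a 0/1 index
mov   rax, [sel_table + rax*8]    ; address arithmetic picks one of two pointers
mov   eax, [rax]                  ; the "selected" value, with no jcc or cmov
```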

For all practical purposes you can only do calculations via registers.
AFAIK only movs takes two memory operands; every other instruction that accesses memory uses a constant or register operand in addition.
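
So the xor from the question ends up looking something like this sketch (rsi and rdi pointing into the two "rows", rdx at the destination):

```asm
mov  eax, [rsi]      ; load 4 bytes of row A into a register
xor  eax, [rdi]      ; xor with 4 bytes of row B; one operand must be a register
mov  [rdx], eax      ; store the result (there is no xor [mem], [mem] encoding)
```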

Rowhammer is not a calculation mechanism, because it is non-deterministic.
It's also an artefact of the way DRAM is implemented; cache memory does not suffer from this effect.

Johan
  • Is "AKIAK" supposed to be "AFAIK"? I've never heard that abbreviation and can't find it online, except as the name of a town in Alaska. Overzealous auto-correct, perhaps? – Cody Gray - on strike Jun 28 '17 at 11:11
  • 3
    [There is actually a `MOV`-only compiler for x86](https://github.com/xoreaxeaxeax/movfuscator), inspired by the cited paper by Stephen Dolan. – Cody Gray - on strike Jun 28 '17 at 11:12
  • The MOV instruction isn't Turing complete alone, you also need a single JMP instruction to simulate a Turing machine. From the paper you linked: "This simultes [sic] an arbitrary Turing machine, using only `mov` (and a single `jmp` to loop the program)." – Ross Ridge Jun 28 '17 at 16:54
  • 1
    Apparently it is, @Ross. The compiler I linked above does not use any JMP instructions and claims to be Turing-complete. [The only limitation](https://github.com/xoreaxeaxeax/movfuscator#mov-violations) is when you mix it with binaries generated by other less-demented toolchains, then a JMP is required. Dolan just wasn't quite imaginative enough in the original paper. – Cody Gray - on strike Jun 28 '17 at 19:25
  • 1
    @CodyGray Ehh.. its solution appears to be even worse: using a faulting MOV instruction means that a whole bunch of other instructions get executed during the OS handling of the fault. I think the only non-cheating method would be relying on the 64K address wrap of a 16-bit code segment, but that would seriously limit the size of programs. Maybe do a 32-bit wrap combined with clever page remapping to share code and data, and really clever TLB management, though that would mean not having an OS in the way. – Ross Ridge Jun 28 '17 at 20:20
  • 1
    @CodyGray And they're definitely cheating on their choice of faulting "MOV" instruction. The instruction `mov cs, eax` doesn't exist, so it's not actually a MOV instruction. – Ross Ridge Jun 28 '17 at 20:22
  • `push [mem]` and `pop [mem]` also copy between memory locations like `movs`. `cmpsb` reads two memory operands. But AFAIK there are no proper computation instructions that take two separate memory operands, implicit or explicit. – Peter Cordes Jun 28 '17 at 21:15
  • 1
    The `mov` Turing-machine does all the computation in the CPU's address-generation unit with data from registers. That's hardly "in the RAM". It just uses RAM to hold integers, and doesn't access it in any special way. Just ordinary loads and stores. – Peter Cordes Jun 28 '17 at 21:19
  • `mov` is Turing complete but it's still done by CPU. Yet it's possible to remove CPU physically and rewire RAM somehow to make it Turing complete – l4m2 May 18 '23 at 22:47