Using memoryview creates duplicate and out-of-order writes when writing consecutive 32-bit words

Question

This is related to "Python: writing to memory in a single operation", which covered writing a single value to memory.

I'm revisiting this topic with the goal of efficiently writing contiguous blocks of memory via Python 3.8's memoryview; in particular, writing 32-bit values directly to RAM via /dev/mem, on an ARM64 Cortex A57 CPU.

Don't run this code! It may crash your computer!

OFFSET is special, and the qemu_ram_write: output is explained below. For now it is sufficient to know that it is reporting the individual data write operations on the system bus.

>>> import os, mmap
>>> OFFSET=0x0a000000

>>> fd = os.open("/dev/mem", os.O_RDWR)   # DANGER
>>> mm = mmap.mmap(fd, 4096, offset=OFFSET)
>>> mv32 = memoryview(mm).cast("@I")

# This does a single memory write (as per the linked thread):
>>> mv32[0] = 1
qemu_ram_write: addr 0x0, data 0x1, size 0x4

# However this does two writes:
>>> data = memoryview(bytearray([1, 0, 0, 0])).cast("@I")
>>> mv32[0:1] = data
qemu_ram_write: addr 0x0, data 0x1, size 0x4
qemu_ram_write: addr 0x0, data 0x1, size 0x4

The same data is written twice!
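
For anyone who wants to poke at this without touching /dev/mem, here is a minimal sketch that runs the same slice assignment against an anonymous mapping (an assumption made for safety; it is not the setup used above). It cannot show bus-level writes, only that the final memory contents are identical whether the words are stored by slice assignment or one index at a time, which is why the duplication is invisible from inside Python.

```python
import mmap

# Anonymous mapping (fd = -1) stands in for /dev/mem; no hardware is touched.
mm = mmap.mmap(-1, 4096)
mv32 = memoryview(mm).cast("@I")

data = memoryview(bytearray(range(4 * 4))).cast("@I")

# Slice assignment: one Python-level statement, the copy happens inside CPython.
mv32[0:len(data)] = data
via_slice = bytes(mm[:16])

# Zero the region, then store word by word using the single-index form.
mv32[0:len(data)] = memoryview(bytearray(16)).cast("@I")
for i, word in enumerate(data):
    mv32[i] = word
via_loop = bytes(mm[:16])

print(via_slice == via_loop)  # True
```

Whether the element-by-element loop also produces exactly one bus write per word on the real target is not something this sketch can confirm; it would have to be checked against the qemu_ram_write log.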

A similar thing happens with four 32-bit values:

>>> data = memoryview(bytearray([x for x in range(4 * 4)])).cast("@I")
>>> mv32[0:0 + len(data)] = data
qemu_ram_write: addr 0x0, data 0x3020100, size 0x4
qemu_ram_write: addr 0x4, data 0x7060504, size 0x4
qemu_ram_write: addr 0x8, data 0xb0a0908, size 0x4
qemu_ram_write: addr 0xc, data 0xf0e0d0c, size 0x4
qemu_ram_write: addr 0x0, data 0x3020100, size 0x4
qemu_ram_write: addr 0x4, data 0x7060504, size 0x4
qemu_ram_write: addr 0x8, data 0xb0a0908, size 0x4
qemu_ram_write: addr 0xc, data 0xf0e0d0c, size 0x4

Notice how the four 32-bit words are written correctly, and then written again!

But this does not happen with eight 32-bit words:

>>> data = memoryview(bytearray([x for x in range(4 * 8)])).cast("@I")
>>> mv32[0:0 + len(data)] = data
qemu_ram_write: addr 0x0, data 0x3020100, size 0x4
qemu_ram_write: addr 0x4, data 0x7060504, size 0x4
qemu_ram_write: addr 0x8, data 0xb0a0908, size 0x4
qemu_ram_write: addr 0xc, data 0xf0e0d0c, size 0x4
qemu_ram_write: addr 0x10, data 0x13121110, size 0x4
qemu_ram_write: addr 0x14, data 0x17161514, size 0x4
qemu_ram_write: addr 0x18, data 0x1b1a1918, size 0x4
qemu_ram_write: addr 0x1c, data 0x1f1e1d1c, size 0x4

Notice how the words are written correctly, and only once. This is as expected. Writing 16 32-bit words also behaves as expected.

32 words is interesting in its own way:

>>> data = memoryview(bytearray([x for x in range(4 * 32)])).cast("@I")
>>> mv32[0:0 + len(data)] = data
qemu_ram_write: addr 0x40, data 0x43424140, size 0x4
qemu_ram_write: addr 0x44, data 0x47464544, size 0x4
qemu_ram_write: addr 0x48, data 0x4b4a4948, size 0x4
qemu_ram_write: addr 0x4c, data 0x4f4e4d4c, size 0x4
qemu_ram_write: addr 0x50, data 0x53525150, size 0x4
qemu_ram_write: addr 0x54, data 0x57565554, size 0x4
qemu_ram_write: addr 0x58, data 0x5b5a5958, size 0x4
qemu_ram_write: addr 0x5c, data 0x5f5e5d5c, size 0x4
qemu_ram_write: addr 0x0, data 0x3020100, size 0x4
qemu_ram_write: addr 0x4, data 0x7060504, size 0x4
qemu_ram_write: addr 0x8, data 0xb0a0908, size 0x4
qemu_ram_write: addr 0xc, data 0xf0e0d0c, size 0x4
qemu_ram_write: addr 0x10, data 0x13121110, size 0x4
qemu_ram_write: addr 0x14, data 0x17161514, size 0x4
qemu_ram_write: addr 0x18, data 0x1b1a1918, size 0x4
qemu_ram_write: addr 0x1c, data 0x1f1e1d1c, size 0x4
qemu_ram_write: addr 0x20, data 0x23222120, size 0x4
qemu_ram_write: addr 0x24, data 0x27262524, size 0x4
qemu_ram_write: addr 0x28, data 0x2b2a2928, size 0x4
qemu_ram_write: addr 0x2c, data 0x2f2e2d2c, size 0x4
qemu_ram_write: addr 0x30, data 0x33323130, size 0x4
qemu_ram_write: addr 0x34, data 0x37363534, size 0x4
qemu_ram_write: addr 0x38, data 0x3b3a3938, size 0x4
qemu_ram_write: addr 0x3c, data 0x3f3e3d3c, size 0x4
qemu_ram_write: addr 0x60, data 0x63626160, size 0x4
qemu_ram_write: addr 0x64, data 0x67666564, size 0x4
qemu_ram_write: addr 0x68, data 0x6b6a6968, size 0x4
qemu_ram_write: addr 0x6c, data 0x6f6e6d6c, size 0x4
qemu_ram_write: addr 0x70, data 0x73727170, size 0x4
qemu_ram_write: addr 0x74, data 0x77767574, size 0x4
qemu_ram_write: addr 0x78, data 0x7b7a7978, size 0x4
qemu_ram_write: addr 0x7c, data 0x7f7e7d7c, size 0x4

The correct number of writes is performed, but the order is all over the place!

And 64 words is weird too:

>>> data = memoryview(bytearray([x for x in range(4 * 64)])).cast("@I")
>>> mv32[0:0 + len(data)] = data
qemu_ram_write: addr 0x0, data 0x3020100, size 0x4
qemu_ram_write: addr 0x4, data 0x7060504, size 0x4
qemu_ram_write: addr 0x8, data 0xb0a0908, size 0x4
qemu_ram_write: addr 0xc, data 0xf0e0d0c, size 0x4
qemu_ram_write: addr 0x10, data 0x13121110, size 0x4
qemu_ram_write: addr 0x14, data 0x17161514, size 0x4
qemu_ram_write: addr 0x18, data 0x1b1a1918, size 0x4
qemu_ram_write: addr 0x1c, data 0x1f1e1d1c, size 0x4
qemu_ram_write: addr 0x20, data 0x23222120, size 0x4
qemu_ram_write: addr 0x24, data 0x27262524, size 0x4
qemu_ram_write: addr 0x28, data 0x2b2a2928, size 0x4
qemu_ram_write: addr 0x2c, data 0x2f2e2d2c, size 0x4
qemu_ram_write: addr 0x30, data 0x33323130, size 0x4
qemu_ram_write: addr 0x34, data 0x37363534, size 0x4
qemu_ram_write: addr 0x38, data 0x3b3a3938, size 0x4
qemu_ram_write: addr 0x3c, data 0x3f3e3d3c, size 0x4
qemu_ram_write: addr 0x40, data 0x43424140, size 0x4
qemu_ram_write: addr 0x44, data 0x47464544, size 0x4
qemu_ram_write: addr 0x48, data 0x4b4a4948, size 0x4
qemu_ram_write: addr 0x4c, data 0x4f4e4d4c, size 0x4
qemu_ram_write: addr 0x50, data 0x53525150, size 0x4
qemu_ram_write: addr 0x54, data 0x57565554, size 0x4
qemu_ram_write: addr 0x58, data 0x5b5a5958, size 0x4
qemu_ram_write: addr 0x5c, data 0x5f5e5d5c, size 0x4
qemu_ram_write: addr 0x60, data 0x63626160, size 0x4
qemu_ram_write: addr 0x64, data 0x67666564, size 0x4
qemu_ram_write: addr 0x68, data 0x6b6a6968, size 0x4
qemu_ram_write: addr 0x6c, data 0x6f6e6d6c, size 0x4
qemu_ram_write: addr 0x70, data 0x73727170, size 0x4
qemu_ram_write: addr 0x74, data 0x77767574, size 0x4
qemu_ram_write: addr 0x78, data 0x7b7a7978, size 0x4
qemu_ram_write: addr 0x7c, data 0x7f7e7d7c, size 0x4
qemu_ram_write: addr 0x80, data 0x83828180, size 0x4
qemu_ram_write: addr 0x84, data 0x87868584, size 0x4
qemu_ram_write: addr 0x88, data 0x8b8a8988, size 0x4
qemu_ram_write: addr 0x8c, data 0x8f8e8d8c, size 0x4
qemu_ram_write: addr 0x90, data 0x93929190, size 0x4
qemu_ram_write: addr 0x94, data 0x97969594, size 0x4
qemu_ram_write: addr 0x98, data 0x9b9a9998, size 0x4
qemu_ram_write: addr 0x9c, data 0x9f9e9d9c, size 0x4
qemu_ram_write: addr 0xa0, data 0xa3a2a1a0, size 0x4
qemu_ram_write: addr 0xa4, data 0xa7a6a5a4, size 0x4
qemu_ram_write: addr 0xa8, data 0xabaaa9a8, size 0x4
qemu_ram_write: addr 0xac, data 0xafaeadac, size 0x4
qemu_ram_write: addr 0xb0, data 0xb3b2b1b0, size 0x4
qemu_ram_write: addr 0xb4, data 0xb7b6b5b4, size 0x4
qemu_ram_write: addr 0xb8, data 0xbbbab9b8, size 0x4
qemu_ram_write: addr 0xbc, data 0xbfbebdbc, size 0x4
qemu_ram_write: addr 0xc0, data 0xc3c2c1c0, size 0x4
qemu_ram_write: addr 0xc4, data 0xc7c6c5c4, size 0x4
qemu_ram_write: addr 0xc8, data 0xcbcac9c8, size 0x4
qemu_ram_write: addr 0xcc, data 0xcfcecdcc, size 0x4
qemu_ram_write: addr 0xc0, data 0xc3c2c1c0, size 0x4 *
qemu_ram_write: addr 0xc4, data 0xc7c6c5c4, size 0x4 *
qemu_ram_write: addr 0xc8, data 0xcbcac9c8, size 0x4 *
qemu_ram_write: addr 0xcc, data 0xcfcecdcc, size 0x4 *
qemu_ram_write: addr 0xd0, data 0xd3d2d1d0, size 0x4
qemu_ram_write: addr 0xd4, data 0xd7d6d5d4, size 0x4
qemu_ram_write: addr 0xd8, data 0xdbdad9d8, size 0x4
qemu_ram_write: addr 0xdc, data 0xdfdedddc, size 0x4
qemu_ram_write: addr 0xe0, data 0xe3e2e1e0, size 0x4
qemu_ram_write: addr 0xe4, data 0xe7e6e5e4, size 0x4
qemu_ram_write: addr 0xe8, data 0xebeae9e8, size 0x4
qemu_ram_write: addr 0xec, data 0xefeeedec, size 0x4
qemu_ram_write: addr 0xf0, data 0xf3f2f1f0, size 0x4
qemu_ram_write: addr 0xf4, data 0xf7f6f5f4, size 0x4
qemu_ram_write: addr 0xf8, data 0xfbfaf9f8, size 0x4
qemu_ram_write: addr 0xfc, data 0xfffefdfc, size 0x4

If you count them, there are 68 (not 64) 4-byte writes above. The four that are duplicated are marked with a *.
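
Counting by eye gets tedious, so here is a rough sketch of a helper that tallies a captured log. The line format is taken from the output above; the function names parse_writes and summarize are made up for illustration.

```python
import re
from collections import Counter

# Matches log lines in the format shown in this question.
LINE_RE = re.compile(
    r"qemu_ram_write: addr (0x[0-9a-f]+), data (0x[0-9a-f]+), size (0x[0-9a-f]+)"
)

def parse_writes(log_text):
    """Return a list of (addr, data, size) tuples, in log order."""
    return [tuple(int(g, 16) for g in m.groups()) for m in LINE_RE.finditer(log_text)]

def summarize(writes):
    """Count total writes, addresses written more than once, and backward address steps."""
    addr_counts = Counter(addr for addr, _, _ in writes)
    duplicates = sorted(a for a, n in addr_counts.items() if n > 1)
    backward = sum(1 for prev, cur in zip(writes, writes[1:]) if cur[0] < prev[0])
    return {"total": len(writes), "duplicate_addrs": duplicates, "backward_steps": backward}

# The two lines logged for the one-word slice assignment.
sample = """\
qemu_ram_write: addr 0x0, data 0x1, size 0x4
qemu_ram_write: addr 0x0, data 0x1, size 0x4
"""
print(summarize(parse_writes(sample)))
# {'total': 2, 'duplicate_addrs': [0], 'backward_steps': 0}
```

A non-zero backward_steps value flags out-of-order runs like the 32-word case, and duplicate_addrs lists the repeated addresses like those marked with a *.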

I've done some more testing and the results are interesting:

Number of 32-bit words to write	Actual number of 32-bit writes	Comment
1	2	Doubled
2	4	Doubled
4	8	Doubled
8	8	OK
16	16	OK
32	32	Out of order
64	68	Extra 4 writes
128	132	Extra 4 writes
230	244	Extra 14 writes!
256	260	Extra 4 writes

I'm running this in a guest under QEMU 7.0.0. The qemu_ram_write log entries come from a custom QEMU device that I created as part of the QEMU system host binary. It uses the internal QEMU memory_region_init_io (MMIO) API to hook up a callback to the .write operation at "physical RAM" offset 0x0a000000.

I have also verified this with a real Xilinx FPGA, on a Zynq platform, with an AXI Lite bus, logging the write transactions as they appear on the bus. I see the same unusual behaviour.

If I use busybox devmem 0x0a000000 32 0 or busybox devmem 0x0a000000 64 0 I see just one or two writes, as expected:

root@qemuarm64:~# devmem 0x0a000000 32 0x01020304
qemu_ram_write: addr 0x0, data 0x1020304, size 0x4
root@qemuarm64:~# devmem 0x0a000000 64 0x0102030405060708
qemu_ram_write: addr 0x0, data 0x5060708, size 0x4
qemu_ram_write: addr 0x4, data 0x1020304, size 0x4
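
The two lines logged for the 64-bit devmem write are what you would expect from a little-endian layout (an assumption, though it is consistent with the byte order in the bytearray examples above). A small sketch with the struct module shows the same split:

```python
import struct

value = 0x0102030405060708

# Pack as a little-endian 64-bit integer, then reinterpret as two 32-bit words.
low, high = struct.unpack("<II", struct.pack("<Q", value))

print(hex(low), hex(high))  # 0x5060708 0x1020304
```

The low word lands at offset 0x0 and the high word at offset 0x4, matching the log.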

Based on this, and the previously linked question, I'm looking very suspiciously at Python's memoryview, or possibly mmap.mmap. So what is going on here? Why isn't memoryview behaving in the manner I would expect it to? What's with the out-of-order and extra writes?

Note: This is observed on ARM64. I haven't tested this with x86-64 yet, which I can only really do with QEMU (no FPGA available). If I do this I will report back.

score 0 · Answer 1 · answered Jun 05 '22 at 07:01

I'm not familiar with the bus semantics of the ARM64 Cortex A57 CPU, as I've worked mostly with x86-64 processors. But this sounds suspiciously like the CPU optimization known as write-combining (WC).

WC is a performance optimization where writes to specific address spaces (a memory type known as UCWC, for Uncached Write-Combining) can be "combined" into a (usually) 64-byte buffer as they arrive, in any order, until the buffer is full, is flushed by fence instructions, or is flushed when certain other events happen (a H/W interrupt, other UC reads/writes, etc.). The buffer is then flushed using full 64-bit (8-byte) wide writes. The WC optimization is designed to use the CPU bus more efficiently than the S/W may be doing (for example, when the S/W is issuing 8-bit byte writes with "rep movsb"). When the WC buffer is flushed, writes can appear out of order, and partial writes can occur if the S/W hasn't written the entire buffer (to preserve the existing contents). WC buffers are aligned on 64-byte boundaries, so writes spanning two buffers will certainly flush the first, and quite possibly the second as well, through any of the events mentioned above.

What I don't understand from your examples, though, is that WC buffers are implemented internally to the CPU, so normally you wouldn't see anything on an analyzer until the buffer is flushed. Is there any virtualization occurring that would record writes as they happen (even when internal to the CPU)? That might explain why all writes are seen. Then again, as I mentioned, I'm not familiar with the semantics of ARM64 Cortex A57 CPUs.

BTW, the address space used (I think I saw) was 0xA000000; is this the physical address? Physical address 0xA000000 is usually reserved for the video RAM address space (check the PCIe Base Address Registers, or "BARs"). Note that video RAM is almost always set up as the UCWC type, since writes to video memory are not sensitive to ordering and can happen in any sequence. System firmware sets this address space up during BIOS initialization.

Also note: there are "streaming store" instructions that can treat normally mapped write-back cached ("WB") memory with WC semantics. This is an optimization designed to keep the CPU's caches from becoming "polluted", or filled with unimportant data, since writing to WB-cached memory triggers cache fills that read memory when the writes to the destination occur.
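
To make the 64-byte alignment point concrete, here is a toy sketch that lists which aligned 64-byte blocks a write of a given length overlaps. It models only the alignment arithmetic, with the 64-byte size taken from the description above; it is not a model of how the Cortex A57 (or any real CPU) actually buffers writes, and touched_lines is a made-up helper name.

```python
def touched_lines(start, nbytes, line_size=64):
    """Return base addresses of the aligned blocks that [start, start + nbytes) overlaps."""
    if nbytes <= 0:
        return []
    end = start + nbytes - 1
    first = start - (start % line_size)
    last = end - (end % line_size)
    return list(range(first, last + line_size, line_size))

# 32 words (128 bytes) starting at offset 0x0 cover two 64-byte blocks.
print([hex(a) for a in touched_lines(0x0, 32 * 4)])   # ['0x0', '0x40']

# The same length starting at 0x20 straddles three blocks.
print([hex(a) for a in touched_lines(0x20, 32 * 4)])  # ['0x0', '0x40', '0x80']
```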

ctiggs · Answer 2 · 2022-06-06T19:10:40.070

I forgot to add: it would probably be best to use a tool that can convert Python into its bytecode representation, and from there use a disassembler (if possible) to derive the assembly-language instructions. For example, there should be only one write to memory in the bytecode representation (i.e. the Python mv32[0] = 1 write). Bottom line: if the CPU is generating more writes on its bus than your disassembler or bytecode shows, it is very likely that some CPU-level optimization (such as WC) is occurring.
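
As a starting point for the bytecode half of that suggestion, here is a minimal sketch using the standard-library dis module. The variable names mv32 and data come from the question; the opnames helper is made up for illustration, and the exact opcode names printed depend on the Python version.

```python
import dis

def opnames(source):
    """Compile a statement and return the names of its bytecode instructions."""
    code = compile(source, "<example>", "exec")
    return [ins.opname for ins in dis.get_instructions(code)]

index_ops = opnames("mv32[0] = 1")
slice_ops = opnames("mv32[0:1] = data")

print(index_ops)
print(slice_ops)
```

Both statements compile to a single store-type instruction. Keep in mind that CPython interprets bytecode rather than compiling it to machine code, so the bytecode listing cannot show how many individual stores the interpreter's C code issues for that one instruction; getting to assembly would mean disassembling the interpreter and the C library it calls.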
