
I want to do a 64-byte transaction on PCIe. I am using a 9th-gen Intel Core i7 CPU.

I was able to do a 64-byte write transaction to PCIe device memory by mapping it as a WC (write-combining) region and writing the data like this:

    _mm256_store_si256(pcie_memory_address, ymm0);
    _mm256_store_si256(pcie_memory_address+32, ymm1);
    _mm_mfence();
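
For reference, the write sequence above can be sketched as a small helper. This is only a sketch under stated assumptions: the names `write64` and `write64_avx` are hypothetical, the destination is assumed to be 64-byte aligned, and GCC or Clang on x86-64 is assumed for the `target` attribute and `__builtin_cpu_supports`. Whether the two stores actually leave the CPU as a single 64-byte PCIe write depends on the mapping being WC and on the hardware; the code itself cannot guarantee that.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the original post): write 64 bytes to a
 * 64-byte-aligned destination using two aligned 32-byte stores, then fence. */
__attribute__((target("avx")))
static void write64_avx(void *dst, const uint8_t src[64])
{
    assert(((uintptr_t)dst & 63) == 0);   /* aligned stores fault otherwise */
    __m256i lo = _mm256_loadu_si256((const __m256i *)src);
    __m256i hi = _mm256_loadu_si256((const __m256i *)(src + 32));
    _mm256_store_si256((__m256i *)dst, lo);      /* bytes 0..31  */
    _mm256_store_si256((__m256i *)dst + 1, hi);  /* bytes 32..63 */
    _mm_mfence();
}

/* Falls back to memcpy when the CPU does not report AVX support. */
static void write64(void *dst, const uint8_t src[64])
{
    if (__builtin_cpu_supports("avx"))
        write64_avx(dst, src);
    else
        memcpy(dst, src, 64);
}
```

In real use, `dst` would be a pointer into the WC-mapped device memory; an ordinary 64-byte-aligned buffer is enough to check that the helper copies the bytes correctly.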

I tried a 64-byte read using this intrinsic:

    _mm256_loadu_si256();

I used it the same way as the write, but here the read occurs as two 32-byte reads.

Can anyone help me with this? I want to do a 64-byte read as a single burst.

I referred to the Intel documentation at this link: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/pcie-burst-transfer-paper.pdf

Rahul K V
  • As the name `WC` implies, this feature is about *write combining* memory. You can find some information about how this works [here](https://stackoverflow.com/q/49959963/3185968). Effectively, the processor has a couple of 64-byte registers that it can buffer writes (non-temporal or to `wc/uc` memory) in, so multiple separate writes (ideally) combine into a single bus transaction. The buffers don't do loads, and you don't want to load from `wc` memory if at all avoidable. Maybe AVX512 enables a single 64-byte load to cause a single bus transaction, but I'm not certain about that. – EOF May 16 '20 at 07:12
  • AFAICT, you should be able to replace `_mm256_loadu_si256()` with `_mm_stream_load_si128()` while keeping the memory `wc`. This should fetch a 64-byte cacheline in a single transaction into a fill buffer. A second aligned 32-byte load from the same cacheline should not cause a second bus transaction if the fill buffer was not evicted in-between (but you might not always be able to prevent this, depending on things like read-for-ownership of unrelated cachelines by other processors). – EOF May 16 '20 at 07:41

1 Answer


As suggested in the comments, I used `_mm256_stream_load_si256()` with WC memory, and I am now able to do a 64-byte read as well. This is how I used it:

    __m256i a = _mm256_stream_load_si256((__m256i*)mem_base + 0);
    __m256i b = _mm256_stream_load_si256((__m256i*)mem_base + 1);
    _mm_mfence();
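
The same read pattern can be wrapped in a helper with a runtime CPU check, since the 256-bit streaming load needs AVX2. This is a minimal sketch, not a definitive implementation: the names `read64` and `read64_stream` are hypothetical, the source is assumed to be 64-byte aligned, and GCC or Clang on x86-64 is assumed for the `target` attribute and `__builtin_cpu_supports`. Whether the two loads are served by a single 64-byte bus read still depends on the memory being mapped WC and on the fill buffer not being evicted in between, as noted in the comments on the question.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the original post): read one 64-byte,
 * 64-byte-aligned block with two 32-byte streaming loads. */
__attribute__((target("avx2")))
static void read64_stream(const void *src, uint8_t dst[64])
{
    assert(((uintptr_t)src & 63) == 0);   /* streaming loads need alignment */
    __m256i lo = _mm256_stream_load_si256((const __m256i *)src);      /* bytes 0..31  */
    __m256i hi = _mm256_stream_load_si256((const __m256i *)src + 1);  /* bytes 32..63 */
    _mm_mfence();
    _mm256_storeu_si256((__m256i *)dst, lo);
    _mm256_storeu_si256((__m256i *)(dst + 32), hi);
}

/* Falls back to memcpy when the CPU does not report AVX2 support. */
static void read64(const void *src, uint8_t dst[64])
{
    if (__builtin_cpu_supports("avx2"))
        read64_stream(src, dst);
    else
        memcpy(dst, src, 64);
}
```

On ordinary cacheable memory the streaming loads behave like normal loads, so a plain aligned buffer can stand in for `mem_base` when checking that the bytes come back correctly.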

Thank you all for your help.

Rahul K V
  • Note that inside the kernel using XMM / YMM registers is only safe inside `kernel_fpu_begin()` / `kernel_fpu_end();`. Otherwise you'll silently corrupt user-space state. Also that this requires AVX and will fault on a CPU without that. – Peter Cordes May 18 '20 at 06:55
  • I am using `_mm256_stream_load_si256()` in a user-space application, so do I need to use `kernel_fpu_begin()` / `kernel_fpu_end()`? And why is it needed? – Rahul K V May 18 '20 at 07:12
  • Oh, then your question shouldn't be tagged [tag:linux-kernel]; I fixed that for you. `kernel_fpu_begin` is only needed in kernel code. https://tthtlc.wordpress.com/2016/12/17/understanding-fpu-usage-in-linux-kernel/. User-space can of course run SIMD and floating-point instructions without any special setup. – Peter Cordes May 18 '20 at 07:51