
I found that my MMIO read/write latency is unreasonably high. I hope someone can give me some suggestions.

In kernel space, I wrote a simple program to read a 4-byte value at a PCIe device's BAR0 address. The device is an Intel 10G PCIe NIC plugged into a PCIe x16 slot on my Xeon E5 server. I use rdtsc to measure the time from the beginning of the MMIO read to the end; the code snippet looks like this:

vaddr = ioremap_nocache(0xf8000000, 128); /* 0xf8000000 is the BAR0 address of the device */
rdtscl(init);                             /* timestamp before the MMIO read */
ret = readl(vaddr);                       /* 4-byte MMIO read from the BAR */
rmb();                                    /* keep the read from being reordered past the timestamp */
rdtscl(end);                              /* timestamp after the MMIO read */

I'm expecting the elapsed time (end - init) to be less than 1 us; after all, the data traversing the PCIe link should only take a few nanoseconds. However, my test results show at least 5.5 us for an MMIO read of the PCIe device. I'm wondering whether this is reasonable. I changed my code to remove the memory barrier (rmb), but I still get around 5 us of latency.

This paper discusses PCIe latency measurement; usually it's less than 1 us: www.cl.cam.ac.uk/~awm22/.../miller2009motivating.pdf Do I need to do any special kernel or device configuration to get lower MMIO access latency? Does anyone have experience doing this?

William Tu

2 Answers


5 usec is great! Do that in a loop and gather statistics, and you might find much, much larger values.
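
A minimal sketch of such a loop, assuming vaddr is the pointer returned by ioremap_nocache() in the question and a recent x86 kernel that provides rdtsc_ordered() (the function name measure_mmio_read is made up for illustration):

#include <linux/io.h>        /* readl() */
#include <linux/kernel.h>    /* pr_info() */
#include <asm/msr.h>         /* rdtsc_ordered() */

static void measure_mmio_read(void __iomem *vaddr)
{
    u64 t0, t1, delta, min = ~0ULL, max = 0, total = 0;
    int i;

    for (i = 0; i < 1000; i++) {
        t0 = rdtsc_ordered();   /* TSC read ordered against preceding loads */
        (void)readl(vaddr);     /* the 4-byte non-posted MMIO read being timed */
        t1 = rdtsc_ordered();
        delta = t1 - t0;
        total += delta;
        if (delta < min)
            min = delta;
        if (delta > max)
            max = delta;
    }
    pr_info("readl latency (TSC cycles): min=%llu avg=%llu max=%llu\n",
            min, total / 1000, max);
}

The min is usually close to the hardware round-trip; the max shows how much interrupts and preemption can inflate a single sample.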

There are several reasons for this. BARs are usually non-cacheable and non-prefetchable; check yours using pci_resource_flags(). If the BAR is marked cacheable, then cache coherency (the process of ensuring that all CPUs see the same cached value) might be one issue.
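
For example, a quick sketch of that check, assuming pdev is the struct pci_dev * for your NIC (dump_bar0_flags is just an illustrative name):

#include <linux/pci.h>       /* pci_resource_flags() */
#include <linux/ioport.h>    /* IORESOURCE_* flag bits */

static void dump_bar0_flags(struct pci_dev *pdev)
{
    unsigned long flags = pci_resource_flags(pdev, 0);

    pr_info("BAR0 flags=0x%lx: %s, %s\n", flags,
            (flags & IORESOURCE_PREFETCH)  ? "prefetchable" : "non-prefetchable",
            (flags & IORESOURCE_CACHEABLE) ? "cacheable"    : "non-cacheable");
}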

Secondly, an I/O read is always a non-posted transaction. The CPU has to stall until it gets permission to communicate on some data bus, and stall a bit more until the data arrives on said bus. This bus is made to appear like memory but in actual fact is not, and the stall might be a non-interruptible busy wait, but it's non-productive nevertheless. So I would expect the worst-case latency to be much higher than 5 us, even before you start to consider task preemption.

toomanychushki

If the NIC has to go over the network, possibly through switches, to get the data from a remote host, 5.5 us is a reasonable read time. If you are reading a register in the local PCIe device, it should be less than 1 us. I don't have any experience with the Intel 10G NIC, but I have worked with InfiniBand and custom cards.

  • I measure less than 1 us to read a word in the BAR of a device on the local PCIe bus. Not sure why my comment was downvoted, since I'm just confirming that the results in the paper are realistic. The BAR was mapped into user space, then we just read the address. Are you counting the ioremap_nocache() time too? As part of my job, I read registers in BARs in systems across the room, and it takes less than 5.5 us. I'm using RDMA over Mellanox FDR InfiniBand with an IB switch between the systems. – Mark Sherred Jan 12 '18 at 22:02
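
For reference, a minimal user-space sketch of the approach the comment describes, mapping BAR0 through the sysfs resource0 file and timing one 32-bit read with the TSC. The device address 0000:01:00.0 is a made-up example; substitute your NIC's domain:bus:dev.fn (run as root):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <x86intrin.h>   /* __rdtsc() */

int main(void)
{
    /* Hypothetical device address; replace with your NIC's. */
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned long long t0 = __rdtsc();
    uint32_t val = bar[0];              /* one 4-byte MMIO read of the first register */
    unsigned long long t1 = __rdtsc();

    printf("value=0x%x, tsc delta=%llu cycles\n", val, t1 - t0);
    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}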