I have a server with 2 Intel Xeon E5-2620 CPUs (Sandy Bridge) and a dual-port 10Gbps 82599 NIC, which I use for high-performance computing. From the PCI affinity I can see that the 10G NIC is connected to CPU1. I launched several packet receiving threads to run experiments; each thread receives packets, does IP/UDP parsing, and copies the payload into a buffer. The driver I used for the 10G NIC is the PacketShader I/O engine (PacketShader/Packet-IO-Engine on GitHub).
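
For reference, this is roughly how I bind each receiving thread to a core. A minimal sketch assuming Linux and glibc pthreads; the core IDs are examples and the receive/parse steps are placeholders for the real Packet-IO-Engine calls:

```c
/* Minimal sketch of pinning receive workers to specific cores, assuming
 * Linux and glibc pthreads. The receive/parse steps are placeholders for
 * the real Packet-IO-Engine driver calls, and the core IDs are examples. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

#define NUM_WORKERS 4

static void *rx_worker(void *p)
{
    int core = *(int *)p;

    /* Pin this thread to one core so its cache and memory traffic stays
     * on the socket that owns that core. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err) {
        fprintf(stderr, "core %d: affinity failed: %s\n", core, strerror(err));
        return NULL;
    }

    /* The real loop goes here: receive a batch from the NIC port, parse
     * the IP/UDP headers, and copy the payload into a buffer. */
    printf("worker pinned to core %d\n", core);
    return NULL;
}

int main(void)
{
    /* Example core IDs; the cores that belong to each socket are listed in
     * /sys/devices/system/node/nodeN/cpulist. */
    int cores[NUM_WORKERS] = { 0, 1, 2, 3 };
    pthread_t tid[NUM_WORKERS];

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&tid[i], NULL, rx_worker, &cores[i]);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```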

Q1: Idle CPU1 degrades CPU0 packet receiving performance

1.1) If 1, 2, or 4 threads are bound to CPU0, the overall performance of all threads is about 2.6-3.2 Gbps.
1.2) If 2 threads are bound to CPU1, the overall performance is 16.X Gbps.
1.3) If 4 threads are bound to CPU1, the overall performance is 19.X Gbps (the maximum for the 2 × 10G ports).

Since CPU0 is not directly connected to the NIC, it seems that the maximum receiving speed on CPU0 is 2.6-3.2 Gbps. However, I found that if some computation-intensive processes run on CPU1, the packet receiving threads on CPU0 jump to 15.X Gbps with 2 threads and 19.X Gbps with 4 threads.
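
A simple way to test this without real computation is to pin a dummy spin loop to a CPU1 core while the receiving threads run on CPU0. A sketch (core 6 is an assumption; the actual CPU1 cores are listed in /sys/devices/system/node/node1/cpulist):

```c
/* Assumption-laden test harness, not driver code: keep one CPU1 core busy
 * with a spin loop while the receive threads run on CPU0, to check whether
 * the speed-up appears even without real computation. Core 6 is a guess;
 * the real CPU1 cores are in /sys/devices/system/node/node1/cpulist. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static volatile unsigned long sink;    /* keeps the loop from being optimized away */

static void *spin_on_cpu1(void *arg)
{
    (void)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(6, &set);                  /* assumed: core 6 belongs to CPU1 */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (;;)                           /* burn cycles so the socket never idles */
        sink++;
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, spin_on_cpu1, NULL);
    puts("spinning on a CPU1 core; run the CPU0 receive test now");
    pthread_join(tid, NULL);           /* never returns; stop with Ctrl-C */
    return 0;
}
```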

Is this due to power management? If CPU1 is idle, does it run in a power-saving mode? Even if it does, how can CPU1 influence the performance of CPU0? Is there something I don't know about QPI?

Q2: Overloaded CPU1 degrades all packet receiving performance

2.1) If 1 packet receiving thread runs on CPU0 and 1 packet receiving thread runs on CPU1, the overall performance is 10 Gbps. The performance of each thread is almost the same -- 5.X Gbps.
2.2) If 2 packet receiving threads run on CPU0 and 2 packet receiving threads run on CPU1, the overall performance is 13 Gbps, and the performance of each thread is almost the same -- 3.X Gbps, which is lower than in 2.1, 1.2, and 1.3.

In short, when receiving threads run on both CPU0 and CPU1, none of the threads can reach its maximum performance, and their performance is almost the same.

I think there is much I don't know about NUMA and QPI. Can anyone help me explain this? Thanks.

1 Answer

Q1: Yes, that sounds like it could be due to power management. QPI has low-power states, and so do the PCIe slots hanging directly off of each processor socket, the CPU cores, and the package as a whole. Details here: https://software.intel.com/en-us/articles/power-management-states-p-states-c-states-and-package-c-states

If you have access to the BIOS, try disabling QPI L-states, PEG PCIe L-states and CPU C-states. If that fixes it, you can back off some of those settings to figure out which one(s) bear the most responsibility for the performance degradation.
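
If you can't get into the BIOS right away, you can test the C-state part of this at runtime through the Linux PM QoS interface: holding /dev/cpu_dma_latency open with a value of 0 asks the kernel to keep CPUs out of deep C-states for as long as the file stays open. A minimal sketch (usually needs root):

```c
/* Runtime alternative (assumes Linux with the PM QoS interface) to
 * disabling C-states in the BIOS: writing 0 to /dev/cpu_dma_latency and
 * keeping the descriptor open keeps CPUs out of deep C-states until the
 * file is closed. Useful for confirming the diagnosis before changing
 * BIOS settings. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/cpu_dma_latency", O_WRONLY);
    if (fd < 0) {
        perror("open /dev/cpu_dma_latency");
        return 1;
    }

    int32_t latency_us = 0;            /* 0 = no deep C-states allowed */
    if (write(fd, &latency_us, sizeof(latency_us)) != sizeof(latency_us)) {
        perror("write");
        close(fd);
        return 1;
    }

    puts("C-state limit active; run the packet test, then press Enter to release");
    getchar();                         /* the request is dropped when fd is closed */
    close(fd);
    return 0;
}
```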

Q2: Intel provides some details on ordering rules and flow control for PCIe that might be relevant, but there isn't much you can do about them other than know that they exist and that they can constrain performance. There could be similar constraints in the uncore of either socket that are not publicly documented. If either of those is the case, you might be able to dig in further with VTune and see which resources are getting exhausted.

There could also be performance left on the table in the synchronization scheme used in the NIC driver. VTune's "Concurrency" and "Locks and Waits" analysis types could help identify such issues and guide fixing them.
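
To make the synchronization point concrete, below is the kind of contention those analyses would flag: if every receive thread funnels work through one mutex-protected structure, threads on both sockets serialize on that lock and converge to roughly the same rate, while per-thread state avoids the shared point entirely. This is an illustrative sketch with made-up names, not the Packet-IO-Engine code:

```c
/* Illustrative contention sketch: a shared mutex-protected counter stands
 * in for "append to a shared buffer", per-thread counters stand in for
 * "append to your own buffer". The names are hypothetical. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    1000000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long shared_count;             /* every thread takes the same lock */
static unsigned long local_count[NTHREADS];    /* no lock on the fast path */

static void *rx_shared(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);
        shared_count++;                        /* "append to shared buffer" */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *rx_local(void *arg)
{
    unsigned long *mine = arg;
    for (int i = 0; i < ITERS; i++)
        (*mine)++;                             /* "append to own buffer" */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, rx_shared, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, rx_local, &local_count[i]);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    unsigned long total = 0;
    for (int i = 0; i < NTHREADS; i++)
        total += local_count[i];
    printf("shared: %lu, per-thread total: %lu\n", shared_count, total);
    return 0;
}
```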

Aaron Altman