
I am writing code that receives raw Ethernet packets (no TCP/UDP) from a server every 1 ms. For every packet received, my application has to reply with 14 raw packets. If the server does not receive the 14 packets before it sends its next packet (scheduled every 1 ms), the server raises an alarm and the application has to break out. The server-client communication is a one-to-one link.

The server is hardware (an FPGA) which generates packets at a precise 1 ms interval. The client application runs on a Linux (RHEL/CentOS 7) machine with a 10G SolarFlare NIC.

My first version of the code looks like this:

/* sockfd is a raw Ethernet socket; buf is the receive buffer and
   sym holds the reply packet (setup code not shown) */
while (1)
{
   /* block until the next packet from the server arrives */
   while (1)
   {
      numbytes = recvfrom(sockfd, buf, sizeof(buf), 0, NULL, NULL);
      if (numbytes > 0)
      {
         //Some more lines here, to read the packet number
         break;
      }
   }
   /* reply with the 14 packets the server expects */
   for (i = 0; i < 14; i++)
   {
      if (sendto(sockfd, (void *)sym, sizeof(sym), 0, NULL, NULL) < 0)
             perror("Send failed");
   }
}

I measure the receive time by taking timestamps (using clock_gettime) before and after the recvfrom call, and I print the difference between the two timestamps whenever it falls outside the allowable range of 900-1100 us.
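
For clarity, the measurement around recvfrom looks roughly like this (a simplified fragment of the loop above, error handling omitted; CLOCK_MONOTONIC is the clock I use):

#include <stdio.h>
#include <time.h>   /* clock_gettime */

struct timespec t_before, t_after;
long diff_us;

clock_gettime(CLOCK_MONOTONIC, &t_before);
numbytes = recvfrom(sockfd, buf, sizeof(buf), 0, NULL, NULL);
clock_gettime(CLOCK_MONOTONIC, &t_after);

/* difference between the two timestamps, in microseconds */
diff_us = (t_after.tv_sec  - t_before.tv_sec)  * 1000000L
        + (t_after.tv_nsec - t_before.tv_nsec) / 1000L;

if (diff_us < 900 || diff_us > 1100)
    printf("Decode Time : %ld\n", diff_us);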

The problem I am facing is that the packet receive time fluctuates, something like this (the printed values are in microseconds):

Decode Time : 1234
Decode Time : 762
Decode Time : 1593
Decode Time : 406
Decode Time : 1703
Decode Time : 257
Decode Time : 1493
Decode Time : 514
and so on..

And sometimes the decode times exceed 2000 us, at which point the application breaks.

In this situation, the application breaks anywhere between 2 seconds and a few minutes after starting.

Options I have tried so far:

  1. Setting affinity to a particular isolated core (a rough setup sketch for options 1-3 follows this list).
  2. Setting the scheduling priority to the maximum with SCHED_FIFO.
  3. Increasing the socket buffer sizes.
  4. Setting the network interface interrupt affinity to the same core that runs the application.
  5. Spinning over recvfrom using poll()/select() calls (the modified receive loop is sketched a little further below).
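
Roughly, the setup for options 1-3 looks like this (a simplified sketch inside my init function; the core number and buffer size below are placeholders, not my exact values):

#define _GNU_SOURCE          /* needed for CPU_SET/sched_setaffinity */
#include <sched.h>
#include <string.h>
#include <sys/socket.h>

/* 1. pin the process to one isolated core (core 3 is just an example) */
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(3, &set);
sched_setaffinity(0, sizeof(set), &set);

/* 2. run the process at the highest SCHED_FIFO priority */
struct sched_param sp;
memset(&sp, 0, sizeof(sp));
sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
sched_setscheduler(0, SCHED_FIFO, &sp);

/* 3. enlarge the socket receive and send buffers (size is an example) */
int bufsize = 4 * 1024 * 1024;
setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));
setsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));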

All these options give a significant improvement over the initial version of the code; the application now runs for about 1-2 hours. But this is still not enough.
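
For reference, the receive loop with option 5 looks roughly like this (a sketch; a zero timeout makes poll() return immediately, so the loop spins instead of blocking inside recvfrom):

#include <poll.h>

struct pollfd pfd;
pfd.fd = sockfd;
pfd.events = POLLIN;

/* spin until the next packet is ready, then read it */
while (1)
{
   if (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN))
   {
      numbytes = recvfrom(sockfd, buf, sizeof(buf), 0, NULL, NULL);
      if (numbytes > 0)
         break;   /* packet number is read here, as before */
   }
}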

A few observations:

  1. I get a huge dump of these decode time prints whenever I open ssh sessions to the Linux machine while the application is running (which makes me think network traffic over the other 1G Ethernet interface is interfering with the 10G Ethernet interface).
  2. The application performs better on RHEL (run times of about 2-3 hours) than on CentOS (run times of about 30 mins - 1.5 hours).
  3. The run times also vary between Linux machines with different hardware configurations running the same OS.

Please suggest any other methods to improve the run time of the application.

Thanks in advance.

Vikram
  • Besides the processing time, you need to understand that in the real world, networks will vary packet delivery times greatly. You can mitigate this to some degree, if it all stays on your own network (doesn't travel over the Internet), by having solid QoS policies in place and defining priority queues for this traffic. Otherwise, I wouldn't even attempt to use something with such tight timings on a network. – Ron Maupin Feb 12 '16 at 08:12
  • I'd suggest, if you can, trying a PREEMPT_RT-patched Linux kernel. – LPs Feb 12 '16 at 08:18
  • It would be nice to know what you want to achieve, as sending packets with this precision is certainly not feasible over Ethernet. I would suggest having another FPGA to process your data AND interface with your PC. – Koshinae Feb 12 '16 at 08:18
  • @RonMaupin Actually the link is one-to-one. No Internet is involved. The server and client are directly connected without any router or switch. – Vikram Feb 12 '16 at 08:18
  • Is the established connection Full Duplex? – LPs Feb 12 '16 at 08:29
  • @LPs Yes. It's full duplex. The connection is through a pair (tx & rx) of optical cables. – Vikram Feb 12 '16 at 08:42
  • @Vikram Check whether your network interface does interrupt coalescing. To reduce CPU load it's relatively common to delay interrupt processing on network interfaces, waiting for more than one packet to arrive so that more packets can be processed per interrupt. `ethtool -c` is the tool to look at those values, but you'll need to find documentation on what they mean yourself. – Art Feb 12 '16 at 08:51
  • @Art Thanks for the suggestion. I tried `ethtool -C ethx adaptive-rx off rx-usecs 0 rx-frames 0`, since my application is latency-sensitive. This was suggested in this doc - https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf . But it didn't help. – Vikram Feb 12 '16 at 08:58
  • [THIS](https://blog.cloudflare.com/how-to-achieve-low-latency/) may help. – LPs Feb 12 '16 at 09:22
  • @LPs Thanks for sharing the article. I'll check if they help. – Vikram Feb 12 '16 at 09:31
  • You can use [perf](https://perf.wiki.kernel.org/index.php/Tutorial) to record system events that can give you more information about the internal timings of your networking within the kernel. – Zulan Feb 12 '16 at 11:24
  • You can't beat an FPGA. For these precise requirements, your client should also be an FPGA on the NIC. This is really the only solution. – SergeyA Feb 12 '16 at 14:40

1 Answer


First, you need to verify the accuracy of the timestamping method, clock_gettime. Its resolution is nanoseconds, but its accuracy and precision are in question. That is not the answer to your problem, but it establishes how reliable the timestamping is before proceeding. See Difference between CLOCK_REALTIME and CLOCK_MONOTONIC? for why CLOCK_MONOTONIC should be used for your application.
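
As a quick sanity check, the overhead and jitter of the timestamp call itself can be measured with back-to-back reads, for example (a minimal sketch):

#include <stdio.h>
#include <time.h>

/* Back-to-back CLOCK_MONOTONIC reads: the deltas show the overhead and
   jitter of clock_gettime itself, which should be far below 1 us. */
int main(void)
{
    struct timespec a, b;
    long d, min_ns = 1000000000L, max_ns = 0;
    int i;

    for (i = 0; i < 1000000; i++) {
        clock_gettime(CLOCK_MONOTONIC, &a);
        clock_gettime(CLOCK_MONOTONIC, &b);
        d = (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
        if (d < min_ns) min_ns = d;
        if (d > max_ns) max_ns = d;
    }
    printf("clock_gettime delta: min %ld ns, max %ld ns\n", min_ns, max_ns);
    return 0;
}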

I suspect the majority of the decode time fluctuation is due to a variable number of operations per decode, operating-system context switching, or IRQs.

I cannot comment on the operations per decode, since the code in your post has been simplified; this can be profiled and inspected as well.

Context switching per process can be easily inspected and monitored; see https://unix.stackexchange.com/a/84345
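
The counters can also be read from inside the process itself, for example via getrusage (a minimal sketch):

#include <stdio.h>
#include <sys/resource.h>

/* Print voluntary and involuntary context-switch counts for the calling
   process; call this periodically and watch the involuntary count. */
static void print_ctxt_switches(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0)
        printf("ctxt switches: voluntary %ld, involuntary %ld\n",
               ru.ru_nvcsw, ru.ru_nivcsw);
}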

As Ron stated, these are very strict timing requirements for a network; it must be an isolated, single-purpose network. Your observation of decode over-times while ssh'ing indicates that all other traffic must be prevented, which is disturbing given the separate NICs. Thus I suspect IRQs are the issue; see /proc/interrupts.

Achieving consistent decode times over long intervals (hours to days) will require drastically simplifying the OS: removing unnecessary processes, services, and hardware, and perhaps building your own kernel, all with the goal of reducing context switching and interrupts. At that point a real-time OS should be considered. This will only improve the probability of consistent decode times, not guarantee them.

My work is developing a data acquisition system that is a combination of FPGA ADC, PC, and Ethernet. Inevitably, the inconsistency of a multi-purpose PC means certain features must be moved to dedicated hardware. Consider the pros and cons of developing your application for a PC versus moving it to hardware.

Tyson Hilmer
  • I am using `CLOCK_MONOTONIC` for taking the timestamps, and the calculated time does match the observed results. – Vikram Feb 12 '16 at 09:37
  • I have isolated a few CPU cores using the `isolcpus` kernel parameter. Checking the running processes with `ps -eF`, I find that no processes run on those isolated cores except migration, ksoftirqd, and kworker, which I know cannot be avoided. – Vikram Feb 12 '16 at 09:41