
I’m working on a packet-reshaping project in Linux using the BeagleBone Black. Basically, packets are received on one VLAN, modified, and then sent out on a different VLAN. The process is bidirectional: neither VLAN is designated as input-only or output-only. It’s similar to a network bridge, except that packets are altered (sometimes fairly significantly) in transit.

I’ve tried two different methods for accomplishing this:

  1. Creating a user-space application that opens raw sockets on both interfaces. All packet processing (including bridging) is handled in the application.
  2. Setting up a software bridge (using the kernel bridge module) and adding a kernel module that installs a netfilter hook at the bridge post-routing stage (NF_BR_POST_ROUTING). All packet processing is handled in the kernel (a minimal hook skeleton is sketched after this list).
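
For reference, a minimal skeleton of the second approach might look like the sketch below. This is not the project's actual module: it assumes a kernel recent enough to have `nf_register_net_hook()` (older kernels use `nf_register_hook()` and a different hook prototype), and `reshape_skb()` is a hypothetical stand-in for the packet-rewriting logic.

```c
/*
 * Minimal sketch of a bridge post-routing netfilter hook.
 * Assumes a kernel with nf_register_net_hook(); reshape_skb()
 * is a hypothetical placeholder for the real packet rewriting.
 */
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_bridge.h>
#include <linux/skbuff.h>
#include <net/net_namespace.h>

static unsigned int reshape_hook(void *priv, struct sk_buff *skb,
                                 const struct nf_hook_state *state)
{
    /* reshape_skb(skb);  -- project-specific packet modification */
    return NF_ACCEPT;    /* let the (modified) frame continue */
}

static struct nf_hook_ops reshape_ops = {
    .hook     = reshape_hook,
    .pf       = NFPROTO_BRIDGE,
    .hooknum  = NF_BR_POST_ROUTING,
    .priority = NF_BR_PRI_LAST,
};

static int __init reshape_init(void)
{
    return nf_register_net_hook(&init_net, &reshape_ops);
}

static void __exit reshape_exit(void)
{
    nf_unregister_net_hook(&init_net, &reshape_ops);
}

module_init(reshape_init);
module_exit(reshape_exit);
MODULE_LICENSE("GPL");
```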

The second option appears to be around four times faster than the first. I’d like to understand why. I’ve tried brainstorming a bit and wondered whether there is a substantial performance hit in rapidly switching between kernel and user space, or whether something about the socket interface is inherently slow.

I think the user-space application is fairly well optimized (for example, I’m using PACKET_MMAP), but it’s possible that it could be optimized further. I ran perf on the application and noticed that it spends a good deal of time (35%) in v7_flush_kern_dcache_area, so perhaps that is a likely culprit. If there are any other suggestions on common ways to optimize packet processing, I can give them a try.
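
For readers unfamiliar with PACKET_MMAP: the RX side amounts to mapping a ring of frame buffers that the kernel fills directly, avoiding a copy per receive call. A minimal setup sketch follows, with most error handling omitted and an arbitrary example ring geometry (the real application's parameters will differ):

```c
/* Minimal PACKET_MMAP (TPACKET_V2) RX ring setup sketch.
 * Ring geometry is an arbitrary example; error handling is trimmed. */
#include <stddef.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <sys/mman.h>
#include <sys/socket.h>

int open_mmap_socket(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;

    int ver = TPACKET_V2;
    setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));

    struct tpacket_req req = {
        .tp_block_size = 4096,
        .tp_block_nr   = 64,
        .tp_frame_size = 2048,
        .tp_frame_nr   = 128,   /* (block_size / frame_size) * block_nr */
    };
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

    /* map the ring; the kernel writes frames here, no copy per recv() */
    void *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED)
        return -1;

    /* frames are then consumed by walking tpacket2_hdr slots in the ring */
    return fd;
}
```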

nkarstens
  • Context switches such as user mode/kernel mode/user mode are very quick in all versions of Linux. The code is found in [entry-common.S swi handler](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/kernel/entry-common.S#n112) and [entry-common.S ret_syscall](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/kernel/entry-common.S#n28). The more likely culprit is buffer copying. See: [Zero copy vs kernel bypass](http://stackoverflow.com/questions/18343365/zero-copy-networking-vs-kernel-bypass). – artless noise Jan 14 '15 at 21:20
  • That is a context switch from a user app to the kernel and back to the **SAME** user app; there is no MM (memory management) switch. The scheduler may try to run something else, so you could benefit from an RT priority. See the 'WORK_FLAG' in ret_syscall, where the re-schedule could happen. – artless noise Jan 14 '15 at 21:28

2 Answers


Context switches are expensive, and a switch from kernel to user space implies a context switch. You can see this article for exact numbers, but the stated durations are all on the order of microseconds.

You can also use lmbench to benchmark the real cost of context switches on your particular CPU.
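
If installing lmbench is inconvenient, the idea behind its lat_ctx benchmark can be approximated with a crude pipe ping-pong between two processes; each round trip forces two context switches. A minimal sketch (the figure it prints includes pipe read/write overhead, so treat it as an upper bound):

```c
/* Crude context-switch microbenchmark in the spirit of lmbench's
 * lat_ctx: two processes ping-pong one byte over a pair of pipes,
 * forcing two context switches per round trip. Error handling is
 * mostly omitted for brevity. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ROUNDS 100000

int main(void)
{
    int p2c[2], c2p[2];          /* parent->child and child->parent pipes */
    char b = 0;

    if (pipe(p2c) || pipe(c2p))
        return 1;

    if (fork() == 0) {           /* child: echo every byte back */
        for (int i = 0; i < ROUNDS; i++) {
            read(p2c[0], &b, 1);
            write(c2p[1], &b, 1);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {  /* parent: drive the round trips */
        write(p2c[1], &b, 1);
        read(c2p[0], &b, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* each round trip is two switches (there and back) */
    printf("~%.0f ns per switch (upper bound)\n", ns / ROUNDS / 2);
    return 0;
}
```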

b4hand

The performance of the user-space application also depends on the syscall used to monitor the sockets. The epoll() interface is the fastest option when you need to handle a lot of sockets; select() performs very poorly with a large number of sockets.

See this post for an explanation: Why is epoll faster than select?
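
The advantage shows up mainly with many file descriptors, but for reference a minimal epoll loop looks roughly like the sketch below; `sockfd` and `handle_ready()` are hypothetical placeholders for the application's own socket and packet handler.

```c
/* Minimal epoll event loop sketch; sockfd and handle_ready() are
 * hypothetical placeholders. Error handling is omitted for brevity. */
#include <sys/epoll.h>

extern void handle_ready(int fd);   /* hypothetical packet handler */

void event_loop(int sockfd)
{
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sockfd };

    epoll_ctl(ep, EPOLL_CTL_ADD, sockfd, &ev);

    for (;;) {
        struct epoll_event ready[16];
        /* block until at least one registered fd is readable */
        int n = epoll_wait(ep, ready, 16, -1);

        for (int i = 0; i < n; i++)
            handle_ready(ready[i].data.fd);
    }
}
```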

schorsch_76
  • Thanks for submitting an answer. My application has a single socket and uses poll() to check the socket. – nkarstens Feb 12 '15 at 19:25