To complete the circle (and hopefully help someone who may be looking for a similar solution), we solved our problem by making use of libnetfilter_queue. The challenge we had was, we did not have access to source code of the application, else we could have done the fragmentation at the application level itself.
Here's the relevant excerpt from our internal document prepared by Sriram Dharwadkar, who also did the implementation. Some of the references are to our internal application names, but don't think you should have any issues in understanding.
Final Solution
NetFilter Queues is the user space library providing APIs to process the packets that have been queued by the kernel packet filter.
Application willing to make use of this functionality should link to netfilter_queue & nfnetlink dynamically and include necessary headers from sysroot-target/usr/include/libnetfilter_queue/ and
sysroot-target/usr/include/libnfnetlink/.
Iptables with NFQUEUE as the target is required to be added.
NFQUEUE is an iptables and ip6tables target which delegates the decision on packets to a userspace software. For example, the following rule will ask for a decision to a listening userspace program for all packets queued up.
iptables -A INPUT -j NFQUEUE --queue-num 0
In userspace, a software must have used libnetfilter_queue apis to connect to queue 0 (the default one) and get the messages from kernel. It then must issue a verdict on the packet.
When a packet reach an NFQUEUE target it is en-queued to the queue corresponding to the number given by the --queue-num option. The packet queue is a implemented as a chained list with element being the packet and metadata (a Linux kernel skb):
- It is a fixed length queue implemented as a linked-list of packets
- Storing packet which are indexed by an integer
- A packet is released when userspace issue a verdict to the corresponding index integer
- When queue is full, no packet can be enqueued to it
- Userspace can read multiple packets and wait for giving a verdict. If the queue is not full there is no impact of this behavior
- Packets can be verdict without order. Userspace can read packet 1,2,3,4 and verdict at 4,2,3,1 in that order
- Too slow verdict will result in a full queue. Kernel will then drop incoming packets instead of en-queuing them.
- The protocol used between kernel and userspace is nfnetlink. This is a message based protocol which does not involve any shared memory. When a packet is en-queued, the kernel sends a nfnetlink formatted message containing packet data and related information to a socket. Userspace reads this message and issues a verdict
Prefragmentation logic is implemented in AvPreFragApp (new application) as well as Security Broker (existing controller application).
In Security Broker, as soon as the tunnel is established. Following two rules are added to RAW table.
For TCP prefragmentation:
/usr/sbin/iptables -t raw -I OUTPUT 1 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360
Above rule negotiates a proper MSS size during three way hand shake.
It is safe to assume that, 1360+TCPH+IPH+ESP+IPH <= 1500, so that after encryption fragmentation wont happen.
For UDP prefragmentation:
/usr/sbin/iptables -t raw -I OUTPUT 2 -s <tia> -p udp -m mark ! --mark 0xbeef0000/0xffff0000 -j NFQUEUE
Above rule queues all the udp packets with src ip as TIA (tunnel address) and mark not equal to 0xbeef0000 to the netfilter queue to be processed by application. 0xbeef0000 will be marked by AvPreFragApp on all the udp packets that are queued. This is done to avoid repeated queuing of packets.
AvPreFragApp
AvPreFragApp application makes use of netfilter queues to process the packets that are queued by NFQUEUE target.
As mentioned above, iptables rule to queue the udp packets having TIA as the src ip is added in security broker. This rule is added upon tunnel establishment and updated upon tunnel bounce with the new TIA. So all the packets with TIA as the source ip are queued up for processing by AvPreFragApp.
- AvPreFragApp calls set of apis from libnetfilter_queue to setup the queue and copy the packet from kernel to the application
- While creating queue, pass the callback function address, which is called, once the packet is queued for processing
- NFQNL_COPY_PACKET mode needs to be set, it copies the whole packet from kernel to application
- File descriptor can be obtained using netfilter queue handler. Using recv function packet buffer can be obtained
- While processing the packet, AvPreFragApp checks the size of the packet. If the packet size is <= 1376. ACCEPT verdict is given. Also if the DF bit is set, ACCEPT verdict is given
- If the packet size is > 1376 and DF bit is not set, DROP verdict is given. It means the original packet is dropped. But before that the packet would have got copied to application buffer.Now AvPreFragApp does the fragmentation in application. All those fragments are written to raw socket with the mark 0xbeef0000. sendmsg is used to write packets to raw socket
- These prefragmented packets are encrypted and ESP encapsulated in kernel.
Note: TIA: Tunnel Internal Address, the logical IPSec interface.