
I’m writing a traffic generator in C using the PACKET_MMAP socket option to create a ring buffer to send data over a raw socket. The ring buffer is filled with Ethernet frames to be sent and sendto is called. The entire contents of the ring buffer are then sent over the socket, which should give higher performance than having a buffer in memory and calling sendto repeatedly for every frame that needs sending.

When not using PACKET_MMAP, upon calling sendto a single frame is copied from the buffer in user-land memory to an SK buf in kernel memory, then the kernel must copy the packet to memory accessed by the NIC for DMA, and signal the NIC to DMA the frame into its own hardware buffers and queue it for transmission. When using the PACKET_MMAP socket option, mmapped memory is allocated by the application and linked to the raw socket. The application places packets into the mmapped buffer and calls sendto, and instead of the kernel having to copy the packets into an SK buf it can read them from the mmapped buffer directly. Also, "blocks" of packets can be read from the ring buffer instead of individual packets/frames. So the performance increase should be one sys-call to copy multiple frames, and one less copy action for each frame to get it into the NIC hardware buffers.

When I compare the performance of a socket using PACKET_MMAP to a “normal” socket (a char buffer with a single packet in it) there is no performance benefit at all. Why is this? When using PACKET_MMAP in Tx mode, only one frame can be put into each ring block (rather than multiple frames per ring block as with Rx mode); however, I am creating 256 blocks, so we should be sending 256 frames in a single sendto call, right?

Performance with PACKET_MMAP, main() calls packet_tx_mmap():

bensley@ubuntu-laptop:~/C/etherate10+$ sudo taskset -c 1 ./etherate_mt -I 1
Using inteface lo (1)
Running in Tx mode
1. Rx Gbps 0.00 (0) pps 0   Tx Gbps 17.65 (2206128128) pps 1457152
2. Rx Gbps 0.00 (0) pps 0   Tx Gbps 19.08 (2385579520) pps 1575680
3. Rx Gbps 0.00 (0) pps 0   Tx Gbps 19.28 (2409609728) pps 1591552
4. Rx Gbps 0.00 (0) pps 0   Tx Gbps 19.31 (2414260736) pps 1594624
5. Rx Gbps 0.00 (0) pps 0   Tx Gbps 19.30 (2411935232) pps 1593088

Performance without PACKET_MMAP, main() calls packet_tx():

bensley@ubuntu-laptop:~/C/etherate10+$ sudo taskset -c 1 ./etherate_mt -I 1
Using inteface lo (1)
Running in Tx mode
1. Rx Gbps 0.00 (0) pps 0   Tx Gbps 18.44 (2305001412) pps 1522458
2. Rx Gbps 0.00 (0) pps 0   Tx Gbps 20.30 (2537520018) pps 1676037
3. Rx Gbps 0.00 (0) pps 0   Tx Gbps 20.29 (2535744096) pps 1674864
4. Rx Gbps 0.00 (0) pps 0   Tx Gbps 20.26 (2533014354) pps 1673061
5. Rx Gbps 0.00 (0) pps 0   Tx Gbps 20.32 (2539476106) pps 1677329

The packet_tx() function seems slightly faster than the packet_tx_mmap() function, but it is also slightly shorter, so I think the minimal performance increase is simply down to the slightly fewer lines of code present in packet_tx(). So it looks to me like both functions have practically the same performance. Why is that? Why isn't PACKET_MMAP much faster? As I understand it, there should be far fewer sys-calls and copies.

void *packet_tx_mmap(void* thd_opt_p) {

    struct thd_opt *thd_opt = thd_opt_p;
    int32_t sock_fd = setup_socket_mmap(thd_opt_p);
    if (sock_fd == EXIT_FAILURE) exit(EXIT_FAILURE);

    struct tpacket2_hdr *hdr;
    uint8_t *data;
    int32_t send_ret = 0;
    uint16_t i;

    while(1) {

        for (i = 0; i < thd_opt->tpacket_req.tp_frame_nr; i += 1) {

            hdr = (void*)(thd_opt->mmap_buf + (thd_opt->tpacket_req.tp_frame_size * i));
            data = (uint8_t*)hdr + TPACKET_ALIGN(TPACKET2_HDRLEN);

            memcpy(data, thd_opt->tx_buffer, thd_opt->frame_size);
            hdr->tp_len = thd_opt->frame_size;
            hdr->tp_status = TP_STATUS_SEND_REQUEST;

        }

        send_ret = sendto(sock_fd, NULL, 0, 0, NULL, 0);
        if (send_ret == -1) {
            perror("sendto error");
            exit(EXIT_FAILURE);
        }

        thd_opt->tx_pkts  += thd_opt->tpacket_req.tp_frame_nr;
        thd_opt->tx_bytes += send_ret;

    }

    return NULL;

}
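One detail worth flagging in packet_tx_mmap() above: the fill loop overwrites every frame slot unconditionally, without checking whether the kernel has released the frame back to user space. A safer fill loop only claims frames marked TP_STATUS_AVAILABLE. This is a minimal sketch of that idea (my illustration, not the original code; the function name fill_tx_ring is made up), operating on a plain buffer laid out like the TX ring:

```c
#include <linux/if_packet.h>
#include <stdint.h>
#include <string.h>

/* Fill only the frames the kernel has handed back to user space
 * (tp_status == TP_STATUS_AVAILABLE). Returns the number of frames
 * claimed for sending. */
static size_t fill_tx_ring(uint8_t *ring, size_t frame_sz, size_t frame_nr,
                           const uint8_t *pkt, uint16_t pkt_len)
{
    size_t filled = 0;
    for (size_t i = 0; i < frame_nr; i++) {
        struct tpacket2_hdr *hdr = (struct tpacket2_hdr *)(ring + i * frame_sz);
        if (hdr->tp_status != TP_STATUS_AVAILABLE)
            continue;                     /* kernel still owns this frame */
        uint8_t *data = (uint8_t *)hdr + TPACKET_ALIGN(TPACKET2_HDRLEN);
        memcpy(data, pkt, pkt_len);
        hdr->tp_len = pkt_len;
        __sync_synchronize();             /* order payload write before status */
        hdr->tp_status = TP_STATUS_SEND_REQUEST;
        filled++;
    }
    return filled;
}
```

The barrier mirrors the smp_wmb()/smp_rmb() pairing the kernel uses around tp_status; without it the compiler or CPU could make the status update visible before the payload.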

Note that the function below calls setup_socket() and not setup_socket_mmap():

void *packet_tx(void* thd_opt_p) {

    struct thd_opt *thd_opt = thd_opt_p;

    int32_t sock_fd = setup_socket(thd_opt_p); 

    if (sock_fd == EXIT_FAILURE) {
        printf("Can't create socket!\n");
        exit(EXIT_FAILURE);
    }

    while(1) {

        thd_opt->tx_bytes += sendto(sock_fd, thd_opt->tx_buffer,
                                    thd_opt->frame_size, 0,
                                    (struct sockaddr*)&thd_opt->bind_addr,
                                    sizeof(thd_opt->bind_addr));
        thd_opt->tx_pkts += 1;

    }

    return NULL;

}

The only difference in the socket setup functions is pasted below, but essentially it's the setup required for a PACKET_RX_RING or PACKET_TX_RING:

// Set the TPACKET version, v2 for Tx and v3 for Rx
// (v2 supports packet level send(), v3 supports block level read())
int32_t sock_pkt_ver = -1;

if(thd_opt->sk_mode == SKT_TX) {
    static const int32_t sock_ver = TPACKET_V2;
    sock_pkt_ver = setsockopt(sock_fd, SOL_PACKET, PACKET_VERSION, &sock_ver, sizeof(sock_ver));
} else {
    static const int32_t sock_ver = TPACKET_V3;
    sock_pkt_ver = setsockopt(sock_fd, SOL_PACKET, PACKET_VERSION, &sock_ver, sizeof(sock_ver));
}

if (sock_pkt_ver < 0) {
    perror("Can't set socket packet version");
    return EXIT_FAILURE;
}


memset(&thd_opt->tpacket_req, 0, sizeof(struct tpacket_req));
memset(&thd_opt->tpacket_req3, 0, sizeof(struct tpacket_req3));

//thd_opt->block_sz = 4096; // These are set elsewhere
//thd_opt->block_nr = 256;
//thd_opt->block_frame_sz = 4096;

int32_t sock_mmap_ring = -1;
if (thd_opt->sk_mode == SKT_TX) {

    thd_opt->tpacket_req.tp_block_size = thd_opt->block_sz;
    thd_opt->tpacket_req.tp_frame_size = thd_opt->block_sz;
    thd_opt->tpacket_req.tp_block_nr = thd_opt->block_nr;
    // Allocate per-frame blocks in Tx mode (TPACKET_V2)
    thd_opt->tpacket_req.tp_frame_nr = thd_opt->block_nr;

    sock_mmap_ring = setsockopt(sock_fd, SOL_PACKET , PACKET_TX_RING , (void*)&thd_opt->tpacket_req , sizeof(struct tpacket_req));

} else {

    thd_opt->tpacket_req3.tp_block_size = thd_opt->block_sz;
    thd_opt->tpacket_req3.tp_frame_size = thd_opt->block_frame_sz;
    thd_opt->tpacket_req3.tp_block_nr = thd_opt->block_nr;
    thd_opt->tpacket_req3.tp_frame_nr = (thd_opt->block_sz * thd_opt->block_nr) / thd_opt->block_frame_sz;
    thd_opt->tpacket_req3.tp_retire_blk_tov   = 1;
    thd_opt->tpacket_req3.tp_feature_req_word = 0;

    sock_mmap_ring = setsockopt(sock_fd, SOL_PACKET , PACKET_RX_RING , (void*)&thd_opt->tpacket_req3 , sizeof(thd_opt->tpacket_req3));
}

if (sock_mmap_ring == -1) {
    perror("Can't enable Tx/Rx ring for socket");
    return EXIT_FAILURE;
}


thd_opt->mmap_buf = NULL;
thd_opt->rd = NULL;

if (thd_opt->sk_mode == SKT_TX) {

    thd_opt->mmap_buf = mmap(NULL, (thd_opt->block_sz * thd_opt->block_nr), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED | MAP_POPULATE, sock_fd, 0);

    if (thd_opt->mmap_buf == MAP_FAILED) {
        perror("mmap failed");
        return EXIT_FAILURE;
    }


} else {

    thd_opt->mmap_buf = mmap(NULL, (thd_opt->block_sz * thd_opt->block_nr), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED | MAP_POPULATE, sock_fd, 0);

    if (thd_opt->mmap_buf == MAP_FAILED) {
        perror("mmap failed");
        return EXIT_FAILURE;
    }

    // Per-block iovecs in Rx mode (TPACKET_V3)
    thd_opt->rd = (struct iovec*)calloc(thd_opt->tpacket_req3.tp_block_nr, sizeof(struct iovec));

    for (uint16_t i = 0; i < thd_opt->tpacket_req3.tp_block_nr; ++i) {
        thd_opt->rd[i].iov_base = thd_opt->mmap_buf + (i * thd_opt->tpacket_req3.tp_block_size);
        thd_opt->rd[i].iov_len  = thd_opt->tpacket_req3.tp_block_size;
    }


}
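For reference, the geometry values passed to PACKET_TX_RING/PACKET_RX_RING have to satisfy a few constraints from the kernel's packet_mmap documentation: tp_block_size must be a multiple of the page size, tp_frame_size must be a multiple of TPACKET_ALIGNMENT (16), and tp_frame_nr must equal (tp_block_size / tp_frame_size) * tp_block_nr. A small validation sketch (my summary; the helper name ring_geometry_ok is made up):

```c
#include <stdint.h>

#ifndef TPACKET_ALIGNMENT
#define TPACKET_ALIGNMENT 16   /* from <linux/if_packet.h> */
#endif

/* Returns 1 if the ring geometry satisfies the documented constraints. */
static int ring_geometry_ok(uint32_t block_sz, uint32_t block_nr,
                            uint32_t frame_sz, uint32_t frame_nr,
                            uint32_t page_sz)
{
    if (block_sz == 0 || block_sz % page_sz != 0)
        return 0;   /* blocks must be page-sized multiples */
    if (frame_sz == 0 || frame_sz % TPACKET_ALIGNMENT != 0 || frame_sz > block_sz)
        return 0;   /* frames must be aligned and fit in a block */
    return frame_nr == (block_sz / frame_sz) * block_nr;
}
```

With the Tx values above (block_sz = frame_sz = 4096, block_nr = frame_nr = 256) the check passes, which matches the "one frame per block" layout the question describes.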

Update 1: Results against physical interface(s)

It was mentioned that one reason I might not be seeing a performance difference when using PACKET_MMAP is that I was sending traffic to the loopback interface (which, for one thing, doesn't have a qdisc). Since either of the packet_tx_mmap() or packet_tx() routines can generate more than 10Gbps and I only have 10Gbps interfaces at my disposal, I have bonded two together. These are the results, which show pretty much the same as above: there is minimal speed difference between the two functions:

packet_tx() to 20G bond0

  • 1 thread: Average 10.77Gbps~ / 889kfps~
  • 2 threads: Average 19.19Gbps~ / 1.58Mfps~
  • 3 threads: Average 19.67Gbps~ / 1.62Mfps~ (this is as fast as the bond will go)

packet_tx_mmap() to 20G bond0:

  • 1 thread: Average 11.08Gbps~ / 913kfps~
  • 2 threads: Average 19.0Gbps~ / 1.57Mfps~
  • 3 threads: Average 19.66Gbps~ / 1.62Mfps~ (this is as fast as the bond will go)

This was with frames 1514 bytes in size (to keep it the same as the original loopback tests above).

In all of the above tests the number of soft IRQs was roughly the same (measured using this script). With one thread running packet_tx() there were circa 40k interrupts per second on a CPU core. With 2 and 3 threads running there were 40k on 2 and 3 cores respectively. The results when using packet_tx_mmap() were the same: circa 40k soft IRQs for a single thread on one CPU core, and 40k per core when running 2 and 3 threads.

Update 2: Full Source Code

I have uploaded the full source code now. I'm still writing this application so it probably has plenty of flaws, but they are outside the scope of this question: https://github.com/jwbensley/EtherateMT

jwbensley
  • How fast is your network? How large is your framesize? Are you maybe simply saturating your link? Have you checked the actual (autonegotiated) bitrate? – maxy Apr 07 '17 at 20:02
  • The frame size is 1514 octets with headers, I am sending traffic to the loopback interface lo as shown in the output. I am sending traffic to the loopback interface to eliminate the NIC as a source of issues. – jwbensley Apr 07 '17 at 20:22
  • My understanding is that because the `packet_tx_mmap` function should be sharing a buffer with the kernel meaning multiple packets are copied from userland to kernelland in a single `sendto()` syscall, so sending traffic to the loopback interface means we are testing that aspect specifically and not worrying about DMA'ing the packets to a NIC which would be the same process for both `packet_tx` and `packet_tx_mmap` because that is further down the kernel stack. – jwbensley Apr 07 '17 at 20:31
  • For `send_ret = sendto(sock_fd, NULL, 0, 0, NULL, 0);` in the `packet_tx_mmap()` function, I have changed the flag from `0` to `MSG_DONTWAIT` and it made no difference. `MSG_DONTWAIT` should be non-blocking as you say, but I guess the reason I saw no performance change is because on the next iteration of the loop calling `sendto()` again will mean no more data is actually sent out of the NIC unless the NIC queue has space surely? If we bulk transfer data and fill the NIC queue non-blocking it doesn't matter than the `sendto()` call is non-blocking, if the queue is full? ... – jwbensley Apr 12 '17 at 09:24
  • ...So it seems to me that with and without the `MSG_DONTWAIT` flag I am filling the NIC queue maybe? Or am I misunderstanding? – jwbensley Apr 12 '17 at 09:24
  • *one less copy action for each frame* - seems to me that the copy action is simply moved from the kernel to the userspace, as you do the `memcpy` in your program. – kfx Apr 13 '17 at 12:29
  • You've got me interested in this; I spent a little time poking around some more. If we want to use MSG_DONTWAIT, then we need to understand how to synchronize access to the shared ring buffer between user and kernel space. In the kernel, setting and getting the packet status uses write and read barriers, respectively (see [this](http://lxr.free-electrons.com/source/net/packet/af_packet.c#L397)), so we need to be similarly careful in user space to do this. Barriers used to be defined in , but no longer. I am using [liburcu](https://lwn.net/Articles/573436/). All for today. – JimD. Apr 15 '17 at 14:06
  • I've been taking a very different approach (which likely stems from my vastly inferior knowledge of the Linux kernel and how to debug it). I have just started tracing through the code for the `socket()` and `sendto()` calls to see where a call to `sendto` forks off for a socket created with an `mmap()`'ed TX ring and a "normal" packet buffer socket: https://github.com/jwbensley/EtherateMT/wiki/Linux-Kernel-socket()-&-sendto()-Tracing – jwbensley Apr 15 '17 at 16:50
  • @JimD. thanks for all your feedback, I'll read through the various links. I'm away for a few days so I might not have time until next week. – jwbensley Apr 15 '17 at 16:53
  • Looking in af_packet.h there is a proto definition where the sendmsg proto def points to packet_sendmsg() and then tpacket_snd(); http://lxr.free-electrons.com/source/net/packet/af_packet.c#L4373 - So I think I need to find out when/where in the Kernel source the socket proto def for sendmsg is set to the proto def I have referenced, I presume my socket is currently pointing to the raw socket proto def here: http://lxr.free-electrons.com/source/net/ipv4/raw.c#L939 – jwbensley Apr 15 '17 at 20:15
  • @JimD. I've updated the question with some performance statistics using a "real" interface and not the loopback interface. I will get the full code on-line next week when I'm back home (currently traveling). It looks to me like the code in af_packet.h either isn't being used, or the normal packet path through a raw socket must have been improved over the years, to the point that it's roughly as fast as the PACKET_MMAP method. – jwbensley Apr 16 '17 at 10:13
  • Thanks! I'll give it a read. I have uploaded the code (so far) to GitHub and updated the questions with a link. – jwbensley Apr 18 '17 at 20:25
  • I see that you use a `sendto` function but I don't see where packets are sent. Generally I have a problem with understanding `PACKET_MMAP`. Especially, does `PACKET_MMAP` make it possible to send a packet by `TCP/UDP` in "normal" way (like with using a common socket `AF_INET`, `SOCK_STREAM`)? – Gilgamesz Jun 07 '18 at 06:57
  • @Gilgamesz You should ask a separate question for this - but yes you can send TCP/UDP by creating the socket as SOCK_DGRAM, I used SOCK_RAW. `sendto()` is a system call in Linux which will eventually call `tpacket_snd()` in af_packet.c. I have traced the path of these calls from a userland program into the Kernel here: https://github.com/jwbensley/EtherateMT/wiki/EtherateMT-Transmit-Overview – jwbensley Jun 09 '18 at 20:43
  • @Gilgamesz This is a deeper dive (but really you just need to dive in and read the Kernel source for your self or ask a new question on SO): https://github.com/jwbensley/EtherateMT/wiki/Linux-Kernel-tracing-for-sendto()-using-AF_PACKET,-PACKET_MMAP-and-PACKET_FANOUT – jwbensley Jun 09 '18 at 20:44
  • @jwbensley, thanks for your response. Indeed, I was able to send UDP packet with (AF_PACKET, SOCK_DGRAM). What are your observations when it comes to a performance? – Gilgamesz Jun 10 '18 at 10:23
  • @Gilgamesz - this is really a separate question but I get a steady 1Mpps per CPU core. – jwbensley Jun 17 '18 at 16:59

1 Answer


Many interfaces to the Linux kernel are not well documented. Or even if they seem well documented, they can be pretty complex, and that can make it hard to understand what the functional or, often even harder, the nonfunctional properties of the interface are.

For this reason, my advice to anyone wanting a solid understanding of kernel APIs, or needing to create high performance applications using them, is that they need to be able to engage with kernel code to be successful.

In this case the questioner wants to understand the performance characteristics of sending raw frames through a shared memory interface (packet mmap) to the kernel.

The linux documentation is here. It has a stale link to a "how to," which can now be found here and includes a copy of packet_mmap.c (I have a slightly different version available here).

The documentation is largely geared towards reading, which is the typical use case for packet mmap: efficiently reading raw frames from an interface, e.g. obtaining a packet capture from a high speed interface with little or no loss.

The OP however is interested in high performance writing, which is a much less common use case, but potentially useful for a traffic generator/simulator which appears to be what the OP wants to do with it. Thankfully, the "how to" is all about writing frames.

Even so, there is very little information provided about how this actually works, and nothing of obvious help to answer the OP's question about why using packet mmap doesn't seem to be faster than not using it and instead sending one frame at a time.

Thankfully the kernel source is open source and well indexed, so we can turn to the source to help us get the answer to the question.

In order to find the relevant kernel code there are several keywords you could search for, but PACKET_TX_RING stands out as a socket option unique to this feature. Searching on the interwebs for "PACKET_TX_RING linux cross reference" turns up a small number of references, including af_packet.c, which with a little inspection appears to be the implementation of all the AF_PACKET functionality, including packet mmap.

Looking through af_packet.c, it appears that the core of the work for transmitting with packet mmap takes place in tpacket_snd(). But is this correct? How can we tell if this has anything to do with what we think it does?

A very powerful tool for getting information like this out of the kernel is SystemTap. (Using this requires installing debugging symbols for your kernel. I happen to be using Ubuntu, and this is a recipe for getting SystemTap working on Ubuntu.)

Once you have SystemTap working, you can use it in conjunction with packet_mmap.c to see if tpacket_snd() is even invoked, by installing a probe on the kernel function tpacket_snd and then running packet_mmap to send a frame via a shared TX ring:

$ sudo stap -e 'probe kernel.function("tpacket_snd") { printf("W00T!\n"); }' &
[1] 19961
$ sudo ./packet_mmap -c 1 eth0
[...]
STARTING TEST:
data offset = 32 bytes
start fill() thread
send 1 packets (+150 bytes)
end of task fill()
Loop until queue empty (0)
END (number of error:0)
W00T!
W00T!

W00T! We are on to something; tpacket_snd is actually being called. But our victory will be short-lived. If we continue to try to get more information out of a stock kernel build, SystemTap will complain that it can't find the variables we want to inspect, and function arguments will print out as ? or ERROR. This is because the kernel is compiled with optimization and all of the functionality for AF_PACKET is defined in the single translation unit af_packet.c; many of the functions are inlined by the compiler, effectively losing local variables and arguments.

In order to pry more information out of af_packet.c, we are going to have to build a version of the kernel where af_packet.c is built without optimization. Look here for some guidance. I'll wait.

OK, hopefully that wasn't too hard and you have successfully booted a kernel that SystemTap can get lots of good information from. Keep in mind that this kernel version is just to help us figure out how packet mmap is working. We can't get any direct performance information from this kernel because af_packet.c was built without optimization. If it turns out that we need to get information on how the optimized version would behave, we can build another kernel with af_packet.c compiled with optimization, but with some instrumentation code added that exposes information via variables that won't get optimized out, so that SystemTap can see them.

So let's use it to get some information. Take a look at status.stp:

# This is specific to net/packet/af_packet.c 3.13.0-116

function print_ts() {
  ts = gettimeofday_us();
  printf("[%10d.%06d] ", ts/1000000, ts%1000000);
}

#  325 static void __packet_set_status(struct packet_sock *po, void *frame, int status)
#  326 {
#  327  union tpacket_uhdr h;
#  328 
#  329  h.raw = frame;
#  330  switch (po->tp_version) {
#  331  case TPACKET_V1:
#  332      h.h1->tp_status = status;
#  333      flush_dcache_page(pgv_to_page(&h.h1->tp_status));
#  334      break;
#  335  case TPACKET_V2:
#  336      h.h2->tp_status = status;
#  337      flush_dcache_page(pgv_to_page(&h.h2->tp_status));
#  338      break;
#  339  case TPACKET_V3:
#  340  default:
#  341      WARN(1, "TPACKET version not supported.\n");
#  342      BUG();
#  343  }
#  344 
#  345  smp_wmb();
#  346 }

probe kernel.statement("__packet_set_status@net/packet/af_packet.c:334") {
  print_ts();
  printf("SET(V1): %d (0x%.16x)\n", $status, $frame);
}

probe kernel.statement("__packet_set_status@net/packet/af_packet.c:338") {
  print_ts();
  printf("SET(V2): %d\n", $status);
}

#  348 static int __packet_get_status(struct packet_sock *po, void *frame)
#  349 {
#  350  union tpacket_uhdr h;
#  351 
#  352  smp_rmb();
#  353 
#  354  h.raw = frame;
#  355  switch (po->tp_version) {
#  356  case TPACKET_V1:
#  357      flush_dcache_page(pgv_to_page(&h.h1->tp_status));
#  358      return h.h1->tp_status;
#  359  case TPACKET_V2:
#  360      flush_dcache_page(pgv_to_page(&h.h2->tp_status));
#  361      return h.h2->tp_status;
#  362  case TPACKET_V3:
#  363  default:
#  364      WARN(1, "TPACKET version not supported.\n");
#  365      BUG();
#  366      return 0;
#  367  }
#  368 }

probe kernel.statement("__packet_get_status@net/packet/af_packet.c:358") { 
  print_ts();
  printf("GET(V1): %d (0x%.16x)\n", $h->h1->tp_status, $frame); 
}

probe kernel.statement("__packet_get_status@net/packet/af_packet.c:361") { 
  print_ts();
  printf("GET(V2): %d\n", $h->h2->tp_status); 
}

# 2088 static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
# 2089 {
# [...]
# 2136  do {
# 2137      ph = packet_current_frame(po, &po->tx_ring,
# 2138              TP_STATUS_SEND_REQUEST);
# 2139 
# 2140      if (unlikely(ph == NULL)) {
# 2141          schedule();
# 2142          continue;
# 2143      }
# 2144 
# 2145      status = TP_STATUS_SEND_REQUEST;
# 2146      hlen = LL_RESERVED_SPACE(dev);
# 2147      tlen = dev->needed_tailroom;
# 2148      skb = sock_alloc_send_skb(&po->sk,
# 2149              hlen + tlen + sizeof(struct sockaddr_ll),
# 2150              0, &err);
# 2151 
# 2152      if (unlikely(skb == NULL))
# 2153          goto out_status;
# 2154 
# 2155      tp_len = tpacket_fill_skb(po, skb, ph, dev, size_max, proto,
# 2156                    addr, hlen);
# [...]
# 2176      skb->destructor = tpacket_destruct_skb;
# 2177      __packet_set_status(po, ph, TP_STATUS_SENDING);
# 2178      atomic_inc(&po->tx_ring.pending);
# 2179 
# 2180      status = TP_STATUS_SEND_REQUEST;
# 2181      err = dev_queue_xmit(skb);
# 2182      if (unlikely(err > 0)) {
# [...]
# 2195      }
# 2196      packet_increment_head(&po->tx_ring);
# 2197      len_sum += tp_len;
# 2198  } while (likely((ph != NULL) ||
# 2199          ((!(msg->msg_flags & MSG_DONTWAIT)) &&
# 2200           (atomic_read(&po->tx_ring.pending))))
# 2201      );
# 2202 
# [...]
# 2213  return err;
# 2214 }

probe kernel.function("tpacket_snd") {
  print_ts();
  printf("tpacket_snd: args(%s)\n", $$parms);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2140") {
  print_ts();
  printf("tpacket_snd:2140: current frame ph = 0x%.16x\n", $ph);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2141") {
  print_ts();
  printf("tpacket_snd:2141: (ph==NULL) --> schedule()\n");
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2142") {
  print_ts();
  printf("tpacket_snd:2142: flags 0x%x, pending %d\n", 
     $msg->msg_flags, $po->tx_ring->pending->counter);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2197") {
  print_ts();
  printf("tpacket_snd:2197: flags 0x%x, pending %d\n", 
     $msg->msg_flags, $po->tx_ring->pending->counter);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2213") {
  print_ts();
  printf("tpacket_snd: return(%d)\n", $err);
}

# 1946 static void tpacket_destruct_skb(struct sk_buff *skb)
# 1947 {
# 1948  struct packet_sock *po = pkt_sk(skb->sk);
# 1949  void *ph;
# 1950 
# 1951  if (likely(po->tx_ring.pg_vec)) {
# 1952      __u32 ts;
# 1953 
# 1954      ph = skb_shinfo(skb)->destructor_arg;
# 1955      BUG_ON(atomic_read(&po->tx_ring.pending) == 0);
# 1956      atomic_dec(&po->tx_ring.pending);
# 1957 
# 1958      ts = __packet_set_timestamp(po, ph, skb);
# 1959      __packet_set_status(po, ph, TP_STATUS_AVAILABLE | ts);
# 1960  }
# 1961 
# 1962  sock_wfree(skb);
# 1963 }

probe kernel.statement("tpacket_destruct_skb@net/packet/af_packet.c:1959") {
  print_ts();
  printf("tpacket_destruct_skb:1959: ph = 0x%.16x, ts = 0x%x, pending %d\n",
     $ph, $ts, $po->tx_ring->pending->counter);
}

This defines a function (print_ts, which prints out unix epoch time with microsecond resolution) and a number of probes.

First we define probes to print out information when packets in the tx_ring have their status set or read. Next we define probes for the call and return of tpacket_snd and at points within the do {...} while (...) loop processing the packets in the tx_ring. Finally we add a probe to the skb destructor.

We can start the SystemTap script with sudo stap status.stp. Then run sudo packet_mmap -c 2 <interface> to send 2 frames through the interface. Here is the output I got from the SystemTap script:

[1492581245.839850] tpacket_snd: args(po=0xffff88016720ee38 msg=0x14)
[1492581245.839865] GET(V1): 1 (0xffff880241202000)
[1492581245.839873] tpacket_snd:2140: current frame ph = 0xffff880241202000
[1492581245.839887] SET(V1): 2 (0xffff880241202000)
[1492581245.839918] tpacket_snd:2197: flags 0x40, pending 1
[1492581245.839923] GET(V1): 1 (0xffff88013499c000)
[1492581245.839929] tpacket_snd:2140: current frame ph = 0xffff88013499c000
[1492581245.839935] SET(V1): 2 (0xffff88013499c000)
[1492581245.839946] tpacket_snd:2197: flags 0x40, pending 2
[1492581245.839951] GET(V1): 0 (0xffff88013499e000)
[1492581245.839957] tpacket_snd:2140: current frame ph = 0x0000000000000000
[1492581245.839961] tpacket_snd:2141: (ph==NULL) --> schedule()
[1492581245.839977] tpacket_snd:2142: flags 0x40, pending 2
[1492581245.839984] tpacket_snd: return(300)
[1492581245.840077] tpacket_snd: args(po=0x0 msg=0x14)
[1492581245.840089] GET(V1): 0 (0xffff88013499e000)
[1492581245.840098] tpacket_snd:2140: current frame ph = 0x0000000000000000
[1492581245.840093] tpacket_destruct_skb:1959: ph = 0xffff880241202000, ts = 0x0, pending 1
[1492581245.840102] tpacket_snd:2141: (ph==NULL) --> schedule()
[1492581245.840104] SET(V1): 0 (0xffff880241202000)
[1492581245.840112] tpacket_snd:2142: flags 0x40, pending 1
[1492581245.840116] tpacket_destruct_skb:1959: ph = 0xffff88013499c000, ts = 0x0, pending 0
[1492581245.840119] tpacket_snd: return(0)
[1492581245.840123] SET(V1): 0 (0xffff88013499c000)

And here is the network capture:

(network capture of first run of packet_mmap)

There is a lot of useful information in the SystemTap output. We can see tpacket_snd get the status of the first frame in the ring (TP_STATUS_SEND_REQUEST is 1) and then set it to TP_STATUS_SENDING (2). It does the same with the second. The next frame has status TP_STATUS_AVAILABLE (0), which is not a send request, so it calls schedule() to yield and continues the loop. Since there are no more frames to send (ph==NULL) and non-blocking has been requested (msg->msg_flags has MSG_DONTWAIT set), the do {...} while (...) loop terminates, and tpacket_snd returns 300, the number of bytes queued for transmission.

Next, packet_mmap calls sendto again (via the "loop until queue empty" code), but there is no more data to send in the tx ring, and non-blocking is requested, so it immediately returns 0, as no data has been queued. Note that the frame it checked the status of is the same frame it checked last in the previous call --- it did not start with the first frame in the tx ring, it checked the head (which is not available in userland).

Asynchronously, the destructor is called, first on the first frame, setting the status of the frame to TP_STATUS_AVAILABLE and decrementing the pending count, and then on the second frame. Note that if non-blocking was not requested, the test at the end of the do {...} while (...) loop will wait until all of the pending packets have been transferred to the NIC (assuming it supports scattered data) before returning. You can watch this by running packet_mmap with the -t option for "threaded" which uses blocking I/O (until it gets to "loop until queue empty").
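Since the destructor releases frames asynchronously, a non-blocking sender needs some way to wait for ring space without spinning on sendto(). The packet_mmap documentation describes using poll() on the packet socket for POLLOUT for exactly this. A minimal sketch (the helper name wait_tx_ring is mine):

```c
#include <poll.h>

/* Wait up to timeout_ms for the kernel to release TX ring frames back to
 * TP_STATUS_AVAILABLE, i.e. for the socket to become writable.
 * Returns >0 if writable, 0 on timeout, -1 on error (poll() semantics). */
static int wait_tx_ring(int sock_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = sock_fd, .events = POLLOUT };
    return poll(&pfd, 1, timeout_ms);
}
```

A sender loop could then alternate: fill available frames, sendto() with MSG_DONTWAIT, and wait_tx_ring() when the ring is exhausted, instead of blocking inside the kernel's do/while loop.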

A couple of things to note. First, the timestamps on the SystemTap output are not increasing: it is not safe to infer temporal ordering from SystemTap output. Second, note that the timestamps on the network capture (done locally) are different. FWIW, the interface is a cheap 1G in a cheap tower computer.

So at this point, I think we more or less know how af_packet is processing the shared tx ring. What comes next is how the frames in the tx ring find their way to the network interface. It might be helpful to review this section (on how layer 2 transmission is handled) of an overview of the control flow in the linux networking kernel.

OK, so if you have a basic understanding of how layer 2 transmission is handled, it would seem like this packet mmap interface should be an enormous fire hose; load up a shared tx ring with packets, call sendto() with MSG_DONTWAIT, and then tpacket_snd will iterate through the tx queue creating skb's and enqueueing them onto the qdisc. Asynchronously, skb's will be dequeued from the qdisc and sent to the hardware tx ring. The skb's should be non-linear so they will reference the data in the tx ring rather than copy it, and a nice modern NIC should be able to handle scattered data and reference the data in the tx rings as well. Of course, any of these assumptions could be wrong, so let's try to dump a whole lot of hurt on a qdisc with this fire hose.

But first, a not commonly understood fact about how qdiscs work. They hold a bounded amount of data (generally counted in number of frames, but in some cases it could be measured in bytes) and if you try to enqueue a frame to a full qdisc, the frame will generally be dropped (depending on what the enqueuer decides to do). So I will give out the hint that my original hypothesis was that the OP was using packet mmap to blast frames into a qdisc so fast that many were being dropped. But don't hold too fast to that idea; it takes you in a direction, but always keep an open mind. Let's give it a try to find out what happens.
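The bounded-queue behaviour just described can be captured in a few lines. This is a toy model of tail drop, entirely my own sketch and not kernel code, just to make the expected failure mode concrete: anything offered to a full queue is counted as dropped:

```c
#include <stddef.h>

/* Toy model of a bounded FIFO qdisc, like `pfifo limit 50`. */
struct toy_qdisc {
    size_t limit;    /* max queued frames */
    size_t queued;   /* frames currently in the queue */
    size_t dropped;  /* frames tail-dropped so far */
};

/* Returns 1 if the frame was queued, 0 if it was tail-dropped. */
static int toy_enqueue(struct toy_qdisc *q)
{
    if (q->queued >= q->limit) {
        q->dropped++;
        return 0;
    }
    q->queued++;
    return 1;
}
```

If the hypothesis were right, blasting thousands of frames into a 50-frame qdisc faster than it drains should show a large dropped count in the `tc -s` statistics below.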

The first problem in trying this out is that the default qdisc pfifo_fast doesn't keep statistics. So let's replace it with the qdisc pfifo, which does. By default pfifo limits the queue to TXQUEUELEN frames (which generally defaults to 1000). But since we want to demonstrate overwhelming a qdisc, let's explicitly set it to 50:

$ sudo tc qdisc add dev eth0 root pfifo limit 50
$ tc -s -d qdisc show dev eth0
qdisc pfifo 8004: root refcnt 2 limit 50p
 Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 

Let's also measure how long it takes to process the frames in tpacket_snd with the SystemTap script call-return.stp:

# This is specific to net/packet/af_packet.c 3.13.0-116

function print_ts() {
  ts = gettimeofday_us();
  printf("[%10d.%06d] ", ts/1000000, ts%1000000);
}

# 2088 static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
# 2089 {
# [...]
# 2213  return err;
# 2214 }

probe kernel.function("tpacket_snd") {
  print_ts();
  printf("tpacket_snd: args(%s)\n", $$parms);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2213") {
  print_ts();
  printf("tpacket_snd: return(%d)\n", $err);
}

Start the SystemTap script with sudo stap call-return.stp, and then let's blast 8096 1500-byte frames into that qdisc with a meager 50-frame capacity:

$ sudo ./packet_mmap -c 8096 -s 1500 eth0
[...]
STARTING TEST:
data offset = 32 bytes
start fill() thread
send 8096 packets (+12144000 bytes)
end of task fill()
Loop until queue empty (0)
END (number of error:0)

So let's check how many packets were dropped by the qdisc:

$ tc -s -d qdisc show dev eth0
qdisc pfifo 8004: root refcnt 2 limit 50p
 Sent 25755333 bytes 8606 pkt (dropped 1, overlimits 0 requeues 265) 
 backlog 0b 0p requeues 265 

WAT? Dropped only one of 8096 frames dumped onto a 50-frame qdisc? Let's check the SystemTap output:

[1492603552.938414] tpacket_snd: args(po=0xffff8801673ba338 msg=0x14)
[1492603553.036601] tpacket_snd: return(12144000)
[1492603553.036706] tpacket_snd: args(po=0x0 msg=0x14)
[1492603553.036716] tpacket_snd: return(0)

WAT? It took nearly 100ms to process 8096 frames in tpacket_snd? Let's check how long that would actually take to transmit: 8096 frames at 1500 bytes/frame at 1 gigabit/s ≈ 97ms. WAT? It smells like something is blocking.

Let's take a closer look at tpacket_snd. Groan:

skb = sock_alloc_send_skb(&po->sk,
                 hlen + tlen + sizeof(struct sockaddr_ll),
                 0, &err);

That 0 looks pretty innocuous, but it is actually the noblock argument. It should be msg->msg_flags & MSG_DONTWAIT (it turns out this was fixed in 4.1). What is happening here is that the size of the qdisc is not the only limiting resource. If allocating space for the skb would exceed the socket's sndbuf limit, then this call will either block waiting for skb's to be freed up, or return -EAGAIN to a non-blocking caller. In the 4.1 fix, if non-blocking is requested it will return the number of bytes written if non-zero, otherwise -EAGAIN, which almost seems like someone doesn't want you to figure out how to use this (e.g. you fill up a tx ring with 80MB of data, call sendto with MSG_DONTWAIT, and you get back a result saying you sent 150KB rather than EWOULDBLOCK).

So if you are running a kernel prior to 4.1 (I believe the OP is running >4.1 and is not affected by this bug), you will need to patch af_packet.c and build a new kernel, or upgrade to kernel 4.1 or better.

I have now booted a patched version of my kernel, since the machine I am using is running 3.13. While we won't block if the sndbuf is full, we will still return with -EAGAIN. I made some changes to packet_mmap.c to increase the default size of the sndbuf and to use SO_SNDBUFFORCE to override the system maximum per socket if necessary (it appears to need about 750 bytes + the frame size for each frame). I also made some additions to call-return.stp to log the sndbuf max size (sk_sndbuf), the amount used (sk_wmem_alloc), any error returned by sock_alloc_send_skb, and any error returned from dev_queue_xmit on enqueuing the skb to the qdisc. Here is the new version:

# This is specific to net/packet/af_packet.c 3.13.0-116

function print_ts() {
  ts = gettimeofday_us();
  printf("[%10d.%06d] ", ts/1000000, ts%1000000);
}

# 2088 static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
# 2089 {
# [...]
# 2133  if (size_max > dev->mtu + reserve + VLAN_HLEN)
# 2134      size_max = dev->mtu + reserve + VLAN_HLEN;
# 2135 
# 2136  do {
# [...]
# 2148      skb = sock_alloc_send_skb(&po->sk,
# 2149              hlen + tlen + sizeof(struct sockaddr_ll),
# 2150              msg->msg_flags & MSG_DONTWAIT, &err);
# 2151 
# 2152      if (unlikely(skb == NULL))
# 2153          goto out_status;
# [...]
# 2181      err = dev_queue_xmit(skb);
# 2182      if (unlikely(err > 0)) {
# 2183          err = net_xmit_errno(err);
# 2184          if (err && __packet_get_status(po, ph) ==
# 2185                 TP_STATUS_AVAILABLE) {
# 2186              /* skb was destructed already */
# 2187              skb = NULL;
# 2188              goto out_status;
# 2189          }
# 2190          /*
# 2191           * skb was dropped but not destructed yet;
# 2192           * let's treat it like congestion or err < 0
# 2193           */
# 2194          err = 0;
# 2195      }
# 2196      packet_increment_head(&po->tx_ring);
# 2197      len_sum += tp_len;
# 2198  } while (likely((ph != NULL) ||
# 2199          ((!(msg->msg_flags & MSG_DONTWAIT)) &&
# 2200           (atomic_read(&po->tx_ring.pending))))
# 2201      );
# [...]
# 2213  return err;
# 2214 }

probe kernel.function("tpacket_snd") {
  print_ts();
  printf("tpacket_snd: args(%s)\n", $$parms);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2133") {
  print_ts();
  printf("tpacket_snd:2133: sk_sndbuf =  %d sk_wmem_alloc = %d\n", 
     $po->sk->sk_sndbuf, $po->sk->sk_wmem_alloc->counter);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2153") {
  print_ts();
  printf("tpacket_snd:2153: sock_alloc_send_skb err = %d, sk_sndbuf =  %d sk_wmem_alloc = %d\n", 
     $err, $po->sk->sk_sndbuf, $po->sk->sk_wmem_alloc->counter);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2182") {
  if ($err != 0) {
    print_ts();
    printf("tpacket_snd:2182: dev_queue_xmit err = %d\n", $err);
  }
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2187") {
  print_ts();
  printf("tpacket_snd:2187: destructed: net_xmit_errno = %d\n", $err);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2194") {
  print_ts();
  printf("tpacket_snd:2194: *NOT* destructed: net_xmit_errno = %d\n", $err);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2213") {
  print_ts();
  printf("tpacket_snd: return(%d) sk_sndbuf =  %d sk_wmem_alloc = %d\n", 
     $err, $po->sk->sk_sndbuf, $po->sk->sk_wmem_alloc->counter);
}

Let's try again:

$ sudo tc qdisc add dev eth0 root pfifo limit 50
$ tc -s -d qdisc show dev eth0
qdisc pfifo 8001: root refcnt 2 limit 50p
 Sent 2154 bytes 21 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
$ sudo ./packet_mmap -c 200 -s 1500 eth0
[...]
c_sndbuf_sz:       1228800
[...]
STARTING TEST:
data offset = 32 bytes
send buff size = 1228800
got buff size = 425984
buff size smaller than desired, trying to force...
got buff size = 2457600
start fill() thread
send: No buffer space available
end of task fill()
send: No buffer space available
Loop until queue empty (-1)
[repeated another 17 times]
send 3 packets (+4500 bytes)
Loop until queue empty (4500)
Loop until queue empty (0)
END (number of error:0)
$  tc -s -d qdisc show dev eth0
qdisc pfifo 8001: root refcnt 2 limit 50p
 Sent 452850 bytes 335 pkt (dropped 19, overlimits 0 requeues 3) 
 backlog 0b 0p requeues 3 

And here is the SystemTap output:

[1492759330.907151] tpacket_snd: args(po=0xffff880393246c38 msg=0x14)
[1492759330.907162] tpacket_snd:2133: sk_sndbuf =  2457600 sk_wmem_alloc = 1
[1492759330.907491] tpacket_snd:2182: dev_queue_xmit err = 1
[1492759330.907494] tpacket_snd:2187: destructed: net_xmit_errno = -105
[1492759330.907500] tpacket_snd: return(-105) sk_sndbuf =  2457600 sk_wmem_alloc = 218639
[1492759330.907646] tpacket_snd: args(po=0x0 msg=0x14)
[1492759330.907653] tpacket_snd:2133: sk_sndbuf =  2457600 sk_wmem_alloc = 189337
[1492759330.907688] tpacket_snd:2182: dev_queue_xmit err = 1
[1492759330.907691] tpacket_snd:2187: destructed: net_xmit_errno = -105
[1492759330.907694] tpacket_snd: return(-105) sk_sndbuf =  2457600 sk_wmem_alloc = 189337
[repeated 17 times]
[1492759330.908541] tpacket_snd: args(po=0x0 msg=0x14)
[1492759330.908543] tpacket_snd:2133: sk_sndbuf =  2457600 sk_wmem_alloc = 189337
[1492759330.908554] tpacket_snd: return(4500) sk_sndbuf =  2457600 sk_wmem_alloc = 196099
[1492759330.908570] tpacket_snd: args(po=0x0 msg=0x14)
[1492759330.908572] tpacket_snd:2133: sk_sndbuf =  2457600 sk_wmem_alloc = 196099
[1492759330.908576] tpacket_snd: return(0) sk_sndbuf =  2457600 sk_wmem_alloc = 196099

Now things are working as expected; we have fixed a bug causing us to block if the sndbuf limit is exceeded, and we have adjusted the sndbuf limit so that it should not be a constraint. Now we see that the frames from the tx ring are enqueued onto the qdisc until it is full, at which point we get returned ENOBUFS.

The next problem is how to efficiently keep publishing to the qdisc to keep the interface busy. Note that the implementation of packet_poll is useless in the case that we fill up the qdisc and get back ENOBUFS, because it just queries whether the head is TP_STATUS_AVAILABLE, which in this case will remain TP_STATUS_SEND_REQUEST until a subsequent call to sendto succeeds in queueing the frame to the qdisc. A simple expediency (updated in packet_mmap.c) is to loop on the sendto until success or an error other than ENOBUFS or EAGAIN.

Anyway, we know way more than enough to answer the OP's question now, even if we don't have a complete solution to efficiently keep the NIC from being starved.

From what we have learned, we know that when the OP calls sendto with a tx ring in blocking mode, tpacket_snd will start enqueuing skbs onto the qdisc until the sndbuf limit is exceeded (the default is generally quite small, about 213K, and further, I discovered that frame data referenced in the shared tx ring is counted towards this), at which point it will block (while still holding pg_vec_lock). As skb's free up, more frames will be enqueued, and maybe the sndbuf will be exceeded again and we will block again. Eventually, all the data will have been queued to the qdisc, but tpacket_snd will continue to block until all of the frames have been transmitted (you can't mark a frame in the tx ring as available until the NIC has received it, as an skb in the driver ring references a frame in the tx ring), all while still holding pg_vec_lock. At this point the NIC is starved, and any other socket writers have been blocked by the lock.

On the other hand, when the OP publishes one packet at a time, it will be handled by packet_snd, which will block if there is no room in the sndbuf, then enqueue the frame onto the qdisc and immediately return. It does not wait for the frame to be transmitted. As the qdisc is drained, additional frames can be enqueued. If the publisher can keep up, the NIC will never be starved.

Further, the OP is copying into the tx ring for every sendto call, and comparing that to passing a fixed frame buffer when not using a tx ring. You won't see a speedup from avoided copies that way (although that is not the only benefit of using the tx ring).

JimD.
  • Thanks for all your help so far, I am now building a kernel with the debugging symbols and an unoptimised version of `af_packet.c`. Whilst we wait, just some food for thought. I am not using the `MSG_DONTWAIT` flag in my application. I am not trying to use a non-blocking call, so the code snippet you have above with `sock_alloc_send_skb(x,x,0,x)` - even though we have found a bug with that third parameter not being passed correctly, it should be zero anyway right? – jwbensley Apr 19 '17 at 19:25
  • Also I have a 4.4.x Kernel, but as long as you have >= 3.14, if you look at the code I have put on Github, there is a socket option to bypass the QDISC layer and go straight to transmission. Once I have my debug kernel built I will add the same socket option to the `packet_mmap.c` test program to see its effect: `int bypass = 1;` `int ret = setsockopt(sock_fd, SOL_PACKET, PACKET_QDISC_BYPASS, &bypass, sizeof(bypass));` – jwbensley Apr 19 '17 at 19:27
  • My Kernel is still compiling but from a quick scan of `af_packet.c` I am expecting at L2695 `err = po->xmit(skb);` to point to `packet_direct_xmit()` instead of `dev_queue_xmit()`, see L3751 in `af_packet.c`. I am thinking that [this](http://lxr.free-electrons.com/source/net/packet/af_packet.c#L3751) points to [this](http://lxr.free-electrons.com/source/net/packet/af_packet.c#L250) which points to [this](http://lxr.free-electrons.com/source/include/linux/netdevice.h#L3970) which points to [this](http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/igb/igb_main.c#L2143). – jwbensley Apr 19 '17 at 20:55
  • @jwbensley _there is a socket option to bypass the QDISC layer and go straight to transmission_ Whether you publish to a qdisc or directly to the NIC's ring, the next problem is to manage how fast you do so (you have been unexpectedly relying on blocking for this thus far). I would start with a qdisc as it is easy to manage its size (`TXQUEUELEN`) and easy to see statistics such as drops if you are publishing too fast (at least if you are not using the default pfifo_fast). I think the driver ring size is dynamically adaptive, but you may be able to control its size. It will take some research. – JimD. Apr 20 '17 at 03:00
  • @jwbensley [This](https://www.coverfire.com/articles/queueing-in-the-linux-network-stack/) has some potentially interesting info about the driver queue. – JimD. Apr 20 '17 at 08:25
  • @jwbensley _even though we have found a bug with that third parameter not being passed correctly, it should be zero anyway right?_ I think you are running a version of the kernel with this fixed, and yes, I now understand that it should be zero if you want to block, because this blocks to wait for free space if the sndbuf limit is exceeded. – JimD. Apr 20 '17 at 09:56
  • @jwbensley I put a summary of why I think blocking tx ring is slower than blocking without tx ring at the end of this answer. I'm out of space (30k chars). The most efficient way to do this is probably non-blocking (as I had been pursuing in the answer). – JimD. Apr 21 '17 at 10:46
  • Thanks! - I'm still reading through all this and the links and trying to replicate it on my side etc. to get my head around all this. Systemtap is not working well for me, I have a 4.4.0 kernel so I think some things have changed, your "WOOT" example works but none of the `probe kernel.statement()` calls work. That is probably a separate StackOverflow post in itself, so I am just trying to read all the documentation you have linked and debug as best I can without. I'll come back soon. – jwbensley Apr 21 '17 at 19:51
  • @jwbensley the kernel.statement probes are linked to specific line numbers in the source code. Since you have a different version of af_packet.c than I do, you will need to adjust the line numbers, and perhaps the variable names as well. – JimD. Apr 22 '17 at 01:08
  • D - Yeah I have changed the line numbers but stap still wasn't happy. Maybe I haven't compiled the Kernel correctly; for that af_packet sub-module I compiled it with `-O0` but maybe I should have added `-g` also. Other stap statements are working so I am managing. I have just finished reading all the links you provided and replicating on a virtual machine etc, so I have given you the bounty as it is very well deserved; I would give more if I could, but 500 is the max. I just need more time now to make some changes to the code and test based on your findings. – jwbensley Apr 24 '17 at 13:00
  • @jwbensley Thanks for the bounty. There is a lot more to understand if you want to do what I think you want to do. StackOverflow is not necessarily the platform to collaborate to solve this. Although maybe the chatrooms? – JimD. Apr 24 '17 at 14:17
  • Sorry for the delay, I thought I had marked this answer as correct already! – jwbensley May 06 '17 at 07:43