
I am using DPDK 21.11 for my application. After a certain time, the API rte_eth_tx_burst stops sending any packets out.

Ethernet Controller X710 for 10GbE SFP+ 1572 drv=vfio-pci

    #define MAX_RETRY_COUNT_RTE_ETH_TX_BURST 3


    do {
        num_sent_pkt = rte_eth_tx_burst(eth_port_id, queue_id, &mbuf[mbuf_idx], pkt_count);
        pkt_count -= num_sent_pkt;
        retry_count++;
    } while (pkt_count && (retry_count != MAX_RETRY_COUNT_RTE_ETH_TX_BURST));

To debug, I tried using telemetry to print out the xstats; however, I do not see any errors.

--> /ethdev/xstats,1
{"/ethdev/xstats": {"rx_good_packets": 97727, "tx_good_packets": 157902622, "rx_good_bytes": 6459916, "tx_good_bytes": 229590348448, "rx_missed_errors": 0, "rx_errors": 0, "tx_errors": 0, "rx_mbuf_allocation_errors": 0, "rx_unicast_packets": 95827, "rx_multicast_packets": 1901, "rx_broadcast_packets": 0, "rx_dropped_packets": 0, "rx_unknown_protocol_packets": 97728, "rx_size_error_packets": 0, "tx_unicast_packets": 157902621, "tx_multicast_packets": 0, "tx_broadcast_packets": 1, "tx_dropped_packets": 0, "tx_link_down_dropped": 0, "rx_crc_errors": 0, "rx_illegal_byte_errors": 0, "rx_error_bytes": 0, "mac_local_errors": 0, "mac_remote_errors": 0, "rx_length_errors": 0, "tx_xon_packets": 0, "rx_xon_packets": 0, "tx_xoff_packets": 0, "rx_xoff_packets": 0, "rx_size_64_packets": 967, "rx_size_65_to_127_packets": 96697, "rx_size_128_to_255_packets": 0, "rx_size_256_to_511_packets": 64, "rx_size_512_to_1023_packets": 0, "rx_size_1024_to_1522_packets": 0, "rx_size_1523_to_max_packets": 0, "rx_undersized_errors": 0, "rx_oversize_errors": 0, "rx_mac_short_dropped": 0, "rx_fragmented_errors": 0, "rx_jabber_errors": 0, "tx_size_64_packets": 0, "tx_size_65_to_127_packets": 46, "tx_size_128_to_255_packets": 0, "tx_size_256_to_511_packets": 0, "tx_size_512_to_1023_packets": 0, "tx_size_1024_to_1522_packets": 157902576, "tx_size_1523_to_max_packets": 0, "rx_flow_director_atr_match_packets": 0, "rx_flow_director_sb_match_packets": 13, "tx_low_power_idle_status": 0, "rx_low_power_idle_status": 0, "tx_low_power_idle_count": 0, "rx_low_power_idle_count": 0, "rx_priority0_xon_packets": 0, "rx_priority1_xon_packets": 0, "rx_priority2_xon_packets": 0, "rx_priority3_xon_packets": 0, "rx_priority4_xon_packets": 0, "rx_priority5_xon_packets": 0, "rx_priority6_xon_packets": 0, "rx_priority7_xon_packets": 0, "rx_priority0_xoff_packets": 0, "rx_priority1_xoff_packets": 0, "rx_priority2_xoff_packets": 0, "rx_priority3_xoff_packets": 0, "rx_priority4_xoff_packets": 0, "rx_priority5_xoff_packets": 0, "rx_priority6_xoff_packets": 0, "rx_priority7_xoff_packets": 0, "tx_priority0_xon_packets": 0, "tx_priority1_xon_packets": 0, "tx_priority2_xon_packets": 0, "tx_priority3_xon_packets": 0, "tx_priority4_xon_packets": 0, "tx_priority5_xon_packets": 0, "tx_priority6_xon_packets": 0, "tx_priority7_xon_packets": 0, "tx_priority0_xoff_packets": 0, "tx_priority1_xoff_packets": 0, "tx_priority2_xoff_packets": 0, "tx_priority3_xoff_packets": 0, "tx_priority4_xoff_packets": 0, "tx_priority5_xoff_packets": 0, "tx_priority6_xoff_packets": 0, "tx_priority7_xoff_packets": 0, "tx_priority0_xon_to_xoff_packets": 0, "tx_priority1_xon_to_xoff_packets": 0, "tx_priority2_xon_to_xoff_packets": 0, "tx_priority3_xon_to_xoff_packets": 0, "tx_priority4_xon_to_xoff_packets": 0, "tx_priority5_xon_to_xoff_packets": 0, "tx_priority6_xon_to_xoff_packets": 0, "tx_priority7_xon_to_xoff_packets": 0}}

I have RX-DESC = 128 and TX-DESC = 512 configured.

I am assuming there is some descriptor leak. Is there a way to know whether the drop is due to no descriptors being available? Which counter should I check for that?
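
One way to check this directly (a sketch under the assumption that the PMD implements the descriptor-status callback, with port_id, queue_id and nb_tx_desc taken from the application's own setup) is the generic rte_eth_tx_descriptor_status() API:

    #include <stdio.h>
    #include <rte_ethdev.h>

    /* Sketch: probe a TX descriptor roughly half-way into the ring. If it stays
     * RTE_ETH_TX_DESC_FULL while rte_eth_tx_burst() keeps returning 0, the NIC
     * is not completing (DONE-marking) descriptors; if it reports DONE or
     * UNAVAIL, the ring has room and the stall is elsewhere. */
    static void
    check_tx_ring(uint16_t port_id, uint16_t queue_id, uint16_t nb_tx_desc)
    {
        int status = rte_eth_tx_descriptor_status(port_id, queue_id, nb_tx_desc / 2);

        if (status == RTE_ETH_TX_DESC_FULL)
            printf("TX desc still in use by the driver/NIC (not yet completed)\n");
        else if (status == RTE_ETH_TX_DESC_DONE)
            printf("TX desc completed and reusable\n");
        else if (status == RTE_ETH_TX_DESC_UNAVAIL)
            printf("TX desc reserved/unused by the driver\n");
        else
            printf("query failed: %d\n", status); /* e.g. -ENOTSUP or bad offset */
    }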

[More Info] Debugging refcnt led to a dead end. Following the code, it seems that the NIC does not set the DONE status on the descriptor. When rte_eth_tx_burst is called, it internally calls i40e_xmit_pkts -> i40e_xmit_cleanup.

When the issue occurs, the following condition fails, which leads to the NIC failing to send packets out:

    if ((txd[desc_to_clean_to].cmd_type_offset_bsz &
            rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
            rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
        PMD_TX_LOG(DEBUG, "TX descriptor %4u is not done "
               "(port=%d queue=%d)", desc_to_clean_to,
               txq->port_id, txq->queue_id);
        return -1;
    }

If I comment out the "return -1" (of course not the fix, and it will lead to other issues), traffic is stable for a very long time. I tracked all the mbufs from the start of traffic until the issue is hit; there is no problem in the mbufs, at least none that I could see.

I40E_TX_DESC_DTYPE_DESC_DONE is supposed to be set by the hardware on the descriptor. Is there any way I can see that code? Is it part of the X710 driver code?

I still suspect my own code, since the issue is present even after the NIC card was replaced. However, how could my code cause the NIC to not set the DONE status on the descriptor? Any suggestions would really be helpful.

[UPDATE] Found out that two cores were using the same TX queue ID to send packets:

  1. Data processing and TX core
  2. ARP req/response by Data RX core

Could this have led to some corruption? Found some related info here: http://mails.dpdk.org/archives/dev/2014-January/001077.html

After creating a separate queue for the ARP messages, the issue has not been seen (yet) for 2+ hours.
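
For completeness, a rough sketch of the kind of init-time change described above (names like port_id, nb_rxd, nb_txd, port_conf and mbuf_pool are placeholders for the application's own configuration, not code from this post):

    #include <rte_ethdev.h>
    #include <rte_mempool.h>

    #define DATA_TX_QUEUE 0  /* used only by the data-processing/TX core */
    #define ARP_TX_QUEUE  1  /* used only by the RX core for ARP replies */

    static int
    setup_port_two_tx_queues(uint16_t port_id, uint16_t nb_rxd, uint16_t nb_txd,
                             const struct rte_eth_conf *port_conf,
                             struct rte_mempool *mbuf_pool)
    {
        /* 1 RX queue, 2 TX queues: each sending lcore gets its own TX queue,
         * so rte_eth_tx_burst() is never called concurrently on the same
         * (port, queue) pair. */
        int ret = rte_eth_dev_configure(port_id, 1, 2, port_conf);
        if (ret < 0)
            return ret;

        ret = rte_eth_rx_queue_setup(port_id, 0, nb_rxd,
                                     rte_eth_dev_socket_id(port_id), NULL, mbuf_pool);
        if (ret < 0)
            return ret;

        ret = rte_eth_tx_queue_setup(port_id, DATA_TX_QUEUE, nb_txd,
                                     rte_eth_dev_socket_id(port_id), NULL);
        if (ret < 0)
            return ret;

        ret = rte_eth_tx_queue_setup(port_id, ARP_TX_QUEUE, nb_txd,
                                     rte_eth_dev_socket_id(port_id), NULL);
        if (ret < 0)
            return ret;

        return rte_eth_dev_start(port_id);
    }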

– nmurshed

1 Answer

[EDIT-2] The error has been narrowed down to multiple threads using the same port-id/queue-id pair, which stalls the NIC TX queue. Earlier debugging was not focused on the slow path (ARP reply), hence this was missed.
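
For context, rte_eth_tx_burst() is not thread-safe for a given (port, queue) pair, so a queue that genuinely must be shared has to be serialized by the caller. A minimal sketch of that alternative (illustration only; the fix actually used above was a dedicated queue per sender, which avoids the lock entirely):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_spinlock.h>

    /* One lock per shared TX queue; every caller must go through this wrapper. */
    static rte_spinlock_t shared_txq_lock = RTE_SPINLOCK_INITIALIZER;

    static uint16_t
    locked_tx_burst(uint16_t port_id, uint16_t queue_id,
                    struct rte_mbuf **pkts, uint16_t nb_pkts)
    {
        uint16_t sent;

        rte_spinlock_lock(&shared_txq_lock);
        sent = rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);
        rte_spinlock_unlock(&shared_txq_lock);

        return sent;
    }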

[EDIT-1] Based on the limited debug opportunities and the updates from the question, the findings are:

  1. The internal TX code increments refcnt by 2 (that is, refcnt becomes 3).
  2. Once the reply is received, the refcnt is decremented by 2.
  3. Corner cases for mbuf free are now addressed.
  4. Tested on both RHEL and CentOS; both show the issue, hence it is the software and not the OS.
  5. The NIC firmware was updated; now all platforms consistently show the error after a couple of hours of running.

Note:

  1. All pointers therefore lead to gaps in the application code and corner-case handling, since testpmd, l2fwd and l3fwd do not show the error with the DPDK library or platform.
  2. Since the code base is not shared, the only option is to rely on the updates provided.

Hence, after extensive debugging and analysis, the root cause of the issue is not DPDK, the NIC, or the platform, but a gap in the code being used.

If the code's intent is to retry up to MAX_RETRY_COUNT_RTE_ETH_TX_BURST times to send all pkt_count packets, the current code snippet needs a few corrections. Let me explain:

  1. mbuf is the array of valid packets to be transmitted.
  2. mbuf_idx is the current index into that array for the next TX attempt.
  3. pkt_count is the number of packets still to be sent in the current attempt.
  4. num_sent_pkt is the number of packets actually handed to the (physical) NIC for DMA.
  5. retry_count is the local variable keeping count of the retries.

There are two corner cases to be taken care of (not handled in the shared snippet):

  1. If MAX_RETRY_COUNT_RTE_ETH_TX_BURST is exhausted and the number of packets sent is less than the number requested, the non-transmitted mbufs need to be freed at the end of the while loop.
  2. If there are any mbufs with refcnt greater than 1 (especially with multicast, broadcast, or packet duplication), a mechanism is needed to free those too (see the sketch after the snippet below).

A possible code snippet could be:

    #define MAX_RETRY_COUNT_RTE_ETH_TX_BURST 3

    retry_count = 0;
    mbuf_idx = 0;
    pkt_count = try_sent; /* try_sent is the intended number of packets to send */

    /* if there are any mbufs with refcnt > 1, separate logic is needed to handle those */

    do {
        num_sent_pkt = rte_eth_tx_burst(eth_port_id, queue_id, &mbuf[mbuf_idx], pkt_count);

        pkt_count -= num_sent_pkt;
        mbuf_idx += num_sent_pkt;

        retry_count++;
    } while (pkt_count && (retry_count < MAX_RETRY_COUNT_RTE_ETH_TX_BURST));

    /* free any unsent packets to prevent an mbuf leak */
    if (pkt_count) {
        rte_pktmbuf_free_bulk(&mbuf[mbuf_idx], pkt_count);
    }
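
For the second corner case above (mbufs with refcnt > 1), a rough sketch of the extra handling, under the assumption that the application itself bumped the reference count for duplication or fan-out (release_tx_ref is a hypothetical helper, not part of the original code):

    #include <stdio.h>
    #include <rte_mbuf.h>

    static void
    release_tx_ref(struct rte_mbuf *m)
    {
        uint16_t refs = rte_mbuf_refcnt_read(m);

        /* rte_pktmbuf_free() only drops one reference; the buffer returns to
         * the mempool when the count reaches zero, so every owner of a
         * duplicated mbuf must free its own reference. */
        if (refs > 1)
            printf("mbuf %p: %u reference(s) will remain after this free\n",
                   (void *)m, refs - 1);

        rte_pktmbuf_free(m);
    }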

Note: the easiest way to identify an mbuf leak is to run the DPDK proc-info secondary process and check the mbuf/mempool free counts.
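
Alternatively, the pool's fill level can be watched from inside the application; a minimal sketch using the mempool counters (mbuf_pool is assumed to be the application's TX mbuf pool):

    #include <stdio.h>
    #include <rte_mempool.h>

    static void
    report_mbuf_pool_usage(const struct rte_mempool *mbuf_pool)
    {
        /* A steadily shrinking "available" count over time points to a leak. */
        printf("mbuf pool '%s': %u available, %u in use\n",
               mbuf_pool->name,
               rte_mempool_avail_count(mbuf_pool),
               rte_mempool_in_use_count(mbuf_pool));
    }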

[EDIT-1] Based on the debug, it has been identified that the refcnt is indeed greater than 1. Accumulating such corner cases leads to mempool depletion.

logs:

dump mbuf at 0x2b67803c0, iova=0x2b6780440, buf_len=9344
pkt_len=1454, ol_flags=0x180, nb_segs=1, port=0, ptype=0x291
segment at 0x2b67803c0, data=0x2b67804b8, len=1454, off=120, refcnt=3
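
For reference, a dump in this format can be produced from the application with rte_pktmbuf_dump() (dump_suspect_mbuf is just a hypothetical wrapper):

    #include <stdio.h>
    #include <rte_mbuf.h>

    static void
    dump_suspect_mbuf(const struct rte_mbuf *m)
    {
        /* dump_len = 0: print only the mbuf metadata (including the per-segment
         * refcnt), not the packet payload. */
        rte_pktmbuf_dump(stdout, m, 0);
    }
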
– Vipin Varghese
  • Hi Vipin, thanks for the reply. I have the code to free as you mentioned; I had just pasted the send part of the code. I will check the mbuf count with proc-info. – nmurshed Jun 07 '22 at 13:11
  • Question: reading online, I saw that rte_eth_tx_burst will return 0 if there are no free TX descriptors. That is not the same as mbufs, correct? – nmurshed Jun 07 '22 at 13:14
  • @nmurshed if you are asking about the TX DMA descriptor, yes, technically that is correct for a physical PMD. But in the case of a virtual PMD that is not true. – Vipin Varghese Jun 07 '22 at 13:35
  • "I have the code to free as you mentioned. I had just pasted the send part of the code." Looks like you are leaking mbufs; we can have a quick live debug (if you are available). – Vipin Varghese Jun 07 '22 at 13:36
  • Hello Vipin, I will try to set up the issue and ping you so we can check. Until then, I will check the output of dpdk-proc-info. – nmurshed Jun 07 '22 at 13:52
  • "/* if there are any mbuf with ref_cnt > 1, we need separate logic to handle those */" ==> What logic is needed here? Change refcnt to 1? – nmurshed Jun 07 '22 at 18:22
  • Hi @Vipin, question on rte_pktmbuf_free: as per the API doc, "Free an mbuf, and all its segments in case of chained buffers. Each segment is added back into its original mempool." It does not mention anything about refcnt. Is it implemented internally such that it does not free but just reduces refcnt by 1? – nmurshed Jun 08 '22 at 12:49
  • Niyaz @nmurshed as per the debug log you shared, you have only 1 segment for your mbuf; it is the refcnt which is 3. So please do not confuse refcnt with nb_segs. – Vipin Varghese Jun 08 '22 at 14:30
  • Based on my tests, it is exactly as you said: calling rte_pktmbuf_free just reduces the refcnt. – nmurshed Jun 08 '22 at 14:43