I am using DPDK 21.11 for my application. After a certain time, the the API rte_eth_tx_burst stops sending any packets out.
Ethernet Controller X710 for 10GbE SFP+ 1572 drv=vfio-pci
MAX_RETRY_COUNT_RTE_ETH_TX_BURST 3
do
{
num_sent_pkt = rte_eth_tx_burst(eth_port_id, queue_id, &mbuf[mbuf_idx], pkt_count);
pkt_count -= num_sent_pkt;
retry_count++;
} while(pkt_count && (retry_count != MAX_RETRY_COUNT_RTE_ETH_TX_BURST));
To debug, I tried to use telemetry to print out the xstats. However, i do not see any errors.
--> /ethdev/xstats,1
{"/ethdev/xstats": {"rx_good_packets": 97727, "tx_good_packets": 157902622, "rx_good_bytes": 6459916, "tx_good_bytes": 229590348448, "rx_missed_errors": 0, "rx_errors": 0, "tx_errors": 0, "rx_mbuf_allocation_errors": 0, "rx_unicast_packets": 95827, "rx_multicast_packets": 1901, "rx_broadcast_packets": 0, "rx_dropped_packets": 0, "rx_unknown_protocol_packets": 97728, "rx_size_error_packets": 0, "tx_unicast_packets": 157902621, "tx_multicast_packets": 0, "tx_broadcast_packets": 1, "tx_dropped_packets": 0, "tx_link_down_dropped": 0, "rx_crc_errors": 0, "rx_illegal_byte_errors": 0, "rx_error_bytes": 0, "mac_local_errors": 0, "mac_remote_errors": 0, "rx_length_errors": 0, "tx_xon_packets": 0, "rx_xon_packets": 0, "tx_xoff_packets": 0, "rx_xoff_packets": 0, "rx_size_64_packets": 967, "rx_size_65_to_127_packets": 96697, "rx_size_128_to_255_packets": 0, "rx_size_256_to_511_packets": 64, "rx_size_512_to_1023_packets": 0, "rx_size_1024_to_1522_packets": 0, "rx_size_1523_to_max_packets": 0, "rx_undersized_errors": 0, "rx_oversize_errors": 0, "rx_mac_short_dropped": 0, "rx_fragmented_errors": 0, "rx_jabber_errors": 0, "tx_size_64_packets": 0, "tx_size_65_to_127_packets": 46, "tx_size_128_to_255_packets": 0, "tx_size_256_to_511_packets": 0, "tx_size_512_to_1023_packets": 0, "tx_size_1024_to_1522_packets": 157902576, "tx_size_1523_to_max_packets": 0, "rx_flow_director_atr_match_packets": 0, "rx_flow_director_sb_match_packets": 13, "tx_low_power_idle_status": 0, "rx_low_power_idle_status": 0, "tx_low_power_idle_count": 0, "rx_low_power_idle_count": 0, "rx_priority0_xon_packets": 0, "rx_priority1_xon_packets": 0, "rx_priority2_xon_packets": 0, "rx_priority3_xon_packets": 0, "rx_priority4_xon_packets": 0, "rx_priority5_xon_packets": 0, "rx_priority6_xon_packets": 0, "rx_priority7_xon_packets": 0, "rx_priority0_xoff_packets": 0, "rx_priority1_xoff_packets": 0, "rx_priority2_xoff_packets": 0, "rx_priority3_xoff_packets": 0, "rx_priority4_xoff_packets": 0, "rx_priority5_xoff_packets": 0, "rx_priority6_xoff_packets": 0, "rx_priority7_xoff_packets": 0, "tx_priority0_xon_packets": 0, "tx_priority1_xon_packets": 0, "tx_priority2_xon_packets": 0, "tx_priority3_xon_packets": 0, "tx_priority4_xon_packets": 0, "tx_priority5_xon_packets": 0, "tx_priority6_xon_packets": 0, "tx_priority7_xon_packets": 0, "tx_priority0_xoff_packets": 0, "tx_priority1_xoff_packets": 0, "tx_priority2_xoff_packets": 0, "tx_priority3_xoff_packets": 0, "tx_priority4_xoff_packets": 0, "tx_priority5_xoff_packets": 0, "tx_priority6_xoff_packets": 0, "tx_priority7_xoff_packets": 0, "tx_priority0_xon_to_xoff_packets": 0, "tx_priority1_xon_to_xoff_packets": 0, "tx_priority2_xon_to_xoff_packets": 0, "tx_priority3_xon_to_xoff_packets": 0, "tx_priority4_xon_to_xoff_packets": 0, "tx_priority5_xon_to_xoff_packets": 0, "tx_priority6_xon_to_xoff_packets": 0, "tx_priority7_xon_to_xoff_packets": 0}}
I have RX-DESC = 128 and TX-DESC = 512 configured.
I am assuming there is some desc leak, is there a way to know if the drop is due to no-desc present? Which counter should I check for that"?
[More Info] Debugging refcnt lead to a deadend. Following the code, it seems that the NIC card does not set the DONE status on the descriptor. When rte_eth_tx_burst is called, the next func internally calls i40e_xmit_pkts -> i40e_xmit_cleanup
When the issue occurs, the following condition fails leading to NIC failure in sending packets out.
if ((txd[desc_to_clean_to].cmd_type_offset_bsz &
rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
PMD_TX_LOG(DEBUG, "TX descriptor %4u is not done "
"(port=%d queue=%d)", desc_to_clean_to,
txq->port_id, txq->queue_id);
return -1;
}
If I comment out the "return -1" (ofcourse not the fix and will lead to other issues) ..but I can see that traffic is stable for a long long time. I tracked all the mbuf from start of traffic till issue is hit,there is no issue seen atleast in mbuf that I could see.
I40E_TX_DESC_DTYPE_DESC_DONE will be set in h/w for the descriptor. Is there any way I can see that code? Is it part of x710 driver code?
I still doubt my own code since the issue is present even after NIC card is replaced. However, how can my code effect NIC card not modifying the DONE status of descriptor? Any suggestions would really be helpful.
[UPDATE] Found out that 2 cores were using the same TX queueID to send packets.
- Data processing and TX core
- ARP req/response by Data RX core
This lead to some potential corruption ? Found some info on this: http://mails.dpdk.org/archives/dev/2014-January/001077.html
After creating separate queue for ARP messages, issue is not seen anymore/yet for 2+hours