100% Kernel CPU kills connections to the server

Question

My server runs CentOS 6.9 (64 GB RAM) with nginx. About every 10 minutes, htop shows kernel CPU usage spiking to 100%, attributed to the kernel threads "events/10" and "ksoftirqd/10". I don't know how to find out which exact process is causing this.

This is my /proc/interrupts

$ cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15
   0:      77897          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-edge      timer
   1:          2          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-edge      i8042
   8:          1          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-edge      rtc0
   9:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
  12:          4          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-edge      i8042
  56:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      aerdrv
  63:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  64:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  65:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  66: 1426061273          0          0          0          0          0          0          0          0          0          0    1914508          0          0          0          0   PCI-MSI-edge      ahci
  67:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      ahci
  68: 3084636512          0          0          0          0          0          0          0          0          0   10149560          0          0          0          0          0   PCI-MSI-edge      eth0
 NMI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Non-maskable interrupts
 LOC: 1972128636  528409367 3519065090 2616991376 2762882221 3577269786 2407615998 2889069038 1939478243 2270996522 1940319131 2244314760 2033706214 2339089941 2303043400 2629954396   Local timer interrupts
 SPU:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Spurious interrupts
 PMI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Performance monitoring interrupts
 IWI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IRQ work interrupts
 RES: 1349612979 1979915818 1044674069  463586597 1410841781  641984863 3396971132 3062175502 2189512469 2034852778 1686264346 1571882114 1410891335 1330892006 1273321645 1195645068   Rescheduling interrupts
 CAL:    1771384    1771300    1775694    1780259    1778017    1782331    1761855    1755630    1758801    1759472    1770034    1773352    1775468    1779579    1778401    1778652   Function call interrupts
 TLB: 1295395722  623515438  528231713  457109575  438669843  412327240  413878597  392015004  399091958  373918339  391267007  362582716  383312220  348908971  376811042  337426419   TLB shootdowns
 TRM:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Threshold APIC interrupts
 MCE:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check exceptions
 MCP:      77730      77730      77730      77730      77730      77730      77730      77730      77730      77730      77730      77730      77730      77730      77730      77730   Machine check polls
 ERR:          0
 MIS:          0
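
The counters above are totals since boot, so they do not show which interrupt source is firing during a spike. A minimal sketch that diffs two snapshots of `/proc/interrupts` could narrow that down. It assumes Python 3 is available on the box; the function names `parse_interrupts`, `busiest` and `watch` are made up for illustration, and the 32-bit wrap handling is an assumption about this kernel's counters.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts text into {label: (per-CPU counts, description)}."""
    lines = text.strip("\n").splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Rows such as ERR and MIS carry a single total, not one count per CPU.
        for field in fields[:ncpus]:
            if not field.isdigit():
                break
            counts.append(int(field))
        table[label.strip()] = (counts, " ".join(fields[len(counts):]))
    return table


def busiest(before, after, top=5):
    """Return the rows whose counters grew the most between two snapshots."""
    rows = []
    for label, (counts, desc) in after.items():
        old = before.get(label, ([0] * len(counts), desc))[0]
        # Assumption: counters may be 32-bit and wrap, so take the delta mod 2**32.
        deltas = [(new - prev) % 2**32 for new, prev in zip(counts, old)]
        if sum(deltas) > 0:
            rows.append((sum(deltas), label, desc, deltas))
    rows.sort(reverse=True)
    return rows[:top]


def watch(interval=5.0, path="/proc/interrupts"):
    """Print the fastest-growing interrupt rows over one sampling interval."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_interrupts(f.read())
    for total, label, desc, deltas in busiest(before, after):
        hot_cpu = max(range(len(deltas)), key=deltas.__getitem__)
        print("%-4s %-28s +%d (mostly CPU%d)" % (label, desc, total, hot_cpu))
```

Calling `watch()` in a loop while a spike is in progress would show whether `eth0`, `ahci`, or something else accounts for most of the new interrupts, and on which CPU they land.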

This is my /proc/cpuinfo

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD Ryzen 7 PRO 1700X Eight-Core Processor
stepping        : 1
cpu MHz         : 2200.000
cache size      : 512 KB
physical id     : 0
siblings        : 16
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb perfctr_l2 arat xsaveopt npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold fsgsbase bmi1 avx2 smep bmi2 rdseed adx
bogomips        : 6786.47
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro [13] [14]

I hope you can help. These spikes make the server really unstable, even with nginx and most of the other software turned off. I already tried installing irqbalance, but it only moved the 100% load from the first CPU to the eleventh. I also had the host move my drives to another machine of the same architecture, but that didn't help either.

I already did that; temperatures and so on are OK. I also had the host put the drives in another server of the same kind, without any difference. I just want to know which steps I could take to find out EXACTLY which process/software is making this happen. — SensitiveGuy, Jun 11 '18 at 20:19
I already know that ksoftirqd is causing the kernel CPU spike (checked with htop, top, and ps aux), but I don't know which software, process, or problem is making it spike so badly. It only started happening a few days ago. I did not touch any of the CentOS system files and did not update any software. I don't really know how to debug this. — SensitiveGuy, Jun 11 '18 at 20:30
I see `ahci` and `eth0` in your `/proc/interrupts` have fairly high numbers on CPU 0. That could indicate your network usage is high(er than normal). To debug that, you might want to monitor network usage and see if you can figure out what process is periodically making so much use of the network. — , Jun 11 '18 at 20:36
Do you happen to know any command that would help me monitor network usage (related to each process)? — SensitiveGuy, Jun 11 '18 at 20:37
I just searched for it: [`iftop` and `netstat`](https://askubuntu.com/a/2417) — , Jun 11 '18 at 20:38
Stack Overflow is a site for programming and development questions. This question appears to be off-topic because it is not about programming or development. See [What topics can I ask about here](http://stackoverflow.com/help/on-topic) in the Help Center. Perhaps [Super User](http://superuser.com/) or [Unix & Linux Stack Exchange](http://unix.stackexchange.com/) would be a better place to ask. — jww, Jun 11 '18 at 21:34
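
Since `ksoftirqd/10` is the thread that runs deferred softirqs for CPU 10, one way to follow up on the network-usage suggestion is to check which softirq type (for example `NET_RX`, `TIMER` or `BLOCK`) grows fastest on that CPU during a spike. The sketch below assumes the running kernel exposes `/proc/softirqs` and that Python 3 is installed; `parse_softirqs`, `softirq_growth` and `sample` are hypothetical helper names, not part of any existing tool.

```python
import time


def parse_softirqs(text):
    """Parse /proc/softirqs text into {softirq name: [count per CPU]}."""
    lines = text.strip("\n").splitlines()
    table = {}
    for line in lines[1:]:  # lines[0] is the CPU0 CPU1 ... header
        name, _, rest = line.partition(":")
        if not rest.strip():
            continue  # skip anything that is not a "NAME: counts" row
        table[name.strip()] = [int(n) for n in rest.split()]
    return table


def softirq_growth(before, after, cpu):
    """Return (name, delta) pairs for one CPU, largest growth first."""
    growth = []
    for name, counts in after.items():
        # Assumption: counters may be 32-bit and wrap, so take the delta mod 2**32.
        delta = (counts[cpu] - before[name][cpu]) % 2**32
        growth.append((name, delta))
    return sorted(growth, key=lambda pair: pair[1], reverse=True)


def sample(cpu, interval=5.0, path="/proc/softirqs"):
    """Print per-type softirq growth on one CPU over a sampling interval."""
    with open(path) as f:
        before = parse_softirqs(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_softirqs(f.read())
    for name, delta in softirq_growth(before, after, cpu):
        print("%-14s +%d" % (name, delta))
```

Running `sample(10)` while the spike is happening would show whether the load on CPU 10 comes from network receive processing or from another softirq type, which in turn suggests whether per-process network tools such as `iftop` and `netstat` are the right next step.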

