I am hitting the following kernel BUG when EPT is enabled in a nested VM setup. I am hoping you can guide me toward the root cause and a remedy.
For background: the outer VM runs an older kernel from roughly five years ago (4.8 based). Because of the number of recipes involved and the testing/integration cost, I really cannot upgrade that kernel at this point, so I am looking for a pinpoint fix. The inner VM is started by the outer VM via QEMU/KVM.
The backtrace is shown at the end.
**I can send the full vmx.c if required. The key portion is:**
    if (is_page_fault(intr_info)) {
            /* EPT won't cause page fault directly */
            BUG_ON(enable_ept); /* line 5862 */
            cr2 = vmcs_readl(EXIT_QUALIFICATION);
            trace_kvm_page_fault(cr2, error_code);
            vcpu->arch.l1tf_flush_l1d = true;
            if (kvm_event_needs_reinjection(vcpu))
                    kvm_mmu_unprotect_page_virt(vcpu, cr2);
            return kvm_mmu_page_fault(vcpu, cr2, error_code, NULL, 0);
    }
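As a stop-gap while I debug this, I am considering downgrading the BUG_ON to a one-time warning and letting the exit fall through to the same generic page-fault path the non-EPT case already uses, so the host survives long enough to collect data. A rough sketch against the same 4.8 handle_exception() context (I understand this only masks the symptom rather than fixing whatever produces the unexpected #PF exit):

    if (is_page_fault(intr_info)) {
            cr2 = vmcs_readl(EXIT_QUALIFICATION);
            /*
             * Stop-gap: a #PF VM-exit is not expected with EPT enabled,
             * so warn (once) instead of taking down the whole host, then
             * handle it like the non-EPT path does.
             */
            WARN_ONCE(enable_ept,
                      "kvm: unexpected #PF exit with EPT, cr2=%#lx intr_info=%#x\n",
                      cr2, intr_info);
            trace_kvm_page_fault(cr2, error_code);
            vcpu->arch.l1tf_flush_l1d = true;
            if (kvm_event_needs_reinjection(vcpu))
                    kvm_mmu_unprotect_page_virt(vcpu, cr2);
            return kvm_mmu_page_fault(vcpu, cr2, error_code, NULL, 0);
    }

Would that be sane interim behaviour, or does letting kvm_mmu_page_fault() handle a fault that should never have caused an exit risk corrupting guest state further?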
The git commit log for my version of vmx.c shows:
git log --pretty='format:%H%n%ae%n%ad%n%s' vmx.c
2b79eb36303881b62e8c648e25ab03ba0aba3315
jschoenh@amazon.de
Thu Sep 7 19:02:30 2017 +0100
KVM: VMX: Do not BUG() on out-of-bounds guest IRQ
87b0dfdd7ecc12d21ff81c99fa1a3acb7caff8c9
jpoimboe@redhat.com
Tue Aug 14 12:32:19 2018 -0500
x86/kvm/vmx: Remove duplicate l1d flush definitions
6982d55e2102023a662e8ed93d70eca73216584b
pbonzini@redhat.com
Sun Aug 5 16:07:46 2018 +0200
x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
97897decd2233a735e2a63941fb9c0305e8fe737
nstange@suse.de
Sun Jul 22 13:38:18 2018 +0200
x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
bc47bdd994ad92ba2544b4ec57ca637a91943886
nstange@suse.de
Fri Jul 27 13:22:16 2018 +0200
x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
4ab97c1f0aeeb4b1a176461aff30855574d7c99c
nstange@suse.de
Sat Jul 21 22:35:28 2018 +0200
x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
d6907437383f5e603e31dd502601dca6cd2c5caa
nstange@suse.de
Sat Jul 21 22:25:00 2018 +0200
x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
3b17c7b16c96d8851cf25b0e9cc073ad38cf4d1c
nstange@suse.de
**The kernel backtrace:**
[173167.268468] ------------[ cut here ]------------
[173167.272830] kernel BUG at ./qemux86-64/kernel-source/arch/x86/kvm/vmx.c:5862!
[173167.276623] invalid opcode: 0000 [#1] PREEMPT SMP
[173167.277735] Modules linked in: igb_uio(O) uio tun bridge stp llc kvm_intel kvm irqbypass parport_pc floppy parport uvesafb
[173167.280491] CPU: 1 PID: 1236 Comm: qemu-system-x86 Tainted: G O 4.8.28-WR9.0.0.20_standard #1
[173167.282773] Hardware name: QEMU VM, BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[173167.285001] task: ffff88017428e800 task.stack: ffffc90003c44000
[173167.286323] RIP: 0010:[<ffffffffa017c26f>] [<ffffffffa017c26f>] handle_exception+0x3ef/0x440 [kvm_intel]
[173167.288680] RSP: 0018:ffffc90003c47c78 EFLAGS: 00010202
[173167.289921] RAX: 0000000000000000 RBX: ffff8801741b8000 RCX: 0000000080000b0e
[173167.291888] RDX: 0000000000000000 RSI: 000000008000030e RDI: ffff8801741b8000
[173167.293858] RBP: ffffc90003c47c98 R08: 0000000080000300 R09: ffffffff806f9d49
[173167.295845] R10: ffffffff80ad7440 R11: 0000000000000000 R12: 0000000000000000
[173167.297835] R13: 0000000000000000 R14: ffff880175904000 R15: 0000000000000000
[173167.299836] FS: 00007f5fb9882700(0000) GS:ffff88017fc80000(0000) knlGS:0000000000000000
[173167.301958] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[173167.303279] CR2: 00000000006db000 CR3: 000000017415e000 CR4: 0000000000142670
[173167.305262] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[173167.307263] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[173167.309272] Stack:
[173167.310063] ffff8801741b8000 0000000000000000 0000000000000000 0000000000000001
[173167.312196] ffffc90003c47d30 ffffffffa0184a47 ffffffffa018855d ffffffffa0188569
[173167.314319] ffffffffa018855d ffffffffa0188569 ffffffffa018855d ffffffffa0188569
[173167.316879] Call Trace:
[173167.317740] [<ffffffffa0184a47>] vmx_handle_exit+0x147/0x1450 [kvm_intel]
[173167.319230] [<ffffffffa018855d>] ? vmx_vcpu_run+0x29d/0x520 [kvm_intel]
[173167.320672] [<ffffffffa0188569>] ? vmx_vcpu_run+0x2a9/0x520 [kvm_intel]
[173167.322110] [<ffffffffa018855d>] ? vmx_vcpu_run+0x29d/0x520 [kvm_intel]
[173167.323567] [<ffffffffa0188569>] ? vmx_vcpu_run+0x2a9/0x520 [kvm_intel]
[173167.325006] [<ffffffffa018855d>] ? vmx_vcpu_run+0x29d/0x520 [kvm_intel]
[173167.326382] [<ffffffffa0188569>] ? vmx_vcpu_run+0x2a9/0x520 [kvm_intel]
[173167.327764] [<ffffffff81477613>] ? __this_cpu_preempt_check+0x13/0x20
[173167.329159] [<ffffffffa00593f1>] kvm_arch_vcpu_ioctl_run+0x981/0x1790 [kvm]
[173167.330606] [<ffffffffa017bce8>] ? vmx_vcpu_load+0x1b8/0x250 [kvm_intel]
[173167.331991] [<ffffffffa003d441>] kvm_vcpu_ioctl+0x2e1/0x590 [kvm]
[173167.333299] [<ffffffff81a56921>] ? _raw_spin_unlock_irq+0x21/0x40
[173167.334573] [<ffffffff812205bc>] ? eventfd_write+0xcc/0x220
[173167.335766] [<ffffffff811e9c77>] do_vfs_ioctl+0x97/0x5d0
[173167.336901] [<ffffffff811f4aef>] ? __fget+0x7f/0xb0
[173167.337971] [<ffffffff811ea229>] SyS_ioctl+0x79/0x90
[173167.339066] [<ffffffff81003ac8>] do_syscall_64+0x68/0xe0
[173167.340183] [<ffffffff81a56f4e>] entry_SYSCALL_64_after_swapgs+0x58/0xc6
[173167.341479] Code: d0 6d e8 e0 e9 56 fd ff ff ba 0c 44 00 00 0f 78 d0 88 83 f8 19 00 00 48 8b 43 38 a9 00 00 01 00 0f 84 95 fe
ff ff e9 a6 fc ff ff <0f> 0b 48 89 df e8 07 cb ff ff e9 3c ff ff ff 31 d2 45 31 c0 31
[173167.346510] RIP [<ffffffffa017c26f>] handle_exception+0x3ef/0x440 [kvm_intel]
[173167.348248] RSP <ffffc90003c47c78>
[173167.349107] ------------[ cut here ]------------
[173167.350104] WARNING: CPU: 1 PID: 1236 at ./build/tmp/work-shared/qemux86-64/kernel-source/kernel/softirq.c:150 __local_bh_enable_ip+0x75/0xa0
[173167.354437] Modules linked in: igb_uio(O) uio tun bridge stp llc kvm_intel kvm irqbypass parport_pc floppy parport uvesafb
[173167.357020] CPU: 1 PID: 1236 Comm: qemu-system-x86 Tainted: G O 4.8.28-WR9.0.0.20_standard #1
[173167.359139] Hardware name: QEMU VM, BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[173167.361224] 0000000000000000 ffffc90003c47770 ffffffff8145ac22 0000000000000000
[173167.363215] 0000000000000000 ffffc90003c477b0 ffffffff8107496b 0000009600000500
[173167.365201] 0000000000000201 000000000000003c 00000000024000c0 ffff88004feab8a0
[173167.367223] Call Trace:
[173167.368030] [<ffffffff8145ac22>] dump_stack+0x65/0x83
[173167.369214] [<ffffffff8107496b>] __warn+0xcb/0xf0
[173167.370345] [<ffffffff81074a5d>] warn_slowpath_null+0x1d/0x20
[173167.371630] [<ffffffff81079da5>] __local_bh_enable_ip+0x75/0xa0
[173167.372935] [<ffffffff81a56abe>] _raw_spin_unlock_bh+0x1e/0x20
[173167.374232] [<ffffffff8165de24>] cn_netlink_send_mult+0x164/0x1f0
[173167.375560] [<ffffffff8165decb>] cn_netlink_send+0x1b/0x20
[173167.376795] [<ffffffffa0000734>] uvesafb_exec+0x164/0x330 [uvesafb]
[173167.378137] [<ffffffff811b9e03>] ? kmem_cache_alloc_trace+0x173/0x1c0
[173167.379494] [<ffffffffa0000a3e>] uvesafb_blank+0x13e/0x180 [uvesafb]
[173167.380825] [<ffffffff814ba4fa>] fb_blank+0x5a/0xb0
[173167.381955] [<ffffffff814b4e27>] fbcon_blank+0x127/0x380
[173167.383145] [<ffffffff814775f7>] ? debug_smp_processor_id+0x17/0x20
[173167.384455] [<ffffffff8109be52>] ? get_nohz_timer_target+0x22/0x110
[173167.385757] [<ffffffff81099eb8>] ? preempt_count_add+0xa8/0xc0
[173167.387000] [<ffffffff810d3d1f>] ? __internal_add_timer+0x1f/0x60
[173167.388266] [<ffffffff81a568e4>] ? _raw_spin_unlock_irqrestore+0x24/0x40
[173167.389605] [<ffffffff810d6156>] ? mod_timer+0x196/0x310
[173167.390754] [<ffffffff81537a90>] do_unblank_screen+0xd0/0x1a0
[173167.391940] [<ffffffff81537b70>] unblank_screen+0x10/0x20
[173167.393075] [<ffffffff8146a625>] bust_spinlocks+0x15/0x30
[173167.394205] [<ffffffff81031805>] oops_end+0x35/0xd0
[173167.395252] [<ffffffff81031d6b>] die+0x4b/0x70
[173167.396214] [<ffffffff8102ed72>] do_trap+0xb2/0x140
[173167.397234] [<ffffffff8102efc7>] do_error_trap+0x77/0xe0
[173167.398314] [<ffffffffa017c26f>] ? handle_exception+0x3ef/0x440 [kvm_intel]
[173167.399626] [<ffffffff810d6f59>] ? hrtimer_cancel+0x19/0x20
[173167.400737] [<ffffffffa0075021>] ? start_hv_tscdeadline+0xe1/0x110 [kvm]
[173167.401991] [<ffffffffa0075336>] ? start_apic_timer+0x96/0x100 [kvm]
[173167.403203] [<ffffffff8102f640>] do_invalid_op+0x20/0x30
[173167.404269] [<ffffffff81a57c9e>] invalid_op+0x1e/0x30
[173167.405295] [<ffffffffa017c26f>] ? handle_exception+0x3ef/0x440 [kvm_intel]
[173167.406587] [<ffffffffa0188569>] ? vmx_vcpu_run+0x2a9/0x520 [kvm_intel]
[173167.407850] [<ffffffffa0184a47>] vmx_handle_exit+0x147/0x1450 [kvm_intel]
[173167.409122] [<ffffffffa018855d>] ? vmx_vcpu_run+0x29d/0x520 [kvm_intel]
[173167.410382] [<ffffffffa0188569>] ? vmx_vcpu_run+0x2a9/0x520 [kvm_intel]
[173167.411638] [<ffffffffa018855d>] ? vmx_vcpu_run+0x29d/0x520 [kvm_intel]
[173167.412881] [<ffffffffa0188569>] ? vmx_vcpu_run+0x2a9/0x520 [kvm_intel]
[173167.414116] [<ffffffffa018855d>] ? vmx_vcpu_run+0x29d/0x520 [kvm_intel]
[173167.415355] [<ffffffffa0188569>] ? vmx_vcpu_run+0x2a9/0x520 [kvm_intel]
[173167.416572] [<ffffffff81477613>] ? __this_cpu_preempt_check+0x13/0x20
[173167.417782] [<ffffffffa00593f1>] kvm_arch_vcpu_ioctl_run+0x981/0x1790 [kvm]
[173167.419064] [<ffffffffa017bce8>] ? vmx_vcpu_load+0x1b8/0x250 [kvm_intel]
[173167.420308] [<ffffffffa003d441>] kvm_vcpu_ioctl+0x2e1/0x590 [kvm]
[173167.421465] [<ffffffff81a56921>] ? _raw_spin_unlock_irq+0x21/0x40
[173167.422617] [<ffffffff812205bc>] ? eventfd_write+0xcc/0x220
[173167.423710] [<ffffffff811e9c77>] do_vfs_ioctl+0x97/0x5d0
[173167.424755] [<ffffffff811f4aef>] ? __fget+0x7f/0xb0
[173167.425734] [<ffffffff811ea229>] SyS_ioctl+0x79/0x90
[173167.426740] [<ffffffff81003ac8>] do_syscall_64+0x68/0xe0
[173167.427778] [<ffffffff81a56f4e>] entry_SYSCALL_64_after_swapgs+0x58/0xc6
[173167.429014] ---[ end trace 880155778bc81f72 ]---
[173167.431731] ---[ end trace 880155778bc81f73 ]---
To reproduce: I brought up the nested VM described above and let it sit idle, with no interaction, for less than a day; the issue then occurred on its own.
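The next step I have in mind is to instrument handle_exception() just before the BUG_ON so that, when this fires again, I can see what the exception bitmap and exit qualification actually contain (sketch only; the VMCS accessors and field names are the ones already used elsewhere in this vmx.c, the message text is mine):

    /* proposed instrumentation immediately before the BUG_ON(enable_ept) */
    if (is_page_fault(intr_info) && enable_ept)
            pr_err("kvm: unexpected #PF exit with EPT: intr_info=%#x err=%#x exit_qual=%#lx exception_bitmap=%#x\n",
                   intr_info, error_code,
                   vmcs_readl(EXIT_QUALIFICATION),
                   vmcs_read32(EXCEPTION_BITMAP));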