I'm trying to set up a cluster with OpenStack. I previously deployed Train on this cluster and it worked fine. I'm now trying to install the latest version, Antelope, with MAAS and a Juju bundle according to the directions provided here. (I've also tried the individual charm deployment guide, with the same issues.)

I've provisioned and configured my machines with MAAS (commissioned the machines, set up br-ex as an Open vSwitch bridge on all nodes, configured partitions for ceph-osd), bootstrapped Juju, and set up Vault and Placement. Juju reports that everything is OK:

[image: Juju status output]

Accessing OpenStack via the CLI or Horizon works fine, and I've set up an image, security group, flavor, public network, subnets, etc. Creating an instance reports no errors in OpenStack, but the instance fails to start. (I've tried both a CirrOS image and an Ubuntu Jammy image.) From the console, I see this error:

[image: instance console showing the boot error]

I've checked /var/log/nova on the hypervisor, and it shows no errors. I've also checked the glance, keystone, ceph and cinder logs on various machines, and I'm not seeing anything that looks even possibly related. I've re-downloaded the images from OpenStack and they match the images I uploaded, and I've verified that my downloads of these images match the official repos.

Where else can I check for errors, or what additional information does anyone need to help debug what's going on? Thanks!

UPDATE: After logging into the hypervisor and inspecting the image files used for starting the instance, it appears that nova has not downloaded the image file for the instance properly, though it's not showing any errors in the log. I checked this by mounting the base image file, and it contains little more than boot information:

root@shen35:/var/lib/nova/instances/_base# file 615313348ae2e8ff2099cc01b35e30cf6e754d3d
jdh_img_test: DOS/MBR boot sector; GRand Unified Bootloader, stage1 version 0x3, 1st sector stage2 0x10c22, extended partition table (last)

root@shen35:/var/lib/nova/instances/_base# fdisk -l 615313348ae2e8ff2099cc01b35e30cf6e754d3d
Disk 615313348ae2e8ff2099cc01b35e30cf6e754d3d: 112 MiB, 117440512 bytes, 229376 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: F05EDE64-4EFD-477A-B01E-B37CCD5D3EB4

Mounting the two partitions in there shows me normal-looking GRUB files:

ls -R jdh_mount
jdh_mount:
EFI

jdh_mount/EFI:
BOOT  ubuntu

jdh_mount/EFI/BOOT:
bootx64.efi

jdh_mount/EFI/ubuntu:
grub.cfg
root@shen35:/var/lib/nova/instances/_base# fdisk -l jdh_img_test
Disk jdh_img_test: 112 MiB, 117440512 bytes, 229376 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: F05EDE64-4EFD-477A-B01E-B37CCD5D3EB4

Device         Start    End Sectors  Size Type
jdh_img_test1  18432 229342  210911  103M Linux filesystem
jdh_img_test15  2048  18431   16384    8M EFI System

Partition table entries are not in disk order.



root@shen35:/var/lib/nova/instances/_base# ls -R
.:
615313348ae2e8ff2099cc01b35e30cf6e754d3d  ephemeral_20_40d1d2c  jdh_img_test  jdh_mount  jdh_mount_linux

./jdh_mount:

./jdh_mount_linux:
boot  initrd.img  lost+found  vmlinuz

./jdh_mount_linux/boot:
config-5.3.0-26-generic  grub  initrd.img-5.3.0-26-generic  vmlinuz-5.3.0-26-generic

./jdh_mount_linux/boot/grub:
e2fs_stage1_5  menu.lst  stage1  stage2

./jdh_mount_linux/lost+found:

I am surprised that there are apparently no files for an operating system in there, just boot files. I checked the other base file, which is the ephemeral disk attached to the instance. It's fine, but also completely empty. So I think I can confirm that the instance isn't booting because none of the image files actually contain an OS, despite having GRUB set up OK. So why isn't nova-compute getting the images right?

Update 2: I retried the previous experiment with Ubuntu Jammy, and it seems that nova is in fact downloading the image OK. The small partitions in the CirrOS image also match the image I downloaded before uploading it to OpenStack. But Jammy won't boot either:

[image: Jammy instance console showing the boot failure]

So now I think the images are downloaded properly, but QEMU isn't running them properly. Checking how QEMU is run with ps, I see this nightmare of a command:

/usr/bin/qemu-system-x86_64 -name guest=instance-00000004,debug-threads=on -S -object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-4-instance-00000004/master-key.aes"} -machine pc-i440fx-6.2,usb=off,dump-guest-core=off,memory-backend=pc.ram -accel kvm -cpu Broadwell-IBRS,vme=on,ss=on,vmx=on,pdcm=on,f16c=on,rdrand=on,hypervisor=on,arat=on,tsc-adjust=on,umip=on,md-clear=on,stibp=on,arch-capabilities=on,ssbd=on,xsaveopt=on,pdpe1gb=on,abm=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,skip-l1dfl-vmentry=on,pschange-mc-no=on -m 2 -object {"qom-type":"memory-backend-ram","id":"pc.ram","size":2097152} -overcommit mem-lock=off -smp 1,sockets=1,dies=1,cores=1,threads=1 -uuid d4920f5a-0491-4e1f-b1fc-875660d11eda -smbios type=1,manufacturer=OpenStack Foundation,product=OpenStack Nova,version=27.0.0,serial=d4920f5a-0491-4e1f-b1fc-875660d11eda,uuid=d4920f5a-0491-4e1f-b1fc-875660d11eda,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=39,server=on,wait=off -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -blockdev {"driver":"file","filename":"/var/lib/nova/instances/_base/56f431310d4bce927e45584a70e34f29141f40af","node-name":"libvirt-4-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-4-format","read-only":true,"discard":"unmap","cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-4-storage"} -blockdev {"driver":"file","filename":"/var/lib/nova/instances/d4920f5a-0491-4e1f-b1fc-875660d11eda/disk","node-name":"libvirt-2-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"} -blockdev 
{"node-name":"libvirt-2-format","read-only":false,"discard":"unmap","cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-2-storage","backing":"libvirt-4-format"} -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=libvirt-2-format,id=virtio-disk0,bootindex=1,write-cache=on -blockdev {"driver":"file","filename":"/var/lib/nova/instances/_base/ephemeral_20_40d1d2c","node-name":"libvirt-3-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-3-format","read-only":true,"discard":"unmap","cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-3-storage"} -blockdev {"driver":"file","filename":"/var/lib/nova/instances/d4920f5a-0491-4e1f-b1fc-875660d11eda/disk.eph0","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-1-format","read-only":false,"discard":"unmap","cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage","backing":"libvirt-3-format"} -device virtio-blk-pci,bus=pci.0,addr=0x5,drive=libvirt-1-format,id=virtio-disk1,write-cache=on -netdev tap,fd=42,id=hostnet0,vhost=on,vhostfd=44 -device virtio-net-pci,host_mtu=1500,netdev=hostnet0,id=net0,mac=fa:16:3e:dd:7e:73,bus=pci.0,addr=0x3 -add-fd set=3,fd=41 -chardev pty,id=charserial0,logfile=/dev/fdset/3,logappend=on -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0,bus=usb.0,port=1 -audiodev {"id":"audio1","driver":"none"} -vnc 10.246.117.211:2,audiodev=audio1 -device virtio-vga,id=video0,max_outputs=1,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -object {"qom-type":"rng-random","id":"objrng0","filename":"/dev/urandom"} -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -device vmcoreinfo -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on

Nothing obviously jumps out as wrong, though. Still lost as to what's going wrong.

  • That error comes from Grub and means the filesystem from which it is trying to boot is corrupt (or missing?). If I were diagnosing this I would probably log into the hypervisor on which this vm was running and attempt to validate the disk image from which it was booting (for a `raw` image using `losetup` to connect it to a block device and then attempting to mount it; for a `qcow2` image possibly looking into `qemu-nbd`, etc). – larsks Jul 14 '23 at 15:36
  • I've connected to the hypervisors and found the images in /var/lib/nova/instances/. The base image (in _base/) shows up as a DOS/MBR boot sector according to the Linux `file` command. The instance image (in /disk) shows as qcow2 backed by the base image. File sizes are as I'd expect (112 MB for the raw DOS/MBR copy of CirrOS in the base file and only 193 KB in the instance copy, as the instance hasn't actually booted yet). Even if these files are corrupted, I'm left wondering why OpenStack isn't setting them up properly. I'll see if I can mount them to verify sanity. – Jason Hiser Jul 14 '23 at 23:49
  • Thanks for the tips @larsks, I think the images on the hypervisor are OK. See updates to original question. – Jason Hiser Jul 15 '23 at 00:44
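
A rough sketch of the image check suggested in the comments above. Rather than `losetup` for raw images, it uses `qemu-nbd` for both formats, since that handles raw and qcow2 alike. Several things here are assumptions, not details from the question: the helper names `run` and `inspect_image` are made up, `/dev/nbd0` is assumed to be free, `/mnt/inspect` is an arbitrary mount point, the root filesystem is assumed to be partition 1, and `IMAGE_ID` is a placeholder. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
# Dry run by default: each command is printed instead of executed.
# Set DRY_RUN=0 and run as root on the hypervisor to perform the steps.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# inspect_image IMAGE FORMAT
# Attach IMAGE read-only as a block device, list its partitions,
# mount the first partition read-only, list it, then clean up.
inspect_image() {
  image="$1"
  format="$2"   # "raw" or "qcow2"
  run modprobe nbd max_part=8
  run qemu-nbd --read-only --format="$format" --connect=/dev/nbd0 "$image"
  run fdisk -l /dev/nbd0
  run mkdir -p /mnt/inspect
  run mount -o ro /dev/nbd0p1 /mnt/inspect
  run ls /mnt/inspect
  run umount /mnt/inspect
  run qemu-nbd --disconnect /dev/nbd0
}

# IMAGE_ID stands in for the hash-named file under _base/.
inspect_image /var/lib/nova/instances/_base/IMAGE_ID raw
```

The same function can be pointed at the instance's own `disk` file by passing `qcow2` as the format. Opening everything read-only keeps the check from modifying an image that nova is still using.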

1 Answer

I finally figured out what the problem was. My OpenStack flavor had --ram set to 2, which means 2 MB, not the 2 GB I had intended. This caused GRUB to fail to load the kernel, with very unhelpful error messages. And yes, this did take me over a month to figure out. :/
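
For anyone hitting the same thing: `--ram` on `openstack flavor create` is given in MiB, and QEMU's `-m` option also defaults to MiB, so the mistake is visible in the command line in the question as `-m 2` (with a 2097152-byte memory backend). Below is a minimal shell sketch of how one might spot this and build a corrected flavor command. The helper names `qemu_mem_mib` and `gib_to_mib`, the flavor name `m1.example`, and the disk and vCPU values are made up for illustration, and the final command is only printed, not run.

```shell
# Extract the value of QEMU's -m option (memory, in MiB by default)
# from a command line passed as a single string.
qemu_mem_mib() {
  printf '%s\n' "$1" | sed -n 's/.* -m \([0-9][0-9]*\).*/\1/p'
}

# Convert whole GiB to the MiB value that --ram expects.
gib_to_mib() {
  echo $(( $1 * 1024 ))
}

# Abbreviated copy of the QEMU command line from the question.
cmdline='/usr/bin/qemu-system-x86_64 -name guest=instance-00000004 -accel kvm -m 2 -overcommit mem-lock=off -msg timestamp=on'
echo "guest memory: $(qemu_mem_mib "$cmdline") MiB"

# Hypothetical replacement flavor with 2 GiB of RAM; printed rather than executed.
echo "openstack flavor create --ram $(gib_to_mib 2) --disk 20 --vcpus 1 m1.example"
```

On a live hypervisor the command line could come from something like `ps -o args= -C qemu-system-x86_64` instead of the hard-coded string, and `openstack flavor show <flavor> -c ram` should report the value the flavor currently holds.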