
I installed a rook.io Ceph storage cluster. Before installation, I cleaned up the previous installation as described here: https://rook.io/docs/rook/v1.7/ceph-teardown.html

The new cluster was provisioned correctly; however, Ceph is not healthy after provisioning and stays stuck in this state:

  data:
    pools:   1 pools, 128 pgs
    objects: 0 objects, 0 B
    usage:   20 MiB used, 15 TiB / 15 TiB avail
    pgs:     100.000% pgs not active
             128 undersized+peered
[root@rook-ceph-tools-74df559676-scmzg /]# ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP  META     AVAIL    %USE  VAR   PGS  STATUS
 0    hdd  3.63869   1.00000  3.6 TiB  5.0 MiB  144 KiB   0 B  4.8 MiB  3.6 TiB     0  0.98    0      up
 1    hdd  3.63869   1.00000  3.6 TiB  5.4 MiB  144 KiB   0 B  5.2 MiB  3.6 TiB     0  1.07  128      up
 2    hdd  3.63869   1.00000  3.6 TiB  5.0 MiB  144 KiB   0 B  4.8 MiB  3.6 TiB     0  0.98    0      up
 3    hdd  3.63869   1.00000  3.6 TiB  4.9 MiB  144 KiB   0 B  4.8 MiB  3.6 TiB     0  0.97    0      up
                       TOTAL   15 TiB   20 MiB  576 KiB   0 B   20 MiB   15 TiB     0                   
MIN/MAX VAR: 0.97/1.07  STDDEV: 0
[root@rook-ceph-tools-74df559676-scmzg /]# ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME                               STATUS  REWEIGHT  PRI-AFF
-1         14.55475  root default                                                     
-3         14.55475      host storage1-kube-domain-tld                           
 0    hdd   3.63869          osd.0                               up   1.00000  1.00000
 1    hdd   3.63869          osd.1                               up   1.00000  1.00000
 2    hdd   3.63869          osd.2                               up   1.00000  1.00000
 3    hdd   3.63869          osd.3                               up   1.00000  1.00000

Can anyone explain what went wrong and how to fix the issue?


1 Answer


The problem is that all OSDs are running on the same host while the pool's failure domain is set to host, so Ceph cannot place the required replicas on separate hosts and the PGs stay undersized+peered. Switching the failure domain to osd fixes the issue. The default failure domain can be changed as described in https://stackoverflow.com/a/63472905/3146709
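
On an already-running cluster, one way to apply this from the rook-ceph-tools pod is to create a replicated CRUSH rule whose failure domain is osd and point the pool at it. A minimal sketch (the pool name replicapool and the rule name replicated_osd are only examples; check ceph osd pool ls for the actual pool name):

    # create a replicated rule that distributes replicas across OSDs instead of hosts
    ceph osd crush rule create-replicated replicated_osd default osd
    # switch the pool to the new rule (pool name is an example)
    ceph osd pool set replicapool crush_rule replicated_osd

For a fresh deployment, the declarative equivalent in Rook is to set failureDomain: osd in the CephBlockPool spec, so the pool is created with an osd-level rule from the start.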

  • This fixed it to me, I created an vagrant+ansible repository that has 3 multimachine hosts on vagrant and 1 osd per machine for 1 drive on each machine. https://github.com/samsquire/ceph I added osd crush chooseleaf type = 0 to my ceph.conf under [global] – Samuel Squire Aug 04 '22 at 15:54
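
For reference, the setting mentioned in the comment above goes into the [global] section of ceph.conf. A minimal sketch (it changes the default CRUSH behaviour for rules created after the change; existing pools keep whatever rule they were created with):

    [global]
    # place replicas across OSDs (type 0) rather than across hosts (type 1, the default)
    osd crush chooseleaf type = 0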