
I want to replace the 1 TB drives with 2 TB drives in my small Ceph cluster.

The pool is configured with 3x replication.

I added a new drive.

Marked 2 OSDs out.

Ran ceph osd reweight-by-utilization, which started some rebalancing.

Marked 2 OSDs down.

But in Kubernetes they change their status from down back to up after about 5 minutes and start receiving data again.
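Roughly, the steps above correspond to these commands (osd.0 and osd.1 being the old, smaller OSDs):

ceph osd out 0 1                  # mark the two old OSDs out
ceph osd reweight-by-utilization  # triggered some rebalancing
ceph osd down 0 1                 # mark them down, but they come back up after ~5 minutes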

I have a few questions and would appreciate some help and advice.

  1. Can I delete osd.0 and osd.1 without losing data?
  2. Why do these 2 OSDs still have 313 PGs even though I disabled them? Is that expected? My understanding was that if I disable OSDs, their PG count should drop to 0.
  3. Because I disabled these 2 OSDs, exactly 33.34% of objects are in degraded status.
  4. If I delete these 2 OSDs, will the cluster remap again and everything be all right?
  5. Why is there so much data on osd.0, the very first OSD, and why is it not being rebalanced to the other disks at all? Initially there were 3 OSDs of 1 TB each, and now we are replacing everything with 2 TB disks.
  6. Why is there such a big difference in balance between the OSDs? Between the first and the second, and between the first two and the rest?

OSDs: (screenshot of the OSD list and utilization)

Ceph Status:

ceph -s
  cluster:
    id:     995ea7a6-9287-4e97-862e-64cf4e21213f
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum c,e,f (age 4d)
    mgr: b(active, since 4d), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 2d), 4 in (since 2d); 313 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   13 pools, 313 pgs
    objects: 255.11k objects, 976 GiB
    usage:   1.9 TiB used, 5.9 TiB / 7.8 TiB avail
    pgs:     255112/765336 objects misplaced (33.333%)
             313 active+clean+remapped
 
  io:
    client:   2.7 KiB/s rd, 101 KiB/s wr, 3 op/s rd, 9 op/s wr

ceph balancer status

{
    "active": true,
    "last_optimize_duration": "0:00:00.001511",
    "last_optimize_started": "Tue Aug 15 12:23:18 2023",
    "mode": "upmap",
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "plans": []
}

ceph osd tree

ID   CLASS  WEIGHT   TYPE NAME                                STATUS  REWEIGHT  PRI-AFF
 -1         9.58896  root default                                                      
 -5         9.58896      region n1                                                   
 -4         9.58896          zone n1-d3                                             
 -3         0.79999              host 6d855885b8-z8bj2                           
  0    ssd  0.79999                  osd.0                        up         0  1.00000
-11         3.90619              host 7b5fb4c8b8-cc9kp                           
  2    ssd  1.95309                  osd.2                        up   1.00000  1.00000
  3    ssd  1.95309                  osd.3                        up   1.00000  1.00000
 -9         4.88278              host 7b5fb4c8b8-sqx5g                           
  1    ssd  0.97659                  osd.1                        up         0  1.00000
  4    ssd  1.95309                  osd.4                        up   1.00000  1.00000
  5    ssd  1.95309                  osd.5                        up   1.00000  1.00000

Thanks!

JDev

1 Answer


First, your cluster is really small and kind of a corner case when it comes to balancing. Although the balancer is able to do some work, it might be far from perfect with so few OSDs.

  1. Can I delete osd.0 and osd.1 without losing data?

No. The default failure domain is usually host, and since osd.0 is the only OSD on its host, removing it would remove all the replicas stored on that OSD, i.e. a whole failure domain's worth. I would strongly advise to add the larger disks first, then stop the smaller OSD and set its CRUSH weight to 0 (not only the reweight, otherwise you'll have multiple data movements) and let it rebalance. When the recovery has finished, purge the small OSD.
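A minimal sketch of that sequence for osd.0, using standard Ceph CLI commands (in a Rook/Kubernetes cluster, "stopping" the OSD additionally means scaling down its deployment, which is not shown here):

ceph osd crush reweight osd.0 0           # set the CRUSH weight to 0, not only the reweight
# wait until recovery/backfill has finished (watch ceph -s)
ceph osd purge 0 --yes-i-really-mean-it   # removes the OSD from the CRUSH map, OSD map and auth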

  2. Why do these 2 OSDs still have 313 PGs even though I disabled them? Is that expected? My understanding was that if I disable OSDs, their PG count should drop to 0.

You don't have enough hosts/OSDs to let a whole failure-domain (host) recover on the remaining hosts.
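To confirm which failure domain your replicated rule actually uses, you can inspect the CRUSH rules (replicated_rule is just the common default name; yours may differ):

ceph osd crush rule ls
ceph osd crush rule dump replicated_rule   # look for "type": "host" in the chooseleaf step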

  3. Because I disabled these 2 OSDs, exactly 33.34% of objects are in degraded status.

Yes, that's expected, as explained above: you're missing a whole failure domain to safely recover from a host failure.

  4. If I delete these 2 OSDs, will the cluster remap again and everything be all right?

No, you need at least one OSD on host 6d855885b8-z8bj2 to have three failure-domains.
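If you want a pre-flight check before removing OSDs, Ceph can report whether stopping or destroying them would leave PGs without enough replicas (available in recent releases):

ceph osd ok-to-stop osd.0 osd.1        # would stopping these OSDs make PGs unavailable?
ceph osd safe-to-destroy osd.0 osd.1   # would destroying them risk data loss?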

  5. Why is there so much data on osd.0, the very first OSD, and why is it not being rebalanced to the other disks at all? Initially there were 3 OSDs of 1 TB each, and now we are replacing everything with 2 TB disks.

Still the same failure-domain issue.

  6. Why is there such a big difference in balance between the OSDs? Between the first and the second, and between the first two and the rest?

Because your hosts have different weights; check the WEIGHT column. If you had three hosts with 2 TB OSDs each, it would look different. I'm not sure if it was a mistake to assign two 2 TB disks to host 7b5fb4c8b8-sqx5g, but that might be your real issue here.
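To see the host weights and per-OSD utilization side by side, the utilization tree view is handy:

ceph osd df tree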

In such a tiny environment, replacing disks will always result in degraded objects since they have nowhere to recover to. That's not a critical issue per se, we have done it many times (failing disks sometimes can't be properly drained), it just explains what you're seeing. In a larger environment you can drain an OSD, wait until it has rebalanced, add a new disk, and then data is rebalanced back onto the new disk. That's the safest way; depending on the configured resiliency one can skip the draining and just replace disks. But you don't seem to be very familiar with Ceph yet, so I would go with the safe option.
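For the "just replace the disk" path, Ceph has a destroy command that keeps the OSD ID so the replacement disk can reuse it; a sketch (the actual redeployment step depends on your tooling, e.g. Rook, and is not shown):

ceph osd destroy 0 --yes-i-really-mean-it   # marks the OSD destroyed but keeps its ID and CRUSH entry
# then recreate the OSD on the new disk with your deployment tool so it reuses ID 0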

eblock
  • thanks for the extended answer. Yes, I'm now "learning by doing". ))) You are right about everything you say. I already fixed everything. Yes, the most reasonable thing to do is to add another disk to the host, and if necessary, restart the mgr. – JDev Aug 18 '23 at 14:05