I noticed this about 4 days ago and don't know what to do right now. The problem is as follows:

I have a 6-node, 3-monitor Ceph cluster with 84 OSDs: 72x 7200 RPM spinning disks and 12x NVMe SSDs for journaling. All scrub configuration values are at their defaults. Every PG in the cluster is active+clean and every cluster stat is green, yet the number of PGs not deep-scrubbed in time keeps increasing and is at 96 right now. Output from `ceph -s`:

  cluster:
    id:     xxxxxxxxxxxxxxxxx
    health: HEALTH_WARN
            1 large omap objects
            96 pgs not deep-scrubbed in time

  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 6h)
    mgr: mon2(active, since 2w), standbys: mon1
    mds: cephfs:1 {0=mon2=up:active} 2 up:standby
    osd: 84 osds: 84 up (since 4d), 84 in (since 3M)
    rgw: 3 daemons active (mon1, mon2, mon3)

  data:
    pools:   12 pools, 2006 pgs
    objects: 151.89M objects, 218 TiB
    usage:   479 TiB used, 340 TiB / 818 TiB avail
    pgs:     2006 active+clean

  io:
    client:   1.3 MiB/s rd, 14 MiB/s wr, 93 op/s rd, 259 op/s wr

How do I solve this problem? The `ceph health detail` output also shows that these "not deep-scrubbed in time" alerts started on January 25th, but I didn't notice them before. I only noticed when an OSD went down for 30 seconds and came back up. Might that be related to this issue? Will it just resolve itself? Should I tamper with the scrub configuration? For example, how much client-side performance loss might I face if I increase osd_max_scrubs from 1 to 2?
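
In case it matters, this is roughly how I checked that the scrub settings are still at their defaults (via the admin socket of one OSD, so it has to run on that OSD's host; osd.0 is just an example):

    # dump the running scrub-related options of a single OSD
    ceph daemon osd.0 config show | grep scrub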

Nyquillus

3 Answers


Usually the cluster deep-scrubs itself during low-I/O intervals. By default, every PG has to be deep-scrubbed once a week. If OSDs go down they can't be deep-scrubbed, of course, and that could cause some delay. You could run something like this to see which PGs are behind and whether they're all on the same OSD(s):

ceph pg dump pgs | awk '{print $1" "$23}' | column -t

Sort the output if necessary, and you can issue a manual deep-scrub on one of the affected PGs to see if the number decreases and if the deep-scrub itself works.

ceph pg deep-scrub <PG_ID>
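
For example, to look at the oldest deep-scrub stamps first (same column assumption as the awk command above; the column number can shift between Ceph releases):

    # oldest stamps sort to the top; $23 is the deep-scrub stamp column here
    ceph pg dump pgs | awk '{print $23" "$1}' | sort | head -n 20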

Please also add the output of `ceph osd pool ls detail` to see whether any flags are set.
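
Something like this should show whether the noscrub/nodeep-scrub flags are set cluster-wide or on individual pools:

    # cluster-wide flags
    ceph osd dump | grep flags
    # per-pool flags (empty output means none are set)
    ceph osd pool ls detail | grep -E 'noscrub|nodeep-scrub'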

eblock
  • The non-deep-scrubbed PG count got stuck at 96 until the scrub timer started, and I suppose it will get stuck at 96 again today. Each OSD holds 96 PGs according to my ceph-mgr dashboard, so I'm really starting to wonder whether this is related to that downed OSD. – Nyquillus Feb 09 '21 at 09:29
  • I also want to ask something else: I stopped all data flow to my CephFS pool to see how the scrubbing works out. Used space is at 273.3 terabytes, but free space keeps decreasing on its own; it has dropped by almost 1.5 terabytes since I stopped the data transfer. Related to this, the CephFS data pool is 3+2 erasure-coded and I have transferred approximately 158-160 terabytes of data. With 3+2, if I'm correct, that should come to about 265-270 terabytes with all the chunks combined, so there is no problem on that part, right? – Nyquillus Feb 09 '21 at 09:34
  • Scrubbing does not free space, if that's what you're asking. (Deep-)scrubbing compares all replicas and their checksums to keep the data chunks consistent. Your assumption wrt the 3+2 profile is correct, but keep in mind that the actual usage can be higher due to `bluestore_min_alloc_size_hdd`; read [this thread](https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/NIVVTSR2YW22VELM4BW4S6NQUCS3T4XW/) for some details. – eblock Feb 09 '21 at 10:04
  • There can be multiple reasons why your free space is decreasing although you stopped I/O. You could check the RGW logs to see who is still writing to the cluster. But keep in mind that, for example, delete operations can take quite some time and can keep reducing the free space until all deleted objects have been processed by garbage collection. Also, are you using snapshots? – eblock Feb 09 '21 at 10:07
  • We don't use snapshots, as far as I'm aware. – Nyquillus Feb 09 '21 at 10:41
  • The not-deep-scrubbed PG count went up from 96 to 100. We've been using this cluster for months and this has never happened; I'm starting to get worried about this. – Nyquillus Feb 09 '21 at 10:56
  • At this point there's no need to worry as long as you don't have inactive PGs. You don't have that many PGs; you could issue manual deep-scrubs on a couple of PGs and see how far you get. – eblock Feb 09 '21 at 11:06
  • Everything seems active+clean and, as I mentioned in the question, every cluster stat is still green. So, just to be certain: in a disaster scenario (like needing a full recovery and such), what drawback should I expect from having incomplete deep scrubs? – Nyquillus Feb 09 '21 at 11:10
  • Deep-scrubs can find inconsistencies between the replicas of a PG, so there is a possibility that in a DR scenario you end up with false data. You should definitely resolve this issue to be safe, but it's not extremely urgent. – eblock Feb 09 '21 at 15:04

You can set the deep-scrub interval to two weeks to stretch the deep-scrub window. Instead of

 osd_deep_scrub_interval = 604800

use:

 osd_deep_scrub_interval = 1209600
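
On releases with the centralized config store (Mimic and newer), something like this should apply it at runtime; on older releases put the value into ceph.conf and restart the OSDs:

    # stretch the deep-scrub interval from 7 to 14 days (value is in seconds)
    ceph config set osd osd_deep_scrub_interval 1209600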

Mr. Eblock has a good idea: manually force a deep scrub on some of the PGs to spread the actions evenly over the two weeks.

Norbert_Cs

You have 2 options:

  1. Increase the interval between deep scrubs.
  2. Control deep scrubbing manually with a standalone script.

I've written a simple PHP script which takes care of deep scrubbing for me: https://gist.github.com/ethaniel/5db696d9c78516308b235b0cb904e4ad

It lists all the PGs, picks one PG whose last deep scrub was done more than 2 weeks ago (the script takes the oldest one), checks that the OSDs the PG sits on are not busy with another scrub (i.e. the PG is in active+clean state), and only then starts a deep scrub on that PG. Otherwise it goes on looking for another PG.
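
If you prefer plain shell over PHP, a rough sketch of the same idea could look like this. It assumes jq and the JSON field names that a Nautilus-era `ceph pg dump pgs -f json` prints (pg_stats, pgid, state, last_deep_scrub_stamp), and it skips the script's per-OSD check, only requiring the PG itself to be active+clean:

    # pick the active+clean PG with the oldest deep-scrub stamp older than
    # 14 days and start a deep scrub on it (GNU date assumed)
    cutoff=$(date -d '14 days ago' '+%Y-%m-%d %H:%M:%S')
    pg=$(ceph pg dump pgs -f json 2>/dev/null | jq -r --arg cutoff "$cutoff" '
          .pg_stats[]
          | select(.state == "active+clean" and .last_deep_scrub_stamp < $cutoff)
          | [.last_deep_scrub_stamp, .pgid] | @tsv' \
        | sort | head -n 1 | cut -f2)
    [ -n "$pg" ] && ceph pg deep-scrub "$pg"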

I have osd_max_scrubs set to 1 (otherwise OSD daemons start crashing due to a bug in Ceph), so this script works nicely with the regular scheduler - whichever starts the scrubbing on a PG's OSDs first wins.

Arkadiy Bolotov