
I would like to know the difference between these 2 rules:

# rules
rule rack_rule{
 ruleset 0
 type replicated
 min_size 1
 max_size 10
 step take default
 step chooseleaf firstn 0 type rack
 step emit
}

and

rule 2rack_2host{
 ruleset 0
 type replicated
 min_size 1
 max_size 10
 step take default
 step choose firstn 2 type rack
 step chooseleaf firstn 2 type host
 step emit
}

As I understand it, the first rule rack_rule uses rack as the failure domain, so every PG should contain OSDs from different racks. For example, with 2 racks and replication size = 2, a PG like [osd.1,osd.2] should have its two OSDs in different racks.

The second rule, I think, should select 2 different racks and then, for each rack, select 2 different hosts. So again, with 2 racks and replication size = 2, the two OSDs of a PG should end up in different racks.

That is what I understood in theory, but I don't see the expected results in practice. With both of these rules I end up with OSDs from the same rack in a PG of a pool with replication size 2.
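For reference, the actual placement can be checked like this (testpool is just a placeholder for the pool name):

# which CRUSH rule the pool uses
ceph osd pool get testpool crush_rule

# the acting set of every PG in the pool, i.e. which OSDs hold it
ceph pg ls-by-pool testpool

# where those OSDs sit in the CRUSH tree (rack / host)
ceph osd tree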

USR

1 Answer


Your conclusion is not entirely correct. The first rule

step take default
step chooseleaf firstn 0 type rack

you did understand correctly. Ceph will choose as many racks (underneath the "default" root in the CRUSH tree) as the size parameter of the pool defines. The second rule works a little differently:

step take default
step choose firstn 2 type rack
step chooseleaf firstn 2 type host

Ceph will select exactly 2 racks underneath the root "default", and in each of those racks it will then choose 2 hosts. But this rule is designed for size = 4, not 2: the candidate list is built rack by rack, so with size = 2 only the first two entries are used and both replicas end up on two hosts in the same rack. By the way, don't use size = 2 at all! If you use this rule with size 2 you'll end up exactly as you already observed, and if that rack fails your PGs become inactive and clients will see I/O errors until it recovers.
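You can verify both rules with crushtool before touching the cluster. A rough sketch (map.bin is just a placeholder file name, and the rule ids 0 and 1 are assumptions; ceph osd crush rule dump shows the real ids):

# export the compiled CRUSH map from the cluster
ceph osd getcrushmap -o map.bin

# first rule with 2 replicas: each mapping should contain
# OSDs from two different racks
crushtool -i map.bin --test --rule 0 --num-rep 2 --show-mappings

# second rule with 4 replicas: 2 racks, 2 different hosts per rack
crushtool -i map.bin --test --rule 1 --num-rep 4 --show-mappings

# second rule with only 2 replicas: the candidate list is cut
# after the first two entries, i.e. two hosts in the same rack
crushtool -i map.bin --test --rule 1 --num-rep 2 --show-mappings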

crushtool is also the tool to test any changes to the CRUSH map before actually applying them to the cluster; it's very helpful, try it out!
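If you haven't used it before, the edit/test cycle looks roughly like this (again, the file names are placeholders):

# decompile the exported map and edit the rules in the text file
crushtool -d map.bin -o map.txt

# recompile the edited map
crushtool -c map.txt -o map-new.bin

# test the edited rule; --show-bad-mappings lists mappings that
# got fewer OSDs than requested
crushtool -i map-new.bin --test --rule 0 --num-rep 2 --show-mappings --show-bad-mappings

# inject it into the cluster only once the mappings look right
ceph osd setcrushmap -i map-new.bin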

eblock
  • Hey, thanks a lot for your answer. I'm back now with better Ceph knowledge: I experimented with the tool, created a lot of rules and saw how they behave on a real cluster. Exactly as you said, with the second rule 2 racks are selected and 2 hosts from each rack are chosen. That would also work for replication size 4 or even 3, because Ceph simply ignores the 4th host. – USR Mar 30 '21 at 17:32
  • If you have two racks and the failure domain is also rack, don't use `size = 3`. One rack will hold one replica and the other two. If the rack with two replicas goes down, you're left with only one "good" copy; when the failed rack comes back, it has two older PG copies against only one current version, which could lead to data corruption because the correct version could be overwritten. That's also the reason not to use size = 2. – eblock Apr 01 '21 at 10:49