What thresholds should be set in the Service Fabric Placement / Load Balancing config for a cluster with a large number of guest executable applications?
I am having trouble with Service Fabric trying to place too many services onto a single node too fast.
To give an example of cluster size: there are 2-4 worker node types, 3-6 worker nodes per node type, each node type may run 200 guest executable applications, and each application has at least 2 replicas. The nodes are more than capable of running the services once they are up; it is only during startup that CPU is too high.
The problem seems to be the thresholds or defaults for the placement and load balancing rules set in the cluster config. As examples of what I have tried: I have turned on `InBuildThrottlingEnabled` and set `InBuildThrottlingGlobalMaxValue` to 100, and I have set the global movement throttle settings to various percentages of the total application count.
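For reference, this is roughly what that looked like in the `fabricSettings` section while in-build throttling was switched on (a sketch; the full config as it stands now is dumped further down, with that throttle switched back off):

```json
{
  "Name": "PlacementAndLoadBalancing",
  "Parameters": [
    { "Name": "InBuildThrottlingEnabled", "Value": "true" },
    { "Name": "InBuildThrottlingGlobalMaxValue", "Value": "100" }
  ]
}
```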
At this point there are two distinct scenarios I am trying to solve for. In both cases, a node sits at 100% CPU long enough that Service Fabric declares the node down.
1st: Starting the entire cluster from all nodes being off, without overwhelming nodes.
2nd: A single node being overwhelmed by too many services starting at once after the host comes back online.
Here are my current parameters on the cluster:
"Name": "PlacementAndLoadBalancing", "Parameters": [ { "Name": "UseMoveCostReports", "Value": "true" }, { "Name": "PLBRefreshGap", "Value": "1" }, { "Name": "MinPlacementInterval", "Value": "30.0" }, { "Name": "MinLoadBalancingInterval", "Value": "30.0" }, { "Name": "MinConstraintCheckInterval", "Value": "30.0" }, { "Name": "GlobalMovementThrottleThresholdForPlacement", "Value": "25" }, { "Name": "GlobalMovementThrottleThresholdForBalancing", "Value": "25" }, { "Name": "GlobalMovementThrottleThreshold", "Value": "25" }, { "Name": "GlobalMovementThrottleCountingInterval", "Value": "450" }, { "Name": "InBuildThrottlingEnabled", "Value": "false" }, { "Name": "InBuildThrottlingGlobalMaxValue", "Value": "100" } ] },
Based on the discussion in the answer below, I wanted to leave a graph image: if a node goes down, the act of shuffling services onto the remaining nodes will cause a second node to go down. The green node goes down first, then the purple node goes down due to too many services being shuffled onto it.