
While looking into the resource balancer and dynamic load metrics on Service Fabric, we ran into some questions (running the devbox SDK, GA 2.0.135).
In Service Fabric Explorer (both the portal and the standalone application) we can see that balancing runs very often; most of the time it completes almost instantly, and this happens every second. But when looking at the Load Metric Information on the nodes or partitions, the values are not updated as we report load.

We send a dynamic load report based on our interaction (an HTTP request to a service), increasing the reported load of a single partition by a large amount. This spike becomes visible after roughly 5 minutes, at which point the balancer actually starts balancing, so there appears to be an interval at which the load data gets refreshed. The last-reported time gets updated all the time, but without the new value.

We added the metrics to the ApplicationManifest and the ClusterManifest to make sure they get used in the balancing. This means the resource balancer uses the same data for 5 minutes. Is this a configurable setting? Is it constrained because it is running on a devbox? We tried a lot of variables in the ClusterManifest, but none seem to affect this refresh time.
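
For reference, this is roughly how we declared the metric on the service in the ApplicationManifest; the service and metric names here are placeholders for our actual ones:

    <DefaultServices>
        <Service Name="OurService">
            <StatefulService ServiceTypeName="OurServiceType" TargetReplicaSetSize="3" MinReplicaSetSize="2">
                <UniformInt64Partition PartitionCount="5" LowKey="0" HighKey="4" />
                <LoadMetrics>
                    <!-- The custom metric we update dynamically from the service via ReportLoad -->
                    <LoadMetric Name="OurCustomMetric" Weight="High" PrimaryDefaultLoad="0" SecondaryDefaultLoad="0" />
                </LoadMetrics>
            </StatefulService>
        </Service>
    </DefaultServices>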

If this is not adjustable, can someone explain why you would run the balancer with stale data, and why this 5-minute interval was chosen?

P. Gramberg

1 Answer


This is indeed a configurable setting, and the default is 5 minutes. The idea behind it is that in prod you have tons of replicas all reporting load all the time, so you want to batch those reports up rather than spam the Cluster Resource Manager with each one as an independent message.

You're probably right that this value is way too long for local development. We'll look into changing that for the local clusters, but in the meantime you can add the following to your local cluster manifest to change how long we wait by default. If the ReconfigurationAgent section already contains other settings, just add the SendLoadReportInterval line. The value is in seconds, and you can adjust it accordingly. The snippet below changes the default load reporting interval from 5 minutes (300 seconds) to 1 minute (60 seconds).

    <Section Name="ReconfigurationAgent">
        <Parameter Name="SendLoadReportInterval" Value="60" />
    </Section>

Please note that doing so does increase load on some of the system services (TANSTAAFL), and as always, if you're operating on a generated or complete cluster manifest, be sure to run Test-ServiceFabricClusterManifest before deploying it. If you're working with a local development cluster, the easiest way to get the change deployed is probably to modify the cluster manifest template (by default here: "C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup\NonSecure\ClusterManifestTemplate.xml"), add the line there, then right-click the Service Fabric Local Cluster Manager in your system tray and select "Reset Local Cluster". This regenerates the local cluster with your changes to the template.
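
For context, the section lives inside the manifest's <FabricSettings> element, alongside whatever sections are already there. This is a minimal sketch of the surrounding structure (the other section names are just examples), not a complete manifest:

    <FabricSettings>
        <!-- ...existing sections (Security, FailoverManager, etc.) stay as they are... -->
        <Section Name="ReconfigurationAgent">
            <!-- How long replicas batch up load reports before sending them, in seconds -->
            <Parameter Name="SendLoadReportInterval" Value="60" />
        </Section>
    </FabricSettings>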

masnider
  • You mentioned the local dev scenario, but how do we accomplish this for an Azure cluster deployed with an ARM JSON template? I don't see the 'SendLoadReportInterval' setting listed on [Customize Service Fabric cluster settings](https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-fabric-settings). – rktect May 18 '17 at 21:31
  • That's because it isn't really a setting we think people should touch and hence is marked Internal - tweaking this value can break your cluster by introducing too much load on system services. In production we don't see people needing to touch it in practice, so it's not described. The same syntax translation from XML to JSON that you see [here](https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-resource-manager-cluster-description#buffered-capacity) for example would apply if for some reason you did want to change this in prod. Not recommended. – masnider May 24 '17 at 22:26
  • Fair enough - very helpful. I want to stick to recommended settings but just for the sake of my [lack of] understanding - say I have a watchdog service monitoring my metrics, does this mean in production it is generally acceptable for any visibility or reactive actions to be delayed by up to five minutes? I'm trying to leverage the built-in functionality as much as possible. – rktect May 25 '17 at 01:22
  • Yes, that's what would happen, and that's the upper bound. That sort of delay is generally acceptable for balancing and for fixing these idealized metric capacity constraints; this is cluster resource management - optimization. Also keep in mind that as this scales up to many nodes and services, if things are really that frothy then you'll be seeing constraints get violated and fixed all the time (every run of the Cluster Resource Manager - every few seconds), since the nodes' reports end up smeared across the whole 5-minute reporting window. In practice they end up "constantly reporting". – masnider May 25 '17 at 18:29
  • Also just as a note - If you really need hard, instant governance of resource consumption then *logical* metrics probably aren't the best place for those requirements and you should look at the [Resource Governance](https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-resource-governance) stuff that we released in 5.6, but that's only for real resources (CPU and Memory as of this writing). – masnider May 25 '17 at 18:30
  • Extremely helpful responses - thanks @masnider for taking the time. Side note that you and the SF team rock! – rktect May 25 '17 at 19:50
  • Also, for now I'm using a metric weight of Zero and my custom logical metrics are just for visibility (not resource management/governance). I wanted to leverage the internal SF metrics for this as it does all the aggregation and query exposure already. – rktect May 25 '17 at 19:56