
I have a Service Fabric application with a stateless Web API service and a stateful service with two partitions. The stateless service defines a Web API controller and uses ServiceProxy.Create to get a remoting proxy for the stateful service. The remoting call puts a message into a reliable queue.

The stateful service will dequeue the messages from the queue every X minutes.

I am looking at Service Fabric Explorer, and my application has been in an error state for the past few days. When I drill down into the details, the stateful service shows the following error:

Error event: SourceId='System.FM', Property='State'. Partition is in quorum loss.

Looking at the explorer I see that I have my primary replica up and running and it seems like a single ActiveSecondary, but the other two replicas show IdleSecondary and they keep going into a Standby / In Build state. I cannot figure out why this is happening.

What are some reasons my other secondaries would keep failing to reach the ActiveSecondary state and cause this quorum loss?


Dismissile
  • Can you add the output of the PowerShell command 'Get-ServiceFabricClusterHealth'? – LoekD Sep 20 '16 at 14:25
  • It reports that my service and partitions are unhealthy, but does not give any details. – Dismissile Sep 20 '16 at 14:40
  • How many nodes and replicas do you have in configs? – Serge Semenov Sep 20 '16 at 15:09
  • 5 Node cluster, Min Replica = 2, Target = 3. 2x Partitions (both of them are in this failed state though) – Dismissile Sep 20 '16 at 15:14
  • How many apps do you have in the cluster? Did you update the app before that issue started to occur? – cassandrad Oct 06 '16 at 09:57
  • I had 3 custom apps and just the built in service fabric apps at the time. I hadn't touched this app in over 3 weeks and hadn't really been paying attention to anything in the cluster (it's development and we were focused on other things). Then I took a look at the cluster and noticed it was in quorum loss and had no idea why. – Dismissile Oct 06 '16 at 13:24
  • https://learn.microsoft.com/en-us/powershell/servicefabric/vlatest/Repair-ServiceFabricPartition?redirectedfrom=msdn I am not sure of the cause yet, but this may help resolve it. I would test against a non-prod cluster before using it, though, as I am not yet aware of the consequences of rebuilding, particularly with Repair-ServiceFabricPartition -System. – Daniel M Jan 27 '17 at 21:42

3 Answers


Try resetting the cluster. I faced the same issue with a single partition for my service, and resetting the cluster fixed the error.

Rumpi Guha
  • That worked for me last time -- but it seems to come back every so often. -- It amazes me that a cluster with a single node can somehow be in `InQuorumLoss`. -- Like, how the heck did it even happen -- did it lose an argument that it was having with itself?? – BrainSlugs83 Jun 07 '18 at 23:01

Have you checked the Windows Event Log on the nodes for additional error messages?

I had a similar problem, except I was using a ReliableDictionary. Did you properly implement IEquatable<T> and IComparable<T>? I had a similar problem because my T had a dictionary field, and I was calling Equals on a dictionary directly, instead of comparing the keys and values. Same thing for GetHashCode.

The clue in the event logs was this message: Assert=Cannot update an item that does not exist (null). It only happened when I edited a key in the ReliableDictionary.
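
As a rough illustration of the equality problem described above, here is a minimal sketch of a type with a dictionary field that compares the dictionary's keys and values instead of calling Equals on the dictionary itself. The type name `TaggedKey` and its fields are hypothetical and not taken from the original code; the ordinal string comparison is also an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type for illustration only; names and fields are assumptions.
public sealed class TaggedKey : IEquatable<TaggedKey>, IComparable<TaggedKey>
{
    public string Name { get; }
    public IReadOnlyDictionary<string, string> Tags { get; }

    public TaggedKey(string name, IDictionary<string, string> tags)
    {
        Name = name ?? throw new ArgumentNullException(nameof(name));
        // Copy so later changes to the caller's dictionary cannot alter this instance.
        Tags = new Dictionary<string, string>(tags, StringComparer.Ordinal);
    }

    public bool Equals(TaggedKey other)
    {
        if (other is null) return false;
        if (ReferenceEquals(this, other)) return true;
        if (!string.Equals(Name, other.Name, StringComparison.Ordinal)) return false;
        if (Tags.Count != other.Tags.Count) return false;

        // Compare entry by entry; Dictionary.Equals would only compare references.
        foreach (var pair in Tags)
        {
            if (!other.Tags.TryGetValue(pair.Key, out var value) ||
                !string.Equals(pair.Value, value, StringComparison.Ordinal))
            {
                return false;
            }
        }
        return true;
    }

    public override bool Equals(object obj) => Equals(obj as TaggedKey);

    public override int GetHashCode()
    {
        unchecked
        {
            // Sum per-entry hashes so the result does not depend on enumeration order.
            int hash = StringComparer.Ordinal.GetHashCode(Name);
            foreach (var pair in Tags)
            {
                int valueHash = pair.Value is null ? 0 : StringComparer.Ordinal.GetHashCode(pair.Value);
                hash += (StringComparer.Ordinal.GetHashCode(pair.Key) * 397) ^ valueHash;
            }
            return hash;
        }
    }

    public int CompareTo(TaggedKey other)
    {
        if (other is null) return 1;
        int byName = string.CompareOrdinal(Name, other.Name);
        if (byName != 0) return byName;

        // Walk entries in sorted key order so CompareTo returns 0 exactly when Equals is true.
        var mine = Tags.OrderBy(p => p.Key, StringComparer.Ordinal).ToList();
        var theirs = other.Tags.OrderBy(p => p.Key, StringComparer.Ordinal).ToList();
        for (int i = 0; i < Math.Min(mine.Count, theirs.Count); i++)
        {
            int byKey = string.CompareOrdinal(mine[i].Key, theirs[i].Key);
            if (byKey != 0) return byKey;
            int byValue = string.CompareOrdinal(mine[i].Value, theirs[i].Value);
            if (byValue != 0) return byValue;
        }
        return mine.Count.CompareTo(theirs.Count);
    }
}
```

With this sketch, two instances built from dictionaries holding the same entries in a different insertion order compare as equal, return the same hash code, and give 0 from CompareTo. The original type may have needed different comparison rules.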


aoetalks