Partition is in quorum loss

Question

I have a Service Fabric application with a stateless Web API service and a stateful service with two partitions. The stateless service defines a Web API controller and uses ServiceProxy.Create to get a remoting proxy for the stateful service. The remoting call puts a message into a reliable queue.

The stateful service dequeues the messages from the queue every X minutes.

I am looking at Service Fabric Explorer, and my application has been in an error state for the past few days. When I drill down into the details, the stateful service shows the following error:

Error event: SourceId='System.FM', Property='State'. Partition is in quorum loss.

Looking at the explorer I see that I have my primary replica up and running and it seems like a single ActiveSecondary, but the other two replicas show IdleSecondary and they keep going into a Standby / In Build state. I cannot figure out why this is happening.

What are some of the reasons my other secondaries keep failing to reach the ActiveSecondary state and causing this quorum loss?

Can you add the output of the PowerShell command 'Get-ServiceFabricClusterHealth'? — LoekD, Sep 20 '16 at 14:25
It reports that my service and partitions are unhealthy, but does not give any details. — Dismissile, Sep 20 '16 at 14:40
5 Node cluster, Min Replica = 2, Target = 3. 2x Partitions (both of them are in this failed state though) — Dismissile, Sep 20 '16 at 15:14
How many apps do you have in the cluster? Did you update the app before that issue started to occur? — cassandrad, Oct 06 '16 at 09:57
I had 3 custom apps and just the built-in Service Fabric apps at the time. I hadn't touched this app in over 3 weeks and hadn't really been paying attention to anything in the cluster (it's development and we were focused on other things). Then I took a look at the cluster, noticed it was in quorum loss, and had no idea why. — Dismissile, Oct 06 '16 at 13:24
https://learn.microsoft.com/en-us/powershell/servicefabric/vlatest/Repair-ServiceFabricPartition?redirectedfrom=msdn I am not sure of the cause yet, but this may help resolve it. I would test against a non-prod cluster before using it, though, as I am not yet aware of the consequences of rebuilding, particularly with Repair-ServiceFabricPartition -System. — Daniel M, Jan 27 '17 at 21:42
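
As background for the error text, quorum is usually defined as a strict majority of the replicas in a partition's replica set. The sketch below is plain Python, not Service Fabric code: the function names are hypothetical, and the majority rule is an assumption about how the quorum size is derived. It applies that rule to the target replica set size of 3 mentioned in the comments.

```python
def quorum_size(replica_count: int) -> int:
    """Return the smallest strict majority of a replica set (assumed rule)."""
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")
    return replica_count // 2 + 1


def is_quorum_lost(replica_count: int, replicas_up: int) -> bool:
    """True when fewer replicas are up than a majority requires."""
    return replicas_up < quorum_size(replica_count)


if __name__ == "__main__":
    # Target replica set size of 3, as stated in the comments.
    target = 3
    print("quorum for", target, "replicas:", quorum_size(target))
    for up in range(target + 1):
        print(up, "replicas up -> quorum lost:", is_quorum_lost(target, up))
```

Under that rule a three-replica partition tolerates one replica being down, and a single-replica partition is in quorum loss whenever its only replica is unavailable.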

score 1 · Answer 1 · answered Jun 06 '18 at 07:19

Try resetting the cluster. I was facing the same issue with 1 partition for my service, and resetting the cluster fixed the error.

Rumpi Guha

That worked for me last time, but it seems to come back every so often. It amazes me that a cluster with a single node can somehow be in `InQuorumLoss`. Like, how the heck did it even happen? Did it lose an argument that it was having with itself? – BrainSlugs83 Jun 07 '18 at 23:01

score 0 · Answer 2 · answered Dec 22 '16 at 03:18

Have you checked the Windows Event Log on the nodes for additional error messages?

I had a similar problem, except I was using a ReliableDictionary. Did you properly implement IEquatable<T> and IComparable<T>? I had a similar problem because my T had a dictionary field, and I was calling Equals on a dictionary directly, instead of comparing the keys and values. Same thing for GetHashCode.

The clue in the event logs was this message: Assert=Cannot update an item that does not exist (null). It only happened when I edited a key in a ReliableDictionary.
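
The pitfall described above can be illustrated outside Service Fabric. In C#, `Dictionary<TKey, TValue>` does not override `Equals`, so calling it compares references, not contents. The sketch below is a Python analogue, not the C# fix itself: `BrokenKey` and `Key` are hypothetical types, and Python's default identity-based equality stands in for the reference comparison. The point is that equality and hashing must both be derived from the dictionary field's keys and values.

```python
class BrokenKey:
    """Key type that keeps the default identity-based equality and hash."""

    def __init__(self, name, tags):
        self.name = name
        self.tags = dict(tags)


class Key:
    """Key type whose equality and hash are built from the field contents."""

    def __init__(self, name, tags):
        self.name = name
        self.tags = dict(tags)

    def __eq__(self, other):
        if not isinstance(other, Key):
            return NotImplemented
        # Compare the dictionary field by its keys and values.
        return self.name == other.name and self.tags == other.tags

    def __hash__(self):
        # Must agree with __eq__: equal keys produce equal hashes.
        # Assumes the tag values are hashable.
        return hash((self.name, frozenset(self.tags.items())))


if __name__ == "__main__":
    broken_store = {BrokenKey("order", {"region": "eu"}): "payload"}
    # A second instance with identical contents is treated as a different key.
    print(BrokenKey("order", {"region": "eu"}) in broken_store)

    store = {Key("order", {"region": "eu"}): "payload"}
    # Content-based equality and hashing let the lookup find the entry.
    print(Key("order", {"region": "eu"}) in store)
```

With the identity-based type, a lookup or update using a freshly built key with identical contents misses the stored entry, which is the kind of mismatch that could surface as an update against an item that "does not exist".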

It's been a long time since it has occurred but I will try this next time. — Dismissile, Dec 22 '16 at 13:24

score 0 · Answer 3 · answered Apr 26 '23 at 04:14

Repair-ServiceFabricPartition -All

https://learn.microsoft.com/en-us/powershell/module/servicefabric/repair-servicefabricpartition?view=azureservicefabricps

I Stand With Russia