
I read in Akka's documentation that when using Cluster Singleton one should avoid automatic downing. I don't understand how downing should be configured in that case. I understand that I can subscribe to cluster membership events and plan my strategy according to those messages, but I don't see how, in practice, that would differ from automatic downing.

When a node is somehow partitioned from the cluster, if automatic downing is used, the partitioned node will "think" that the entire cluster has gone missing and start a cluster of its own (with its own singleton). On the other hand, I can't keep unreachable nodes in the unreachable state forever, because the cluster won't reach convergence (new nodes won't be able to join), and if the partitioned node was hosting the singleton itself, a new singleton won't be assigned. So, according to my understanding, the only thing left to do is to remove unreachable nodes after some grace time, which is exactly what automatic downing does.
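For concreteness, automatic downing amounts to a single setting; a minimal sketch of enabling it (the 10-second grace period is an arbitrary example value):

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Enable automatic downing: unreachable nodes are removed after the
// grace period. The 10s value here is only an example.
val config = ConfigFactory.parseString(
  "akka.cluster.auto-down-unreachable-after = 10s"
).withFallback(ConfigFactory.load())

val system = ActorSystem("ClusterSystem", config)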

What am I missing here?

Mizh
  • I have the same question as you. It seems that we have no way to prevent two cluster partitions from starting their own Cluster Singleton. – mingchuno Jun 03 '15 at 16:43

1 Answer


Check out the code below. I have turned off the auto-down-unreachable-after feature as the docs recommend, and implemented custom downing logic instead, which differs a bit from the usual approach. The key point is that when a network partition happens, only the cluster nodes that form a majority take down the UnreachableMembers, after a configurable grace period (5s here). The minority nodes, on the other hand, treat the UnreachableMembers (which from their point of view are the majority group) as unreachable but do not take them down, so they never form an island. The idea of requiring a majority is borrowed from MongoDB, and it is not new in computer science.

import akka.actor.{ Actor, ActorLogging }
import akka.cluster.{ Cluster, Member }
import akka.cluster.ClusterEvent._

import scala.concurrent.duration._

class ClusterListener extends Actor with ActorLogging {

  val cluster = Cluster(context.system)
  var unreachableMembers: Set[Member] = Set()

  // subscribe to cluster changes, re-subscribe on restart
  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[UnreachableMember], classOf[ReachableMember])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case UnreachableMember(member) =>
      log.info("Member detected as unreachable: {}", member)
      val state = cluster.state
      // only the majority side of a partition schedules takedowns;
      // the minority side keeps the others as unreachable and waits
      if (isMajority(state.members.size, state.unreachable.size)) {
        scheduleTakeDown(member)
      }
    case ReachableMember(member) =>
      unreachableMembers = unreachableMembers - member
    case _: MemberEvent => // ignore other membership events
    case "die" =>
      // down every member still marked unreachable after the grace period
      unreachableMembers.foreach { member =>
        cluster.down(member.address)
      }
  }

  // smallest strict majority of a group of n nodes,
  // e.g. majority(5) == 3, majority(4) == 3, majority(2) == 2
  private def majority(n: Int): Int = (n + 1) / 2 + (n + 1) % 2

  // true when the reachable nodes still form a majority of the whole cluster
  private def isMajority(total: Int, dead: Int): Boolean = {
    require(total > 0)
    require(dead >= 0)
    (total - dead) >= majority(total)
  }

  private def scheduleTakeDown(member: Member) = {
    implicit val dispatcher = context.system.dispatcher
    unreachableMembers = unreachableMembers + member
    // TODO: make the 5s grace period configurable
    context.system.scheduler.scheduleOnce(5.seconds, self, "die")
  }

}
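To try this out, the listener would be started on every node alongside the rest of the application; a minimal sketch (the system name and actor name are placeholders, not part of the answer):

import akka.actor.{ ActorSystem, Props }

object Main extends App {
  // "ClusterSystem" is a placeholder; it must match your cluster config
  val system = ActorSystem("ClusterSystem")
  system.actorOf(Props[ClusterListener], name = "clusterListener")
}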
mingchuno
  • Thanks for your answer, but I don't understand something. When the partitioned (minority) nodes return to the majority, let's say after the network/GC issues are resolved, would they be able to connect to the majority again without restarting their actor system (in order to generate a new token)? Because as far as I know, if a node was removed from the cluster it can't return with the same token. – Mizh Jun 04 '15 at 10:28
  • Hello, I know this is old, but for anyone looking for the answer: the partitioned nodes that were marked as down by the majority have to be explicitly restarted in order to join the cluster once again. http://doc.akka.io/docs/akka/2.4.2/scala/cluster-usage.html#Joining_to_Seed_Nodes – Cal Feb 22 '16 at 16:27
  • I have tested this on Akka 2.4.3 and have confirmed it to be working. I have two machines; I run the majority of my nodes on machine A and a minority on machine B, then sever the connection between A and B to simulate a partition. The machine A nodes issue takedowns on the minority (machine B nodes), and the minority nodes are able to realize that they are in fact the minority. Note that the minority nodes do not terminate themselves and remain running; additional logic is needed to shut them down (see the sketch after these comments). Search "akka cluster split brain and reconnect" for more info. Thanks for sharing @mingchuno – Cal Apr 09 '16 at 20:11
  • Out of curiosity, can you confirm that such an actor does not need to be started on all machines of the cluster? I guess if it runs on the same nodes as the ones where there is a Singleton instance, that is good enough, right? – CanardMoussant Nov 24 '16 at 15:57
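Following up on Cal's point that the minority nodes keep running: a minimal sketch of that extra shutdown logic, under the assumption that a minority node should terminate its own ActorSystem so an external supervisor (systemd, Kubernetes, etc.) can restart the process, which then rejoins via the seed nodes. The class name and the self-termination choice are illustrative, not part of mingchuno's answer:

import akka.actor.{ Actor, ActorLogging }
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{ InitialStateAsEvents, UnreachableMember }

// Companion listener that shuts the local node down when it finds itself
// on the minority side of a partition, instead of forming an island.
class MinoritySideTerminator extends Actor with ActorLogging {
  val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[UnreachableMember])
  override def postStop(): Unit = cluster.unsubscribe(self)

  // same majority rule as in the answer above
  private def majority(n: Int): Int = (n + 1) / 2 + (n + 1) % 2

  def receive = {
    case UnreachableMember(_) =>
      val state = cluster.state
      val reachable = state.members.size - state.unreachable.size
      if (reachable < majority(state.members.size)) {
        // we are the minority: terminate so the process can be restarted
        // and rejoin via the seed nodes with a fresh incarnation
        log.warning("Minority side of a partition, terminating ActorSystem")
        context.system.terminate() // use system.shutdown() on Akka 2.3.x
      }
  }
}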