
I have a 3-node Akka Cluster with 3 actors running on each node. The cluster runs fine for about 2 hours, but after that I start getting the following warnings:

[INFO] [06/07/2018 15:08:51.923] [ClusterSystem-akka.remote.default-remote-dispatcher-6] [akka.tcp://ClusterSystem@192.168.2.8:2552/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FClusterSystem%40192.168.2.7%3A2552-112] No response from remote for outbound association. Handshake timed out after [15000 ms].

[WARN] [06/07/2018 15:08:51.923] [ClusterSystem-akka.remote.default-remote-dispatcher-18] [akka.tcp://ClusterSystem@192.168.2.8:2552/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FClusterSystem%40192.168.2.7%3A2552-8] Association with remote system [akka.tcp://ClusterSystem@192.168.2.7:2552] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://ClusterSystem@192.168.2.7:2552]] Caused by: [No response from remote for outbound association. Handshake timed out after [15000 ms].]

[WARN] [06/07/2018 16:07:06.347] [ClusterSystem-akka.actor.default-dispatcher-101] [akka.remote.PhiAccrualFailureDetector@3895fa5b] heartbeat interval is growing too large: 2839 millis

Edit: the response from the Akka Cluster Management API

{
  "selfNode": "akka.tcp://ClusterSystem@127.0.0.1:2551",
  "leader": "akka.tcp://ClusterSystem@127.0.0.1:2551",
  "oldest": "akka.tcp://ClusterSystem@127.0.0.1:2551",
  "unreachable": [
    {
      "node": "akka.tcp://ClusterSystem@127.0.0.1:2552",
      "observedBy": [
        "akka.tcp://ClusterSystem@127.0.0.1:2551",
        "akka.tcp://ClusterSystem@127.0.0.1:2560"
      ]
    }
  ],
  "members": [
    {
      "node": "akka.tcp://ClusterSystem@127.0.0.1:2551",
      "nodeUid": "105742380",
      "status": "Up",
      "roles": [
        "Frontend",
        "dc-default"
      ]
    },
    {
      "node": "akka.tcp://ClusterSystem@127.0.0.1:2552",
      "nodeUid": "-150160059",
      "status": "Up",
      "roles": [
        "RuleExecutor",
        "dc-default"
      ]
    },
    {
      "node": "akka.tcp://ClusterSystem@127.0.0.1:2560",
      "nodeUid": "-158907672",
      "status": "Up",
      "roles": [
        "RuleExecutor",
        "dc-default"
      ]
    }
  ]
}
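
For reference, the response above is what the cluster HTTP management routes return from `GET /cluster/members`. A minimal configuration sketch that exposes this endpoint, assuming the `akka-management-cluster-http` module is on the classpath (the port shown is the module's default):

```hocon
# Sketch: exposes GET /cluster/members (the response shown above).
# Assumes the akka-management-cluster-http dependency is added.
akka.management.http {
  hostname = "127.0.0.1"   # bind address for the management endpoint
  port = 8558              # akka-management's default port
}
```

The routes are then started programmatically, e.g. with `AkkaManagement(system).start()`.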

**Edit1:** Cluster setup and failure-detector configuration

cluster {
      jmx.multi-mbeans-in-same-jvm = on
      roles = ["Frontend"]
      seed-nodes = [
        "akka.tcp://ClusterSystem@192.168.2.9:2551"]
      auto-down-unreachable-after = off

      failure-detector {

        # FQCN of the failure detector implementation.
        # It must implement akka.remote.FailureDetector and have
        # a public constructor with a com.typesafe.config.Config and
        # akka.actor.EventStream parameter.
        implementation-class = "akka.remote.PhiAccrualFailureDetector"

        # How often keep-alive heartbeat messages should be sent to each connection.
        # heartbeat-interval = 10 s

        # Defines the failure detector threshold.
        # A low threshold is prone to generate many wrong suspicions but ensures
        # a quick detection in the event of a real crash. Conversely, a high
        # threshold generates fewer mistakes but needs more time to detect
        # actual crashes.
        threshold = 18.0

        # Number of the samples of inter-heartbeat arrival times to adaptively
        # calculate the failure timeout for connections.
        max-sample-size = 1000

        # Minimum standard deviation to use for the normal distribution in
        # AccrualFailureDetector. Too low standard deviation might result in
        # too much sensitivity for sudden, but normal, deviations in heartbeat
        # inter arrival times.
        min-std-deviation = 100 ms

        # Number of potentially lost/delayed heartbeats that will be
        # accepted before considering it to be an anomaly.
        # This margin is important to be able to survive sudden, occasional,
        # pauses in heartbeat arrivals, due to for example garbage collect or
        # network drop.
        acceptable-heartbeat-pause = 15 s

        # Number of member nodes that each member will send heartbeat messages to,
        # i.e. each node will be monitored by this number of other nodes.
        monitored-by-nr-of-members = 2

        # After the heartbeat request has been sent the first failure detection
        # will start after this period, even though no heartbeat message has
        # been received.
        expected-response-after = 10 s

      }

    }
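For context on the `threshold = 18.0` setting above: the phi accrual detector scores each silence as phi = -log10(probability that the heartbeat is merely late), based on the observed inter-arrival history, and marks a node unreachable once phi crosses the threshold. A minimal sketch of that calculation in Python (a simplified normal-CDF model with illustrative names; Akka's real implementation uses an approximation, a sliding sample window, and also adds `acceptable-heartbeat-pause` to the estimated mean):

```python
import math

def phi(time_since_last_ms, mean_ms, std_dev_ms, min_std_dev_ms=100.0):
    """Illustrative phi score: -log10(P(heartbeat arrives later than now))."""
    std = max(std_dev_ms, min_std_dev_ms)          # mirrors min-std-deviation
    z = (time_since_last_ms - mean_ms) / std
    p_later = 0.5 * math.erfc(z / math.sqrt(2.0))  # 1 - normal CDF
    return -math.log10(max(p_later, 1e-300))       # clamp to avoid log10(0)

# Right at the expected 1 s interval, phi is tiny:
print(round(phi(1000, 1000, 100), 2))   # -log10(0.5) ≈ 0.3
# One extra second of silence already pushes phi well past a threshold of 18:
print(round(phi(2000, 1000, 100), 1))
```

This is why the "heartbeat interval is growing too large" warning matters: as observed inter-arrival times stretch (e.g. from GC pauses), the history the detector learns from degrades and detection becomes both slower and noisier.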
  • You should add akka http management to your project to see what's going on, see: https://developer.lightbend.com/docs/akka-management/current/cluster-http-management.html – Bennie Krijger Jun 20 '18 at 17:26
  • @BennieKrijger The node becomes unreachable after some time. I checked the network and found that there is no issue with it. – Prog_G Jun 21 '18 at 08:53
  • @BennieKrijger I have added the Response from the Management API – Prog_G Jun 21 '18 at 09:20
  • What are your failure-detector settings? And are you running on AWS? See: https://doc.akka.io/docs/akka/current/cluster-usage.html?language=scala#failure-detector for recommended settings – Bennie Krijger Jun 21 '18 at 09:22
  • @BennieKrijger I have added the detector settings. – Prog_G Jun 21 '18 at 09:31
  • @BennieKrijger No, I am not running on AWS – Prog_G Jun 21 '18 at 10:09
  • It could be that some of the nodes are running out of heap space and taking too long for garbage collection. Long GC pauses can cause the heartbeat interval to grow and eventually time out. So you could check whether your nodes are spending too long in GC operations. – Vishnu P N Jul 17 '18 at 10:58
  • It gives me this warning right from when I start my program: `[WARN] [06/07/2018 16:07:06.347] [ClusterSystem-akka.actor.default-dispatcher-101] [akka.remote.PhiAccrualFailureDetector@3895fa5b] heartbeat interval is growing too large: 2839 millis` – Prog_G Jul 17 '18 at 11:45
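
Following up on the GC suggestion above: long stop-the-world pauses would show up directly in GC logs. A sketch of the JVM flags that make such pauses visible (Java 8 syntax; the jar name and log path are placeholders):

```
java -Xloggc:gc.log \
     -XX:+PrintGCDetails \
     -XX:+PrintGCDateStamps \
     -XX:+PrintGCApplicationStoppedTime \
     -jar your-cluster-node.jar
```

Stop-the-world pauses of a few seconds lining up with the warning timestamps would match the 2839 ms heartbeat interval reported above. On Java 9+ the equivalent is `-Xlog:gc*:file=gc.log`.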

0 Answers