
We are running a 3-node cluster (data in memory) on version 4.2.0.4 CE on AWS. We recently noticed that writes were not happening and found one node down. Ideally, writes should still go through with one node down, but they only resumed once we brought the downed node back up. We access the Aerospike cluster from outside AWS.

The following INFO log is being printed continuously on two of the nodes:

INFO (hb): (hb.c:4319) found redundant connections to same node, fds 101 31 - choosing at random

On the third node, no such logs are being printed and asadm stats show no reads or writes happening on it. We have also observed that the records are unevenly distributed across the nodes.

Below is the relevant part of the configuration file; the network stanza is consistent across all three servers.

network {
    service {
            address any
            port 3000
    }

    heartbeat {

            mode mesh
            port 3002 # Heartbeat port for this node.

            # List one or more other nodes, one ip-address & port per line:
            mesh-seed-address-port 13.xxx.xxx.xxx 3002
            mesh-seed-address-port 13.xxx.xxx.xxx 3002
            mesh-seed-address-port 13.xxx.xxx.xxx 3002

            interval 150
            timeout 10
    }

    fabric {
            port 3001
    }

    info {
            port 3003
    }
}
namespace smpa {
    replication-factor 2
    memory-size 12G
    storage-engine memory
    single-bin true
    high-water-memory-pct 80
    stop-writes-pct 90
}

$ asadm -e "show stat like stop_writes"

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics (2019-01-24 12:24:42 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                              :   node5.domain.com:3000   node6.domain.com:3000   node7.domain.com:3000   
cluster_clock_skew_stop_writes_sec:   0                               0                               0                               

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa Namespace Statistics (2019-01-24 12:24:42 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                  :   node5.domain.com:3000   node6.domain.com:3000   node7.domain.com:3000   
clock_skew_stop_writes:   false                           false                           false                           
stop_writes           :   false                           false                           false                           

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~test Namespace Statistics (2019-01-24 12:24:42 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                  :   node5.domain.com:3000   node6.domain.com:3000   node7.domain.com:3000   
clock_skew_stop_writes:   false                           false                           false                           
stop_writes           :   false                           false                           false   

$ asadm -e "show stat like x_partitions"

Seed:        [('127.0.0.1', 3000, None)]
Config_file: /home/web/.aerospike/astools.conf, /etc/aerospike/astools.conf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa Namespace Statistics (2019-01-24 12:30:01 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                           :   node5.domain.com:3000   node6.domain.com:3000   node7.domain.com:3000   
migrate_rx_partitions_active   :   0                               0                               0                               
migrate_rx_partitions_initial  :   0                               2749                            0                               
migrate_rx_partitions_remaining:   0                               0                               0                               
migrate_tx_partitions_active   :   0                               0                               0                               
migrate_tx_partitions_imbalance:   0                               0                               0                               
migrate_tx_partitions_initial  :   1396                            0                               1353                            
migrate_tx_partitions_remaining:   0                               0                               0

$ asadm -e "show pmap"

Seed:        [('127.0.0.1', 3000, None)]
Config_file: /home/web/.aerospike/astools.conf, /etc/aerospike/astools.conf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Partition Map Analysis (2019-01-24 12:33:39 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     Cluster   Namespace                            Node      Primary    Secondary         Dead   Unavailable   
         Key           .                               .   Partitions   Partitions   Partitions    Partitions   
BEF4A1479187   smpa        node6.domain.com:3000         1382         1367            0             0   
BEF4A1479187   smpa        node7.domain.com:3000         1358         1342            0             0   
BEF4A1479187   smpa        node5.domain.com:3000         1356         1387            0             0   
BEF4A1479187   test        node6.domain.com:3000         1382            0            0             0   
BEF4A1479187   test        node7.domain.com:3000         1358            0            0             0   
BEF4A1479187   test        node5.domain.com:3000         1356            0            0             0   
Number of rows: 6

$ asadm -e "show stat like objects"

Seed:        [('127.0.0.1', 3000, None)]
Config_file: /home/web/.aerospike/astools.conf, /etc/aerospike/astools.conf

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics (2019-01-24 12:34:09 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                       :   node5.domain.com:3000   node6.domain.com:3000   node7.domain.com:3000   
objects                    :   6478039                         6485049                         9265180                         
sindex_gc_objects_validated:   0                               0                               0                               

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa Namespace Statistics (2019-01-24 12:34:09 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                 :   node5.domain.com:3000   node6.domain.com:3000   node7.domain.com:3000   
evicted_objects      :   0                               0                               0                               
expired_objects      :   0                               0                               0                               
master_objects       :   2944752                         3456686                         4712696                         
non_expirable_objects:   2943325                         3455765                         4711880                         
non_replica_objects  :   0                               0                               0                               
objects              :   6478039                         6485049                         9265180                         
prole_objects        :   3533287                         3028363                         4552484                         

$ asadm -e "info"

Seed:        [('127.0.0.1', 3000, None)]
Config_file: /home/web/.aerospike/astools.conf, /etc/aerospike/astools.conf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Network Information (2019-01-25 06:54:14 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                    Node               Node                    Ip       Build   Cluster   Migrations        Cluster     Cluster         Principal   Client     Uptime   
                                                       .                 Id                     .           .      Size            .            Key   Integrity                 .    Conns          .   
ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   BB9BE0093E32B0A    xx.xxx.xxx.xxx:3000   C-4.2.0.4         3      0.000     3ADA511969DD   True        BB9EAC87115AD0A       59   01:09:24   
ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   *BB9EAC87115AD0A   xx.xxx.xxx.xxx:3000   C-4.2.0.4         3      0.000     3ADA511969DD   True        BB9EAC87115AD0A       59   01:05:17   
ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   BB9D4175485B10A    xx.xxx.xxx.xxx:3000   C-4.2.0.4         3      0.000     3ADA511969DD   True        BB9EAC87115AD0A       59   01:14:17   
Number of rows: 3

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Usage Information (2019-01-25 06:54:14 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Namespace                                                       Node     Total   Expirations,Evictions     Stop       Disk    Disk     HWM   Avail%        Mem     Mem    HWM      Stop   
        .                                                          .   Records                       .   Writes       Used   Used%   Disk%        .       Used   Used%   Mem%   Writes%   
smpa        ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   2.716 M   (0.000,  0.000)         false         N/E   N/E     50      N/E      2.774 GB   24      80     90        
smpa        ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   2.648 M   (0.000,  0.000)         false         N/E   N/E     50      N/E      2.706 GB   23      80     90        
smpa        ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   2.709 M   (0.000,  0.000)         false         N/E   N/E     50      N/E      2.767 GB   24      80     90        
smpa                                                                   8.074 M   (0.000,  0.000)                  0.000 B                             8.247 GB                            
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Namespace Object Information (2019-01-25 06:54:14 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Namespace                                                       Node     Total     Repl                       Objects                   Tombstones             Pending   Rack   
        .                                                          .   Records   Factor    (Master,Prole,Non-Replica)   (Master,Prole,Non-Replica)            Migrates     ID   
        .                                                          .         .        .                             .                            .             (tx,rx)      .   
smpa        ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   2.716 M   2        (1.375 M, 1.341 M, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0      
smpa        ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   2.648 M   2        (1.311 M, 1.337 M, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0      
smpa        ec2-xx-xxx-xxx-xxx.ap-south-1.compute.amazonaws.com:3000   2.709 M   2        (1.351 M, 1.359 M, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)     0      
smpa                                                                   8.074 M            (4.037 M, 4.037 M, 0.000)     (0.000,  0.000,  0.000)      (0.000,  0.000)            

$ asadm -e "show stat like objects"

Seed:        [('127.0.0.1', 3000, None)]
Config_file: /home/web/.aerospike/astools.conf, /etc/aerospike/astools.conf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa d190122 Set Statistics (2019-01-25 07:07:30 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE   :   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   
objects:   672400                                                     662491                                                     671131                                                     

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa d190121 Set Statistics (2019-01-25 07:07:30 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE   :   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   
objects:   376064                                                     347232                                                     374700                                                     

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa d190124 Set Statistics (2019-01-25 07:07:30 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE   :   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   
objects:   629323                                                     617983                                                     628214                                                     

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa d190123 Set Statistics (2019-01-25 07:07:30 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE   :   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   
objects:   739556                                                     726447                                                     736871                                                     

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa d190125 Set Statistics (2019-01-25 07:07:30 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE   :   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   
objects:   313800                                                     308814                                                     313320                                                     

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics (2019-01-25 07:07:30 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                       :   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   
objects                    :   2731143                                                    2662967                                                    2724236                                                    
sindex_gc_objects_validated:   0                                                          0                                                          0                                                          

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~smpa Namespace Statistics (2019-01-25 07:07:30 UTC)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                 :   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   ec2-xx.xxx.xxx.xxx.ap-south-1.compute.amazonaws.com:3000   
evicted_objects      :   0                                                          0                                                          0                                                          
expired_objects      :   0                                                          0                                                          0                                                          
master_objects       :   1382413                                                    1318579                                                    1358181                                                    
non_expirable_objects:   1382525                                                    1318691                                                    1358445                                                    
non_replica_objects  :   0                                                          0                                                          0                                                          
objects              :   2731143                                                    2662967                                                    2724236                                                    
prole_objects        :   1348730                                                    1344388                                                    1366055                                                    
Carbonrock
  • `asadm -e "show stat" | grep "master_objects"` gives: master_objects : 3149208 4409526 2641873 – Carbonrock Jan 23 '19 at 11:57
  • Could you provide the output for the following 4 commands: 1) `asadm -e "show stat like stop_writes"` 2) `asadm -e "show stat like x_partitions"` 3) `asadm -e "show pmap"` 4) `asadm -e "show stat like objects"` – kporter Jan 23 '19 at 22:20
  • Aerospike is running on AWS. We are accessing the Aerospike cluster from outside AWS. – Carbonrock Jan 24 '19 at 12:51
  • Hm, could you also provide the output of: `asadm -e "info"` – kporter Jan 24 '19 at 19:14
  • @kporter, after adding the domain-to-IP mapping in /etc/hosts and restarting the servers, the records across the nodes got balanced. But the "found redundant connections to same node" INFO is still being thrown continuously. Updated the `asadm -e "info"` output in the question – Carbonrock Jan 25 '19 at 07:03
  • @kporter, added the latest `asadm -e "show stat like objects"` output in the question. – Carbonrock Jan 25 '19 at 07:12
  • I'm not certain what caused the balance discrepancy, but I am certain that the IP mapping wouldn't affect the balance. Do you have logs from before you restarted these servers? I suspect the node-id may have changed across the restart; the node-id is logged periodically as 'NODE-ID'. These node-ids are used to determine a deterministic ordering for each partition. If the node-ids did change, it would explain why the records appear balanced across nodes, though I suspect they aren't across partitions... – kporter Jan 25 '19 at 20:51
  • ... I suspect, as @pgupta mentioned, that you may have restarted two of these nodes simultaneously in the past. Since these nodes are not configured to be backed by storage you would have lost all records that were replicated between the two restarted nodes and in a 3 node cluster would appear as the imbalance you have described. – kporter Jan 25 '19 at 20:53
  • BTW, please don't create multiple posts on multiple platforms for a single question. We are equally active on our forums as well as here on SO. https://discuss.aerospike.com/t/cross-posting-to-from-other-sites-such-as-stack-overflow/4526 – kporter Jan 25 '19 at 21:03
  • @kporter, I haven't done so. It was running fine for the past 6 months; we suddenly noticed this behaviour. Even if we had restarted 2 nodes at the same time, we could have lost a few records, but the existing records should have balanced across all nodes. I was observing that no logs were being written on one node and that reads/writes were not happening on that node, which was visible in asadm stats. Now the reads/writes are happening across all nodes and the records are balanced. But "INFO (hb): (hb.c:4319) found redundant connections to same node, fds 32 128 - choosing at random" is continuously logged on all servers. – Carbonrock Jan 28 '19 at 06:10
  • The hb message is unrelated to the record balance. That message can occur from various network issues and means that hb has found multiple established links to another node and is now choosing one at random and closing the others. – kporter Jan 28 '19 at 06:29
  • Records are hashed to partitions. So in the scenario I described, the records wouldn't be balanced even though the partitions are, which is what we observe from the output you have provided. – kporter Jan 28 '19 at 06:30
  • Did you get a chance to see if the node-id reported in the logs changed across the restart? – kporter Jan 28 '19 at 06:31
  • Those older logs were archived; I will get them and let you know. Meanwhile, I am seeing some warnings that I had missed: "Jan 28 2019 05:22:40 GMT: WARNING (hb): (hb.c:4845) could not create heartbeat connection to node {13.xxx.xxx.002:3002}". This refers to the second node, and the same warning appears on all the nodes, even on the second node itself, along with "INFO (hb): (hb.c:4319) found redundant connections to same node, fds 32 128 - choosing at random". Please help; how do I get rid of these WARNINGs & INFOs? – Carbonrock Jan 28 '19 at 08:46
  • @Carbonrock Could you please run the following command on all the nodes 2-3 times, spaced a few seconds apart: "asinfo -v 'dump-hb:verbose=true'". This dumps heartbeat debugging details to the aerospike log file. Once this is done, please paste the output of "grep HB /var/log/aerospike/aerospike.log" from all the nodes. – Ashish Shinde Jan 29 '19 at 10:02
  • @Ashish-Shinde Thanks for your response. Please find it in this link: https://www.dropbox.com/s/5jmjykzgd1juytw/as_hb.out?dl=0 – Carbonrock Jan 29 '19 at 11:42
  • Are the 13.x.x.x addresses external/public/NATed IPs? If so, they are not assigned to a network interface on the nodes, i.e., "ip a" or ifconfig will not list these IPs (see the sketch just after these comments). If that is indeed the case, then Aerospike, as a design choice, does not support NATed/virtual IPs for intra-cluster communication. The cluster would function, but the heartbeat system keeps waiting for some node to report its heartbeat IP as one of the 13.x.x.x addresses used as mesh seeds. Using the private/real IPs (172.x.x.x) in the mesh-seed-address-port heartbeat stanza config should get rid of the logs. Note: use the access-address service config for clients to use the external IPs. – Ashish Shinde Jan 29 '19 at 18:20
  • @ashish-shinde, this resolved my issue. Thanks a lot to you, kporter and pgupta, who spent time resolving this issue. I have also added virtual to access-address: access-address 13.x.x.x virtual. – Carbonrock Jan 30 '19 at 07:15
  • @kporter, thanks a lot for your valuable time. – Carbonrock Jan 30 '19 at 07:18
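A quick way to confirm the situation Ashish Shinde describes, i.e. whether the 13.x.x.x addresses are actually bound to a network interface on the nodes (a rough sketch; the addresses shown are placeholders, not the real deployment values):

# A NATed/elastic IP does not show up on any interface of the instance
$ ip addr | grep "13\."
(no output)

# Only the loopback and the private VPC address are actually bound
$ ip addr | grep "inet "
inet 127.0.0.1/8 ...
inet 172.31.x.x/20 ...

If the mesh-seed addresses never match an address that any node reports as its own heartbeat address, the heartbeat system keeps opening new connections to those seeds, which is consistent with the recurring "found redundant connections" INFO lines.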

2 Answers


Check if the other two nodes are publishing a private IP address that is not accessible to the client, and only the one node (the one that went down) is publishing an accessible IP address (network stanza, service sub-context).
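One way to check which address each node is actually advertising to clients (a rough sketch; the exact info command output varies by server version, and the address shown is a placeholder):

# Ask each node which service address:port it publishes to clients
$ asinfo -v "service"
172.31.x.x:3000

# Compare the cluster-wide view
$ asadm -e "info network"

If a node only advertises its private 172.x VPC address while the clients sit outside AWS and can only reach the public 13.x address, that node is effectively unreachable for those clients.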

pgupta
  • Gupta, Thanks for your apt reply. Yes this is what is happening. But in the configuration file I have provided the public IP addresses. These three nodes are on AWS. Please help. – Carbonrock Jan 24 '19 at 10:13
  • Gupta, the network stanza for all 3 servers are consistent. Updated the Question. – Carbonrock Jan 24 '19 at 12:07
  • try specifying an explicit external access-address for each node in the service sub-context: service { access-address 13.xxx.xxx.xxx #external ip address of this node ...} – pgupta Jan 24 '19 at 14:40
  • also, can you share your smpa namespace stanza - is the storage-engine memory? did you bring two nodes down simultaneously and add them back in? – pgupta Jan 24 '19 at 20:38
  • @Gupta, added access-address and restarted the servers. But the "found redundant connections to same node" INFO is still being thrown continuously. – Carbonrock Jan 25 '19 at 07:16
  • but now can you still write if one node in question goes down? – pgupta Jan 25 '19 at 21:36
  • @Gupta, tested by bringing down each node one by one after confirming the data was in sync. Writes were not affected. Also please have a look at the comment I posted in reply to kporter above; those INFOs and warnings keep coming. Please help. – Carbonrock Jan 28 '19 at 10:05
  • This is getting into troubleshooting something with your deployment. Quick test would be to use latest CE version -- 4.5+? --- and see if it is Aerospike or your network causing the issue. – pgupta Jan 28 '19 at 16:49
  • thanks a lot for your valuable time. The fix given by ashish-shinde in the thread above resolved the issue. – Carbonrock Jan 30 '19 at 07:18

The issue was that I had provided NATed IPs for heartbeat communication. The private IPs should be used for "mesh-seed-address-port", and the "access-address" should point to the NATed IP if your clients are outside the network. Please go through the comment threads above for the full details.
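A sketch of the corrected network stanza along those lines (the 172.31.xxx.xxx private addresses and the 13.xxx.xxx.xxx public address are placeholders, not the actual deployment values):

network {
    service {
            address any
            port 3000
            # Public/NATed IP that clients outside AWS connect to,
            # as described in the comments above
            access-address 13.xxx.xxx.xxx virtual
    }

    heartbeat {
            mode mesh
            port 3002

            # Private VPC addresses of the nodes for intra-cluster heartbeat
            mesh-seed-address-port 172.31.xxx.xxx 3002
            mesh-seed-address-port 172.31.xxx.xxx 3002
            mesh-seed-address-port 172.31.xxx.xxx 3002

            interval 150
            timeout 10
    }

    # fabric and info stanzas unchanged
}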

Here is clear documentation on how to configure this on AWS EC2 instances: https://discuss.aerospike.com/t/aws-ec2-ip-addressing-for-aerospike/2424

Thanks a lot to kporter, pgupta & ashish-shinde for their valuable help.

Carbonrock