
When I stop the nodes of my replica set and start them up again, the primary node goes into the "RECOVERING" state.

I have a replica set created and running without authorization. In order to use authorization I have added users with "db.createUser(...)" and enabled authorization in the configuration file:

security:
   authorization: "enabled"
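
For reference, the users were created roughly along these lines via the mongo shell (the user name, password and roles shown here are placeholders, not my real values):

use admin
db.createUser({
    user: "admin",                               // placeholder user name
    pwd: "secret",                               // placeholder password
    roles: [ { role: "root", db: "admin" } ]     // superuser role for administration
})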

Before stopping the replica set (even after restarting the cluster without adding the security params), rs.status() shows:

{
        "set" : "REPLICASET",
        "date" : ISODate("2016-09-08T09:57:50.335Z"),
        "myState" : 1,
        "term" : NumberLong(7),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 0,
                        "name" : "192.168.1.167:27017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 301,
                        "optime" : {
                                "ts" : Timestamp(1473328390, 2),
                                "t" : NumberLong(7)
                        },
                        "optimeDate" : ISODate("2016-09-08T09:53:10Z"),
                        "electionTime" : Timestamp(1473328390, 1),
                        "electionDate" : ISODate("2016-09-08T09:53:10Z"),
                        "configVersion" : 1,
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "192.168.1.168:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 295,
                        "optime" : {
                                "ts" : Timestamp(1473328390, 2),
                                "t" : NumberLong(7)
                        },
                        "optimeDate" : ISODate("2016-09-08T09:53:10Z"),
                        "lastHeartbeat" : ISODate("2016-09-08T09:57:48.679Z"),
                        "lastHeartbeatRecv" : ISODate("2016-09-08T09:57:49.676Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "192.168.1.167:27017",
                        "configVersion" : 1
                },
                {
                        "_id" : 2,
                        "name" : "192.168.1.169:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 295,
                        "optime" : {
                                "ts" : Timestamp(1473328390, 2),
                                "t" : NumberLong(7)
                        },
                        "optimeDate" : ISODate("2016-09-08T09:53:10Z"),
                        "lastHeartbeat" : ISODate("2016-09-08T09:57:48.680Z"),
                        "lastHeartbeatRecv" : ISODate("2016-09-08T09:57:49.054Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "192.168.1.168:27017",
                        "configVersion" : 1
                }
        ],
        "ok" : 1
}

In order to start using this configuration, I have stopped each node as follows:

[root@n--- etc]# mongo --port 27017 --eval 'db.adminCommand("shutdown")'
MongoDB shell version: 3.2.9
connecting to: 127.0.0.1:27017/test
2016-09-02T14:26:15.784+0200 W NETWORK  [thread1] Failed to connect to 127.0.0.1:27017, reason: errno:111 Connection refused
2016-09-02T14:26:15.785+0200 E QUERY    [thread1] Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed :
connect@src/mongo/shell/mongo.js:231:14

After this shutdown, I have confirmed that the process does not exist by checking the output from ps -ax | grep mongo.
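
For reference, once authorization is enabled the same clean shutdown needs credentials; something along these lines should work (user name and password are placeholders):

mongo admin --port 27017 -u "admin" -p "secret" --eval 'db.shutdownServer()'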

But when I start the nodes again and log in with my credentials, rs.status() now shows:

{
        "set" : "REPLICASET",
        "date" : ISODate("2016-09-08T13:19:12.963Z"),
        "myState" : 3,
        "term" : NumberLong(7),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 0,
                        "name" : "192.168.1.167:27017",
                        "health" : 1,
                        "state" : 3,
                        "stateStr" : "RECOVERING",
                        "uptime" : 42,
                        "optime" : {
                                "ts" : Timestamp(1473340490, 6),
                                "t" : NumberLong(7)
                        },
                        "optimeDate" : ISODate("2016-09-08T13:14:50Z"),
                        "infoMessage" : "could not find member to sync from",
                        "configVersion" : 1,
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "192.168.1.168:27017",
                        "health" : 0,
                        "state" : 6,
                        "stateStr" : "(not reachable/healthy)",
                        "uptime" : 0,
                        "optime" : {
                                "ts" : Timestamp(0, 0),
                                "t" : NumberLong(-1)
                        },
                        "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                        "lastHeartbeat" : ISODate("2016-09-08T13:19:10.553Z"),
                        "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
                        "pingMs" : NumberLong(0),
                        "authenticated" : false,
                        "configVersion" : -1
                },
                {
                        "_id" : 2,
                        "name" : "192.168.1.169:27017",
                        "health" : 0,
                        "state" : 6,
                        "stateStr" : "(not reachable/healthy)",
                        "uptime" : 0,
                        "optime" : {
                                "ts" : Timestamp(0, 0),
                                "t" : NumberLong(-1)
                        },
                        "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                        "lastHeartbeat" : ISODate("2016-09-08T13:19:10.552Z"),
                        "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
                        "pingMs" : NumberLong(0),
                        "authenticated" : false,
                        "configVersion" : -1
                }
        ],
        "ok" : 1
}

Why? Perhaps the shutdown command is not a good way to stop mongod; however, I also tested using 'kill pid', and the restart ends up in the same state.

In this state I don't know how to repair the cluster; I have started over (removing the dbpath files and reconfiguring the replica set), and I have also tried '--repair', but it has not worked.

Info about my system:

  • Mongo version: 3.2
  • I start the process as root; perhaps it should be started as the 'mongod' user?
  • This is my start command: mongod --config /etc/mongod.conf
  • The keyFile configuration does not work; if I add "--keyFile /path/to/file" mongod hangs at
    "about to fork child process, waiting until server is ready for connections." The file has full permissions, but mongod cannot use the keyFile (see the config sketch after this list).
  • An example of the "net.bindIp" configuration, from mongod.conf on one machine:

    net:
      port: 27017
      bindIp: 127.0.0.1,192.168.1.167
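
For context, with security enabled the relevant sections of mongod.conf would look roughly like this (the keyFile path is a placeholder; replSetName matches the set name shown by rs.status() above):

security:
  authorization: "enabled"
  keyFile: /etc/keyfile          # shared secret file, identical on every member
replication:
  replSetName: "REPLICASET"
net:
  port: 27017
  bindIp: 127.0.0.1,192.168.1.167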
    
MrElephant
  • What did you do after enabling authentication? How do your replica set members authenticate their membership to the replica set? After enabling authentication, you can't connect to an instance without appropriate credentials, unless you're connecting from localhost – Ali Dehghani Sep 02 '16 at 09:16
  • Did you try this https://docs.mongodb.com/manual/core/security-internal-authentication/ – Ali Dehghani Sep 02 '16 at 09:19
  • After enabling authentication in all the config files I start the replica set (with mongod --config /etc/mongod.conf) and then I log in with the credentials that I created before shutting down the cluster. keyFile authentication is optional, that is not my issue; I only want access by user/password – MrElephant Sep 02 '16 at 09:35
  • You should add some extra configuration, because not only do clients need to be able to authenticate with the replica set, the replica set nodes also need to be able to authenticate with each other. So each replica set node authenticates itself with the others as a special internal user with enough privileges – Ali Dehghani Sep 02 '16 at 09:43
  • For example you can add a key file to each replica set member and put the password of the admin user in it. Then start the mongod instance with the `--keyFile /path/to/keyfile` arg. – Ali Dehghani Sep 02 '16 at 09:45
  • If I add --keyFile /path/to/keyfile, mongod does not start; it shows the message "about to fork child process, waiting until server is ready for connections." I have to start it without this option – MrElephant Sep 02 '16 at 12:45
  • According to your log, the "stop" command fails, so we cannot take anything you said you did as confirmed. Thus it's impossible to answer your question due to a lack of hard evidence. Follow the docs and return if/when you have something more concrete to show. – ivan_pozdeev Sep 02 '16 at 14:28
  • Then what is your suggestion about the 'shutdown' command? After executing the shutdown command the mongod process stops. – MrElephant Sep 03 '16 at 08:43
  • Can you show the full status of your replica set? You get it with the command [rs.status()](https://docs.mongodb.com/manual/reference/method/rs.status/). – Vince Bowdren Sep 07 '16 at 16:11
  • The replica set status you have posted says everything is healthy. Does that mean it has been repaired and is now fine? Was it simply that the nodes took a while to go through the RECOVERING state to achieve full health? – Vince Bowdren Sep 08 '16 at 10:31
  • Before stopping the cluster rs.status() is OK, but when I stop/start the mongod services rs.status() shows a different state. – MrElephant Sep 08 '16 at 13:23
  • That last status makes it look like _only one_ node is running at all. In that case, there aren't enough nodes to hold an [election](https://docs.mongodb.com/manual/core/replica-set-high-availability/) so the replica set [will not be available](https://docs.mongodb.com/manual/tutorial/troubleshoot-replica-sets/). What happens when you reboot a second node? Do they manage to contact each other, hold an election and re-establish the replica set in full health? – Vince Bowdren Sep 08 '16 at 13:33
  • The second node is active in that state, but it is not reachable. Maybe because when I relaunch mongod (with the security params) they cannot communicate with each other – MrElephant Sep 08 '16 at 14:26
  • So if you restart the nodes _without changing any configuration_ do they reinstate the replica set successfully? – Vince Bowdren Sep 08 '16 at 16:17
  • Yes, if there are no configuration changes I can start them and all the states are OK. I wrote the solution below. – MrElephant Sep 08 '16 at 20:36
  • Sounds like there never was a problem with your replica set then; it was a problem with your authentication after all. – Vince Bowdren Sep 09 '16 at 14:06

3 Answers


Finally I resolved the problem: for a replica set with authorization enabled, a keyFile is MANDATORY so that all the nodes can authenticate to each other. When I specified the keyFile, mongod returned an error because mongod.log indicated:

I ACCESS   [main] permissions on /etc/keyfile are too open

The keyfile must have 400 as its permissions. Thanks @Saleem

When people said "you can add a keyFile" I was thinking of it as an optional parameter, but it is mandatory.
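
For completeness, the fix is along these lines (the /etc/keyfile path comes from the log message above; adjust the owner to whichever user actually runs mongod):

chown mongod:mongod /etc/keyfile     # or root:root if mongod runs as root
chmod 400 /etc/keyfile               # read-only for the owner, no access for anyone else
ls -l /etc/keyfile                   # should now show -r--------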

MrElephant

Note: This solution is Windows specific but can be ported to *nix based systems easily.

You'll need to take steps in sequence. First of all, start your mongod instances.

start "29001" mongod --dbpath "C:\data\db\r1" --port 29001
start "29002" mongod --dbpath "C:\data\db\r2" --port 29002
start "29003" mongod --dbpath "C:\data\db\r3" --port 29003 

Connect with the mongo shell to each node and create an administrator user. I prefer creating a superuser.

> use admin
> db.createUser({user: "root", pwd: "123456", roles:["root"]})

You may create other users as deemed necessary.

Create a key file. See the documentation for valid key file contents.

Note: On *nix based systems, set the key file's permissions to 400 (chmod 400).

In my case, I created the key file as:

echo mysecret==key > C:\data\key\key.txt
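
On *nix, an equivalent would be something like the following (the path is just an example; openssl is only one way to generate a random shared secret):

openssl rand -base64 756 > /etc/keyfile   # random content for the shared key file
chmod 400 /etc/keyfile                    # keyFile must not be readable by group/others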

Now restart your MongoDB servers with --keyFile and --replSet flags enabled.

start "29001" mongod --dbpath "C:\data\db\r1" --port 29001 --replSet "rs1" --keyFile C:\data\key\key.txt
start "29002" mongod --dbpath "C:\data\db\r2" --port 29002 --replSet "rs1" --keyFile C:\data\key\key.txt
start "29003" mongod --dbpath "C:\data\db\r3" --port 29003 --replSet "rs1" --keyFile C:\data\key\key.txt

Once all mongod instances are up and running, connect to any one of them with authentication.

mongo --port 29001 -u "root" -p "123456" --authenticationDatabase "admin"

Initiate the replica set:

> use admin
> rs.initiate()
rs1:PRIMARY> rs.add("localhost:29002")
{ "ok" : 1 }
rs1:PRIMARY> rs.add("localhost:29003")
{ "ok" : 1 }

Note: You may need to replace localhost with machine name or IP address.
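
Alternatively, the whole member list can be passed to rs.initiate() in one call instead of adding members one by one (hosts and ports here match the example above):

> rs.initiate({
    _id: "rs1",
    members: [
        { _id: 0, host: "localhost:29001" },
        { _id: 1, host: "localhost:29002" },
        { _id: 2, host: "localhost:29003" }
    ]
})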

Saleem

Nodes should be shut down one at a time, so that another secondary member can be elected primary. A restarted node will be in the RECOVERING state while it syncs from the other members. Shutting them down one by one means you will not need to re-add the nodes.
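
A minimal sketch of such a rolling restart, assuming the config file already contains the new settings (port, credentials and config path are placeholders):

# on each SECONDARY, one at a time: shut it down, restart it, wait until it is SECONDARY again
mongo admin --port 27017 -u "admin" -p "secret" --eval 'db.shutdownServer()'
mongod --config /etc/mongod.conf

# on the PRIMARY, last: step it down so a secondary is elected, then restart it the same way
mongo admin --port 27017 -u "admin" -p "secret" --eval 'rs.stepDown()'
mongo admin --port 27017 -u "admin" -p "secret" --eval 'db.shutdownServer()'
mongod --config /etc/mongod.conf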

Aayushi