2

We have a basic two-node Cassandra cluster. Both nodes run version 3.9 with minimal configuration to enable clustering. One of the nodes is sending a wrong generation, which causes the other node to log this warning:

WARN  [GossipStage:1] Gossiper.java:1146 - received an invalid gossip generation for peer /10.3.185.234; local time = 1479469393, received generation = 1872927836

Node-1, which is causing the issue, shows this output from

nodetool gossipinfo

/10.3.185.234

generation: 1872927836

The epoch 1872927836 is a far-future date (Tue, 08 May 2029 09:43:56 GMT), so node-2 is legitimately discarding the packet. Can I somehow fix node-1 to send the correct generation?
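For what it's worth, you can confirm how far off a generation epoch is with GNU date:

date -u -d @1872927836     # the bogus generation  -> Tue, 08 May 2029
date -u -d @1479469393     # the local time logged -> Fri, 18 Nov 2016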

Martin Schröder
Arslan Mehboob

3 Answers

5

Fixed the issue by changing the gossip_generation value in the system.local table using cqlsh:

update system.local set gossip_generation = 1479472176 where key='local';

Restart the service after this change.
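In case it helps, here is a sketch of the full sequence on the affected node. Use the current epoch rather than the value shown (which is taken from the answer above); the service name cassandra is an assumption that depends on how Cassandra was installed:

date +%s                                   # pick a value at or near the current epoch
cqlsh -e "UPDATE system.local SET gossip_generation = 1479472176 WHERE key='local';"
nodetool flush                             # persist the updated row before restarting
sudo systemctl restart cassandra           # service name is an assumption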

Arslan Mehboob
3

We use Scylla (version 2.1.3-0.20180501.8e33e80ad) and were able to work around this without an entire cluster reboot.

Our cluster had recently been losing nodes: those nodes restarted and were not allowed to rejoin the cluster during the gossip phase of bootup. The reason was that nodes with status=UN (up and normal) were emitting the following error and not allowing the affected nodes to join the cluster during gossip. In our case the error message was:

Jul 04 01:54:17 host-10.3.7.77 scylla[30263]: [shard 0] gossip - received an invalid gossip generation for peer 10.3.7.7; local generation = 1526993447, received generation = 1562158865
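Converting the two epochs in that message makes the problem obvious (GNU date shown; the values come straight from the log line):

date -u -d @1526993447     # local generation    -> 22 May 2018
date -u -d @1562158865     # received generation -> 3 Jul 2019
echo $(( (1562158865 - 1526993447) > 86400 * 365 ))     # prints 1: the gap exceeds one year

The received generation is more than a year newer than the stored local generation, which is exactly the condition rejected by the check shown further below.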

Now let's go into the details and context of the above error message:

  • Every node is configured with a list of seeds, with which it gossips to gather cluster info during bootup.
  • On bootup it creates a ‘generation’ number (an epoch timestamp) which it shares with the seed host during gossip.

gossiper.register(this->shared_from_this());

auto generation_number=db::system_keyspace::increment_and_get_generation().get0();

_gossiper.start_gossiping(generation_number, app_states, gms::bind_messaging_port(bool(do_bind))).get();

  • On its FIRST ever bootup the node sends its generation number to the seed, and the seed gossips with the others to pass the info on. The seed stores this generation number as a reference; this is the local generation in the error message above. In other words, the UN node 10.3.7.77 was saying that peer 10.3.7.7 was sending generation number 1562158865 (the received generation), while its stored reference was 1526993447. Note that 1526993447 is the epoch for May 22, 2018 and 1562158865 is the epoch for July 3, 2019, i.e. node 10.3.7.7 first started on May 22, 2018 and sent 1526993447 as its generation number at that time.
  • Since the difference between the two epochs is greater than one year, the UN node refuses to let the other node join:

int64_t MAX_GENERATION_DIFFERENCE = 86400 * 365;

if (local_generation > 2 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) { // assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself)

logger.warn("received an invalid gossip generation for peer ..... }

  • Now, during bootup, the logic of increment_and_get_generation is:
auto req = format("SELECT gossip_generation FROM system.{} WHERE key='{}'", LOCAL, LOCAL);
return qctx->qp().execute_internal(req).then([] (auto rs) {
    int generation;
    if (rs->empty() || !rs->one().has("gossip_generation")) {
        // seconds-since-epoch isn't a foolproof new generation
        // (where foolproof is "guaranteed to be larger than the last one seen at this ip address"),
        // but it's as close as sanely possible
        generation = service::get_generation_number();
    } else {
        // Other nodes will ignore gossip messages about a node that have a lower generation than previously seen.
        int stored_generation = rs->one().template get_as<int>("gossip_generation") + 1;
        int now = service::get_generation_number();
        if (stored_generation >= now) {
            slogger.warn("Using stored Gossip Generation {} as it is greater than current system time {}."
                        "See CASSANDRA-3654 if you experience problems", stored_generation, now);
            generation = stored_generation;
        } else {
            generation = now;
        }
    }
    auto req = format("INSERT INTO system.{} (key, gossip_generation) VALUES ('{}', ?)", LOCAL, LOCAL);
  • From the above logic, the server first looks up the generation number in the system.local table. If the value is empty, it generates a fresh number, i.e. the current time, since the logic for generating a generation number depends solely on the current time. If it is not empty, it compares it with the current time, uses the larger (i.e. more recent) value, and writes it back to the system.local table.

int get_generation_number() { .... auto now = high_resolution_clock::now().time_since_epoch(); int generation_number = duration_cast<seconds>(now).count(); .... }

  • So the generation number generated and sent by a node to the seed on bootup is, in general, close to the current time, but the generation number stored by the UN seed node as its local reference does not change.

  • To avoid an entire cluster reboot, we adopted the following approach in production, based on the code logic explained above.

    -- The root problem was that the local generation of the problematic node stored in the UN seed node was not changing (while the problematic node, on every reboot, sends a fresh generation number close to the current time).

    -- IDEA: Let's update the local generation of the problematic node stored in the UN node, so that the remote generation number being sent by the problematic node falls within one year of it.

    -- So how do we update this value in the UN seed node? We need to make the problematic node send a generation number (epoch) that falls within the one-year window of the local generation stored in the UN seed node. But since the code always takes the current time as the generation number, and the current time is July 2019, what can we do?

    -- We change the TIME back on the problematic node to a value that falls within one year of 1526993447. Choose an epoch towards the end of that one-year window, i.e. change the system time to, say, March 31, 2019 (epoch 1554030000) rather than October 2, 2018, and restart the node. The node will reboot and send a generation number of 1554030000 (as it looks up the system.local table), or the current time, which is March 31, 2019 anyway, to the seed.

    -- The UN seed node gets this value, validates that the remote generation number sent by the problematic node is within one year of May 22, 2018, and proceeds to update its reference (local generation):

else if (remote_generation > local_generation) {
    logger.trace("Updating heartbeat state generation to {} from {} for {}", remote_generation, local_generation, ep);
    // major state change will handle the update by inserting the remote state directly
    this->handle_major_state_change(ep, remote_state);
} ....

    -- We have now successfully updated the reference (local generation) of the problematic node stored in the UN seed node.

    -- Next we stop the problematic node, reset its time back to the current time and reboot it; the problematic node now sends the latest epoch, say July 4, 2019, i.e. epoch 1562215230.

    -- Since 1562215230 (the generation sent by the problematic node using the latest time) minus 1554030000 (the local reference now stored in the UN seed node) is less than one year, the problematic node is allowed to join the cluster.

    -- We advise you to choose an epoch/date towards the end of the one-year window, but still within it; the later the better, because a new one-year window starts from the date you choose and the problem is mitigated for that long. YEP, this problem occurs on long-running clusters, which means you need to do rolling restarts once in a while (at least yearly) to extend the one-year window.
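One way to pick such an epoch, given the local generation from the seed's error message, is sketched below; the 30-day safety margin is our own arbitrary choice, not something from the Scylla code.

local_gen=1526993447                       # local generation reported by the UN seed node
margin=$(( 30 * 86400 ))                   # arbitrary buffer so we stay inside the window
candidate=$(( local_gen + 86400 * 365 - margin ))
echo "$candidate"                          # epoch to use in the UPDATE / date -s steps below
date -u -d @"$candidate"                   # human-readable date to set the clock to (GNU date)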

Here are the steps of the procedure (a consolidated command sketch follows the list):

  1. If the problematic node is 10.3.7.7 and the error is reported on, say, 10.3.7.77 (a UN node), make sure the seed for 10.3.7.7 is 10.3.7.77, so that we guarantee it is talking to this node and we don't have to search the cluster to find out whom it is talking to. If the seed of the 7.7 node is different from the node reporting the error, then look at the error message printed by that seed node to decide which epoch to reset to. In our case, since I saw the error on 7.77, I changed the seed of 7.7 to the 7.77 node.

  2. Start the problematic node.

  3. The seed node should start printing the error. Capture the error message for our node and make a note of the local generation number, so we can choose a date to reset to. In our case the message was as follows:

Jul 04 01:54:17 host-10.3.7.77 scylla[30263]: [shard 0] gossip - received an invalid gossip generation for peer 10.3.7.7; local generation = 1526993447, received generation = 1562158865

  4. cqlsh to the problematic node 10.3.7.7 and update the generation number to an epoch within one year of 1526993447, BUT choose an epoch towards the end of that one-year window, like 1554030000 (March 31, 2019), rather than, say, July/October 2018, so that you get a longer new one-year window.

  5. On the problematic node, run the commands:

    5.1 update system.local set gossip_generation = 1554030000 where key='local';

    5.2 nodetool flush

  6. Stop the problematic node.

  7. Edit the config file and change the CQL port (native_transport_port) from 9042 to 9043 so that clients can't connect and insert data. Inserting data during this phase would create records with a March 2019 timestamp, which is not correct; this is a precaution to prevent data corruption.

  8. Change the system time, i.e. date -s '31 MAR 2019 11:03:25'

  9. Verify the system time has changed by running the date command.
  10. Start the problematic node and tail the logs of the UN seed node; the error should go away.
  11. Wait for some time (a few minutes is enough) for gossip to occur, and verify that the problematic node is now UN.
  12. Run 'nodetool status' on another node to check whether it is UN.
  13. You can keep tailing the logs of the UN seed node and check whether you still get the error. If you do see the error again, repeat the steps from the start; you might have missed something.
  14. Once the node is declared UN:

    14.1 Shut down the node.

    14.2 Change the CQL port (native_transport_port) back from 9043 to 9042 in the config file.

    14.3 RESET the system time on the box.

    14.4 Verify the system time is back to normal.

  15. Start up the node once you have changed back the time and the port. The node should still be UN.
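For reference, the core of the procedure on the problematic node boils down to something like the following. This is only a sketch: the config path /etc/scylla/scylla.yaml, the service name scylla-server and the chosen epoch are assumptions that you must adapt to your environment.

# on the problematic node (10.3.7.7)
cqlsh -e "UPDATE system.local SET gossip_generation = 1554030000 WHERE key='local';"
nodetool flush
sudo systemctl stop scylla-server              # service name is an assumption
sudo vi /etc/scylla/scylla.yaml                # set native_transport_port: 9043 to keep clients out
sudo date -u -s '31 MAR 2019 11:03:25'         # move the clock inside the one-year window
sudo systemctl start scylla-server

# on another node: wait for gossip, then confirm the problematic node shows UN
nodetool status

# once it is UN: stop the node, restore native_transport_port to 9042,
# bring the clock back to real time (e.g. re-enable NTP), then start the node again
sudo systemctl stop scylla-server
sudo timedatectl set-ntp true                  # or however your environment keeps time
sudo systemctl start scylla-server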

Confessions:

  1. Yes, we did this exercise in production. The node was considered dead anyway, so the risk was minimal: screwing up a dead node even more would not make a difference, and if the procedure failed we would sacrifice only that one node and be left with a full cluster reboot as the only remaining option.
  2. We scanned the Scylla code base (master branch) for usages of system time in cluster communication and found only two places, which gave us confidence that changing the system time would work. Also, by changing the CQL port to 9043, we eliminated any contamination of the existing data by clients.

Morals of the story:

  1. This happened in version 2.1 of Scylla, and as of today, July 4, 2019, the master branch of Scylla still has the same code logic, so it can occur in version 3 and above too.
  2. Every few months you should do rolling restarts of the nodes, so that they send a new generation number via gossip and the one-year window is extended.
  3. If you have a cluster that has been running for more than a year and a node is restarted, it will be affected by this error; the more node restarts that happen, the more the epidemic spreads.
  4. This can work for Cassandra if the code logic is the same, which I think it is.

References:

https://github.com/scylladb/scylla/blob/134b59a425da71f6dfa86332322cc63d47a88cd7/gms/gossiper.cc

https://github.com/scylladb/scylla/blob/94d2194c771dfc2fb260b00f7f525b8089092b41/service/storage_service.cc

https://github.com/scylladb/scylla/blob/077c639e428a643cd4f0ffe8e90874c80b1dc669/db/system_keyspace.cc

you can also find the above explanation/fix details on my blog https://mash213.wordpress.com/2019/07/05/scylla-received-an-invalid-gossip-generation-for-peer-how-to-resolve/

stack213
1

I don't know what effect setting it manually has, but another way to fix it is to restart the whole cluster at once. This probably would have worked for us (we found this after our fix).

The docs say to do multiple rolling restarts until it's fixed (which didn't work for us). The sledgehammer of restarting the whole cluster did work, however: all the generations were set appropriately.
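A full-cluster restart, in contrast to a rolling one, means stopping every node before starting any of them back up. A minimal sketch, assuming SSH access to each node and a systemd-managed cassandra service (both assumptions):

NODES="10.3.185.233 10.3.185.234"                                   # replace with your node IPs
for n in $NODES; do ssh "$n" sudo systemctl stop cassandra;  done   # stop ALL nodes first
for n in $NODES; do ssh "$n" sudo systemctl start cassandra; done   # then start them back up
# a rolling restart, by contrast, restarts one node at a time while the rest stay up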

user2966600