We have an Ambari cluster running HDP version 2.6.5.

The cluster has two NameNodes (one active, the other standby)

and 65 DataNode machines.

The problem is that the standby NameNode does not start, and in its log we can see the following:

2021-01-01 15:19:43,269 ERROR namenode.NameNode (NameNode.java:main(1783)) - Failed to start namenode.
java.io.IOException: There appears to be a gap in the edit log.  We expected txid 90247527115, but got txid 90247903412.
        at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:215)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:143)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:838)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:693)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:289)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1073)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:723)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:697)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:761)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1001)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:985)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1710)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1778)

For now the active NameNode is up, but the standby NameNode is down.

Regarding this error:

java.io.IOException: There appears to be a gap in the edit log.  We expected txid 90247527115, but got txid 90247903412.

What is the preferred way to fix this problem?

OneCricketeer
jessica

1 Answer

There are many possible causes for this. However, check this article; it should help.

Follow the exact steps, in the exact order, given in the article.

In short, the error means the NameNode metadata is damaged or corrupted.
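
To get a feel for how large the gap is before attempting any repair, the two txids in the message can be compared. The following is only a minimal sketch: the helper name `parse_edit_log_gap` is hypothetical, and the pattern assumes the message has exactly the wording shown in the log in the question.

```python
import re

# Assumption: the message wording matches the log line quoted in the question.
GAP_RE = re.compile(r"We expected txid (\d+), but got txid (\d+)")


def parse_edit_log_gap(line):
    """Return (expected, got, missing_count), or None if the line does not match."""
    match = GAP_RE.search(line)
    if match is None:
        return None
    expected = int(match.group(1))
    got = int(match.group(2))
    # Transactions expected..got-1 were not found by the NameNode.
    return expected, got, got - expected


line = (
    "java.io.IOException: There appears to be a gap in the edit log.  "
    "We expected txid 90247527115, but got txid 90247903412."
)
expected, got, missing = parse_edit_log_gap(line)
print(f"missing txids {expected}..{got - 1} ({missing} transactions)")
```

For the log line in the question this reports 376297 missing transactions, that is, the standby could not find txids 90247527115 through 90247903411 in the edit logs available to it.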

rikamamanus
If the edit logs of the fsimage are damaged, then we should run the command hadoop namenode -recover. What do you think? – jessica Jan 21 '21 at 05:52