
We have an issue with the ambari-metrics-collector service (HDP cluster version 2.6.4, 8 nodes).

The Ambari Metrics Collector service either fails to start, or starts and then fails after a few seconds.


Details of the installed metrics packages:

rpm -qa | grep metrics
ambari-metrics-grafana-2.6.1.0-143.x86_64
ambari-metrics-monitor-2.6.1.0-143.x86_64
ambari-metrics-collector-2.5.0.3-7.x86_64
ambari-metrics-hadoop-sink-2.6.1.0-143.x86_64
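
The listing above shows the collector package at 2.5.0.3-7 while the other three packages are at 2.6.1.0-143. Whether that mismatch is related to the failure is not established here, but it is easy to surface. A minimal sketch, assuming package names in `rpm -qa` form on stdin; the function names `ams_versions` and `ams_distinct_versions` are made up for illustration:

```shell
# Hypothetical helpers (not part of Ambari): read package names in
# `rpm -qa` form (name-version-release.arch) on stdin.

# Print "version-release name" per package, sorted so that packages
# at the same version end up next to each other.
ams_versions() {
  sed -E 's/^(.*)-([0-9][^-]*-[^-]+)\.[^.]+$/\2 \1/' | sort
}

# Print how many distinct version-release strings appear.
ams_distinct_versions() {
  ams_versions | awk '{ print $1 }' | sort -u | wc -l | tr -d ' '
}
```

Usage would be `rpm -qa | grep ambari-metrics | ams_versions`; a count above 1 from `ams_distinct_versions` means the packages are not all at the same level.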

All machines run RHEL 7.2.

We performed the following steps to try to resolve the problem:

1. Restart the metrics-collector service

su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ stop'
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ start'

or

ambari-metrics-collector stop 
ambari-metrics-collector start

2. Restart ambari-metrics-monitor on all nodes

 ambari-metrics-monitor stop
 ambari-metrics-monitor start

3. Clean the folder /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/

mv /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/zookeeper_0 /tmp/bck/zookeeper/

Then restart the metrics-collector service.

4. Tune the metrics-collector parameters according to https://docs.cloudera.com/HDPDocuments/Ambari-2.2.1.0/bk_ambari_reference_guide/content/_ams_general_guidelines.html

We updated the following parameters in Ambari:

metrics_collector_heap_size=1024
hbase_regionserver_heapsize=1024
hbase_master_heapsize=512
hbase_master_xmn_size=128

Status for now: steps 1-4 did not help.

From the logs we can see the following:

log file - ambari-metrics-collector.log

2020-06-25 09:06:14,474 WARN org.apache.zookeeper.ClientCnxn: Session 0x172eab71f310002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2020-06-25 09:06:14,575 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=master02.sys671.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/meta-region-server

log file - hbase-ams-master-master02.sys671.com.log

2020-06-25 09:38:18,799 WARN  [RS:0;master02:51842-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Session 0x172ead5d73a0004 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2020-06-25 09:38:20,437 INFO  [main-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Opening socket connection to server master02.sys671.com/23.2.35.171:61181. Will not attempt to authenticate using SASL (unknown error)
2020-06-25 09:38:20,438 WARN  [main-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Session 0x172ead5d73a0002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
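
Both logs show the client being refused on the same endpoint, master02.sys671.com:61181, which appears to be the ZooKeeper instance AMS runs for its own HBase rather than the cluster-wide one (an assumption based on the /ams-hbase-unsecure znode and the hbase-tmp/zookeeper directory). A small sketch for pulling that endpoint out of a log so it can be probed directly; `zk_quorum_from_log` is a hypothetical helper name, the log path in the comment is an assumption, and the `nc` probe assumes `nc` is installed and that the server answers the ZooKeeper `ruok` four-letter command:

```shell
# Hypothetical helper: read log text on stdin and print each distinct
# host:port that follows "quorum=".
zk_quorum_from_log() {
  grep -o 'quorum=[^, ]*' | cut -d= -f2 | sort -u
}

# Example use on the collector host (log path is an assumption):
#   zk_quorum_from_log < /var/log/ambari-metrics-collector/ambari-metrics-collector.log
# Then probe the endpoint; a healthy ZooKeeper replies "imok":
#   echo ruok | nc master02.sys671.com 61181
```

If the probe is refused as well, nothing is accepting connections on that port, and the HBase master log is the next place to look for why.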

We also see that the collector port (timeline.metrics.service.webapp.address) is not listening:

netstat -tulpn  | grep  6188
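
Note that `grep 6188` matches any line containing those digits, including a PID or a longer port such as 61881. A slightly stricter sketch that matches only the local-address column; `port_in_netstat` is a hypothetical helper, and the column position assumes the usual Linux `netstat -tulpn` layout:

```shell
# Hypothetical helper: read `netstat -tulpn` output on stdin and succeed
# only if some local address (4th column) ends in ":<port>".
port_in_netstat() {
  awk -v port="$1" '
    $4 ~ (":" port "$") { found = 1 }
    END { exit found ? 0 : 1 }
  '
}

# Example: check the collector web port and the port from the logs.
#   netstat -tulpn | port_in_netstat 6188  && echo "6188 listening"
#   netstat -tulpn | port_in_netstat 61181 && echo "61181 listening"
```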

Any advice on how to continue from this point?

We would appreciate any help with this problem.

jessica
  • Does the Ambari Metrics node have enough free resources to take advantage of bumping up the memory settings? What is the condition of that node? How many services, how much ram, how much free ram, etc... – steven-matison Jun 25 '20 at 12:22
  • Also error looks like zookeeper connection failed? You may want to inspect zookeeper log during investigation of the root cause. – steven-matison Jun 25 '20 at 12:24
  • On the machine where metrics-collector is installed we have 24G of free memory, so this isn't the problem – jessica Jun 25 '20 at 12:44
  • I looked at the ZooKeeper server logs and did not see any errors – jessica Jun 25 '20 at 12:45
  • Any hint on how to continue? – jessica Jun 25 '20 at 14:35
  • the other guy posted same on cloudera, his post had a code snippet that said zk connection issues to ams-hbase... you have to keep following trail to find the culprit – steven-matison Jun 25 '20 at 15:39
  • How did you end up fixing this issue? – runr May 30 '22 at 16:33
