I'm running Apache Ozone with high-availability Ozone Manager (OM) nodes on Kubernetes. I got leader auto-election working, but the Hadoop client running on my Accumulo pod won't resolve the new leader. Right now I have one Kubernetes service shared between the two manager pods.
According to the Hadoop documentation, you can specify a custom failover proxy provider in hdfs-site.xml using the key:
dfs.client.failover.proxy.provider.[cluster-name]
I suspect this key is ignored by the Ozone adapter for Hadoop, because even if I put random characters in that value, no related errors show up. This is a bummer, because I know the failover API works differently in Ozone, and I found a few classes that looked like they would do the job:
HadoopRpcOMFailoverProxyProvider, GrpcOMFailoverProxyProvider, and OMFailoverProxyProvider.
I get this error intermittently because I have two manager nodes running and one Kubernetes service routing to both of them:
root@accumulo-monitor-0:/opt/hadoop/etc/hadoop# hdfs dfs -ls /
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/ozone/share/ozone/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2022-09-30 21:17:22,892 INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException): OM:ozone-om-1 is not the leader. Could not determine the leader node.
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:187)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createLeaderErrorException(OzoneManagerProtocolServerSideTranslatorPB.java:174)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:167)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:133)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:123)
at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:466)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:552)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)
, while invoking $Proxy14.submitRequest over nodeId=null,nodeAddress=ozone-om.equitus:9862 after 1 failover attempts. Trying to failover after sleeping for 4000ms.
After searching online a bit, I noticed that the proxy request does not resolve the nodeId (it comes back null) or list any of the other nodes. A working failover should look more like this:
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:985)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:913)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882)
, while invoking $Proxy10.submitRequest over
{om1=nodeId=om1,nodeAddress=uma-1.uma.root.hwx.site:9862,
om3=nodeId=om3,nodeAddress=uma-3.uma.root.hwx.site:9862,
om2=nodeId=om2,nodeAddress=uma-2.uma.root.hwx.site:9862} after 1 failover
attempts. Trying to failover immediately.
Another error I get when using the hdfs command line:
root@accumulo-monitor-0:/opt/hadoop/etc/hadoop# hdfs haadmin -ns ozone-om -getAllServiceState
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/ozone/share/ozone/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
ozone-om-0.ozone-om.equitus:9862 Failed to connect: Unknown protocol: org.apache.hadoop.ha.HAServiceProtocol
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.getProtocolImpl(ProtobufRpcEngine2.java:498)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:565)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:552)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)
ozone-om-1.ozone-om.equitus:9862 Failed to connect: Unknown protocol: org.apache.hadoop.ha.HAServiceProtocol
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.getProtocolImpl(ProtobufRpcEngine2.java:498)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:565)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:552)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1093)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2966)
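From what I can tell, the OM server doesn't implement org.apache.hadoop.ha.HAServiceProtocol at all, which would explain the "Unknown protocol" error: hdfs haadmin only speaks to NameNodes. The Ozone-native way to check which OM is leader seems to be the ozone CLI (the service id here is my guess based on my setup):

```
# run from a node with the ozone CLI and client configs available
ozone admin om getserviceroles -id=ozone-om
```

If that prints each OM node with a LEADER/FOLLOWER role, the server-side HA setup is healthy and the problem is purely client-side configuration.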
Here are my Hadoop configs on the Accumulo pod:
core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>ofs://ozone-om.equitus/</value>
</property>
<property>
<name>fs.ofs.impl</name>
<value>org.apache.hadoop.fs.ozone.RootedOzoneFileSystem</value>
</property>
</configuration>
hdfs-site.xml:
(setting the key dfs.client.failover.proxy.provider.ozone-om has no visible effect)
<configuration>
<property>
<name>dfs.client.failover.proxy.provider.ozone-om</name>
<value>org.apache.hadoop.ozone.om.ha.HadoopRpcOMFailoverProxyProvider</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider</name>
<value>org.apache.hadoop.ozone.om.ha.HadoopRpcOMFailoverProxyProvider</value>
</property>
<property>
<name>dfs.client.retry.policy.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled.ozone-om</name>
<value>true</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>ozone-om</value>
</property>
<property>
<name>dfs.ha.namenodes.ozone-om</name>
<value>ozone-om-0,ozone-om-1</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ozone-om.ozone-om-0</name>
<value>ozone-om-0.ozone-om.equitus:9862</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ozone-om.ozone-om-1</name>
<value>ozone-om-1.ozone-om.equitus:9862</value>
</property>
<property>
<name>dfs.namenode.http-address.ozone-om.ozone-om-0</name>
<value>ozone-om-0.ozone-om.equitus:9874</value>
</property>
<property>
<name>dfs.namenode.http-address.ozone-om.ozone-om-1</name>
<value>ozone-om-1.ozone-om.equitus:9874</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
<property>
<name>dfs.datanode.use.datanode.hostname</name>
<value>true</value>
</property>
<property>
<name>dfs.datanode.synconclose</name>
<value>true</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
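For completeness, here is what I think the Ozone-native client config should look like in place of the dfs.* HA keys above, based on reading the OM HA code: the ozone.om.* keys appear to be what HadoopRpcOMFailoverProxyProvider actually reads. The service id "ozone-om" and the node ids om0/om1 are my assumptions matching my pod names, so treat this as a sketch, not a verified config:

```xml
<!-- ozone-site.xml (or merged into core-site.xml) on the client -->
<configuration>
  <!-- logical service id; this is also the authority in the ofs:// URI -->
  <property>
    <name>ozone.om.service.ids</name>
    <value>ozone-om</value>
  </property>
  <!-- node ids belonging to that service id -->
  <property>
    <name>ozone.om.nodes.ozone-om</name>
    <value>om0,om1</value>
  </property>
  <!-- RPC address for each node id -->
  <property>
    <name>ozone.om.address.ozone-om.om0</name>
    <value>ozone-om-0.ozone-om.equitus:9862</value>
  </property>
  <property>
    <name>ozone.om.address.ozone-om.om1</name>
    <value>ozone-om-1.ozone-om.equitus:9862</value>
  </property>
</configuration>
```

If that is right, fs.defaultFS would then point at the service id (ofs://ozone-om/) instead of the Kubernetes service hostname, so the client can enumerate and fail over between the individual OM pods itself.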
Anyone have any ideas? I feel like I'm either missing a proxy configuration key or Ozone doesn't support this yet. I searched the repo and online and found nothing about a client-side proxy configuration key for this.