I have a simple Java program that wraps DistCp to copy files between Hadoop clusters. I can run it successfully both from the IDE and from the Hadoop CLI.
I wanted a JSP web application so that people could interact with my program through a web interface.
I created a fat jar with all dependencies and deployed it in my web application. The problem is that whenever the program tries to submit the DistCp job, it fails with the following error:
java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:143)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:108)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:101)
at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:419)
at org.apache.hadoop.tools.DistCp.<init>(DistCp.java:106)
at replication.ReplicationUtils.doCopy(ReplicationUtils.java:127)
at replication.ReplicationUtils.copy(ReplicationUtils.java:77)
at replication.parallel.DistCpTask.run(DistCpTask.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I checked mapreduce.framework.name and it is indeed set to yarn.
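For reference, this is roughly how I checked it (a minimal sketch; it assumes the cluster's mapred-site.xml is on the classpath):

import org.apache.hadoop.mapred.JobConf;

// JobConf pulls in mapred-default.xml and mapred-site.xml,
// which is where mapreduce.framework.name is normally set.
JobConf conf = new JobConf();
System.out.println(conf.get("mapreduce.framework.name")); // prints "yarn" for me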
Any ideas?
UPDATE1:
After some debugging I narrowed it down to the following piece of code:
import java.util.ServiceLoader;
import org.apache.hadoop.mapreduce.protocol.ClientProtocolProvider;

// Print every ClientProtocolProvider that ServiceLoader can discover
// via META-INF/services on the current classpath.
Iterable<ClientProtocolProvider> frameworkLoader =
        ServiceLoader.load(ClientProtocolProvider.class);
for (ClientProtocolProvider cpp : frameworkLoader) {
    System.out.println(cpp.toString());
}
When I run it locally I get:
org.apache.hadoop.mapred.YarnClientProtocolProvider@7a4f0f29
org.apache.hadoop.mapred.LocalClientProtocolProvider@5fa7e7ff
But when it runs from the web server I get:
org.apache.hadoop.mapred.LocalClientProtocolProvider@5fa7e7ff
I still cannot figure out why this happens. YarnClientProtocolProvider is in the fat jar that I deploy to the web server.
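One quick sanity check (just a diagnostic sketch of mine) separates a missing class from a missing service registration:

try {
    // If this succeeds while ServiceLoader still finds nothing, the class is
    // packaged but its META-INF/services entry was lost when the jar was built.
    Class.forName("org.apache.hadoop.mapred.YarnClientProtocolProvider");
    System.out.println("class is on the classpath");
} catch (ClassNotFoundException e) {
    System.out.println("class is missing");
}

Since the class is in my jar, I expect this to succeed, which would point at the service registry rather than at the jar contents.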
UPDATE2:
It turns out that the tool I use to build the uber jar does not concatenate the service provider declarations under the META-INF/services directories of the dependency jars; files with the same name overwrite each other, so the file that survives contains only 'org.apache.hadoop.mapred.LocalClientProtocolProvider'.
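For reference, each META-INF/services/org.apache.hadoop.mapreduce.protocol.ClientProtocolProvider file lists one implementation class per line, so after a correct merge the single copy in the uber jar should contain both entries:

org.apache.hadoop.mapred.YarnClientProtocolProvider
org.apache.hadoop.mapred.LocalClientProtocolProvider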
I am still wondering why, when I use

hadoop jar my.jar ....

it recognizes 'org.apache.hadoop.mapred.YarnClientProtocolProvider' even though that entry is not present under the META-INF/services directory of my.jar. My guess is that the hadoop launcher prepends the cluster's own Hadoop jars to the classpath, so ServiceLoader picks the entry up from those jars (hadoop-mapreduce-client-jobclient ships it) rather than from mine.
Now I think the real question is how to create an uber jar that concatenates the service provider entries instead of letting one file overwrite the others.
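One lead: the Maven shade plugin has a ServicesResourceTransformer that concatenates META-INF/services files from all dependency jars. A minimal sketch of the relevant pom.xml fragment, assuming the shade plugin builds the fat jar (my actual build may differ):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- Concatenate META-INF/services files instead of letting one overwrite the rest -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>

If the jar is built with Gradle instead, the Shadow plugin's mergeServiceFiles() option does the equivalent.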