First of all, I've seen this is a recurrent question on stackoverflow, for instance:
Not able to run Hadoop job remotely
Running Hadoop Job Remotely
Exception while submitting a mapreduce job from remote system
Running java hadoop job on local/remote cluster
Nevertheless, I've not seet yet a complete example about how to remotely submit a job:
- When the remote cluster uses MRv2/YARN
- And it is kerberized
Ii must be said such a cluser works perfectly when submitting jobs by means of the yarn
command (previously getting the Kerberos TGT by using kinit
).
Regarding the first issue, I've seen the code for a MRv1 job is something like:
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://" + host + ":8020");
conf.set("mapred.job.tracker", "hdfs://" + host + ":8021");
Job job = new Job(conf, "job-name");
job.setJarByClass(Main.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.waitForCompletion(true);
Nevertheless, when using MRv2/YARN there is no Jobtracker. Googling a bit I've seen part of the changes must be made in order the above code works with YARN; specifically:
// instead of conf.set("mapred.job.tracker", "hdfs://" + host + ":8021");
conf.set("mapreduce.framework.name", "yarn");
conf.set("yarn.resourcemanager.address", "host:9001");
Is that right? What else is needed? Is still needed the conf.set("fs.default.name", "hdfs://" + host + ":8020");
configuration part? What about setting the hadoop.job.ugi
?
I've also seen that it is possible to put the core-default.xml
and the core-site.xml
files in the classpath in order their properties are loaded (as stated in the Configuration
javadoc). Is this just an alternative to the programmatically settings of the code above, or are both things necessary?
Regarding the second issue, Kerberos, I've seen it is necessary to perform a privileged action, but how does that code look like? I've tried something like:
final Configuration conf = new Configuration();
conf.set("...", "..."); // the settings below
UserGroupInformation.setConfiguration(conf);
UserGroupInformation ugi = UserGroupInformation.createProxyUser("user1", UserGroupInformation.getCurrentUser());
ugi.setAuthenticationMethod(AuthenticationMethod.KERBEROS);
ugi.doAs(new PrivilegedExceptionAction<Void>() {
@Override
public Void run() throws Exception {
Job job = Job.getInstance(conf, "job-name");
...
}
});
But it is not working since I think I have to fix the remote submission part first.
Any hints would be really appreciated. Thanks!