
I am writing a program that receives the source code of the mappers/reducers, dynamically compiles them, and makes a JAR file out of them. It then has to run this JAR file on a Hadoop cluster.

For the last part, I set up all the required parameters dynamically through my code. However, the problem I am facing is that the code requires the compiled mapper and reducer classes at compile time. At compile time I do not have these classes; they are only received later, at run time (e.g. through a message from a remote node). I would appreciate any idea/suggestion on how to get past this problem.

Below is the code for the last part. The problem is at job.setMapperClass(Mapper_Class.class) and job.setReducerClass(Reducer_Class.class), which require the class files (Mapper_Class.class and Reducer_Class.class) to be present at compile time:

    private boolean run_Hadoop_Job(String className){
        try{
            System.out.println("Starting to run the code on Hadoop...");
            String[] argsTemp = { "project_test/input", "project_test/output" };
            // create a configuration
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://localhost:54310");
            conf.set("mapred.job.tracker", "localhost:54311");
            conf.set("mapred.jar", jar_Output_Folder + java.io.File.separator
                                    + className + ".jar");
            conf.set("mapreduce.map.class", "Mapper_Reducer_Classes$Mapper_Class.class");
            conf.set("mapreduce.reduce.class", "Mapper_Reducer_Classes$Reducer_Class.class");
            // create a new job based on the configuration
            Job job = new Job(conf, "Hadoop Example for dynamically and programmatically compiling-running a job");
            job.setJarByClass(Platform.class);
            //job.setMapperClass(Mapper_Class.class);
            //job.setReducerClass(Reducer_Class.class);

            // key/value of your reducer output
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(argsTemp[0]));
            // this deletes possible output paths to prevent job failures
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path(argsTemp[1]);
            fs.delete(out, true);
            // finally set the empty out path
            FileOutputFormat.setOutputPath(job, new Path(argsTemp[1]));

            //job.submit();
            System.exit(job.waitForCompletion(true) ? 0 : 1);
            System.out.println("Job Finished!");
        } catch (Exception e) { return false; }
        return true;
    }

Revised: I revised the code to specify the mapper and reducer using `conf.set("mapreduce.map.class", "my mapper.class")`. Now the code compiles correctly, but when it is executed it throws the following error:

    Dec 24, 2012 6:49:43 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
    INFO: Task Id : attempt_201212240511_0006_m_000001_2, Status : FAILED
    java.lang.RuntimeException: java.lang.ClassNotFoundException: Mapper_Reducer_Classes$Mapper_Class.class
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:809)
        at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:157)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:569)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

reza

3 Answers


If you don't have them at compile time, then directly set the name in the configuration like this:

conf.set("mapreduce.map.class", "org.what.ever.ClassName");
conf.set("mapreduce.reduce.class", "org.what.ever.ClassName");
Thomas Jungblut
  • You have to add the `Hadoop` jar to a property called `tmpjars`. So it would work like this: `conf.set("tmpjars", "/usr/local/hadoop/hadoop-core.jar,/usr/local/hadoop/hadoop-example.jar")`. Jar paths must be separated by commas. Note that this is quite hacky, and you have to take care that these jars are actually there on the client machine (in order for Hadoop to copy them to HDFS and download them to the tasktrackers). – Thomas Jungblut Dec 24 '12 at 12:48
  • thanks Thomas. I figured this part out and my code now compiles correctly. But during execution it throws an error. I revised my initial post to reflect this. Any idea? – reza Dec 24 '12 at 12:55
  • Did you explicitly add the jar that contains your mapper to `tmpjars`? – Thomas Jungblut Dec 24 '12 at 12:59
  • I tried that too. Here are the options that I tried: 1) I specified the required JARs when calling java (java -cp required_JARs) without specifying conf.set("tmpjars", ...). This worked, the job was submitted to Hadoop and it even showed "INFO: map 0% reduce 0%", but it then suddenly throws "java.lang.ClassNotFoundException: Mapper_Reducer_Classes$Mapper_Class.class". – reza Dec 24 '12 at 13:34
  • 2) If I specify the required JARs and also use conf.set("tmpjars",...), it throws the following error as soon as it executes "System.exit(job.waitForCompletion(true) ? 0 : 1);": "java.io.FileNotFoundException: File does not exist: /Users/me/My_Software/hadoop-0.20.2/hadoop-0.20.2-core.jar". I double checked and the file does exist on my system. I think the main reason for this is that it checks that path on HDFS. – reza Dec 24 '12 at 13:35
  • 3) Specifying conf.set("tmpjars",...) but not using "java -cp JAR_Class_paths" throws the following error as soon as the first Hadoop method (Configuration conf = ...) is called: "java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration". – reza Dec 24 '12 at 13:36
  • It seems that the first try has the best progress. Do you think there is some problem with how I defined my mapper and reducer? conf.set("mapreduce.map.class", "Mapper_Reducer_Classes$Mapper_Class.class"); conf.set("mapreduce.reduce.class", "Mapper_Reducer_Classes$Reducer_Class.class"); I have a Mapper_Reducer_Classes.java file in which both the Mapper_class and Reducer_Class are defined. do I need to specify the ".class" extension here as well and should I have the "Mapper_Reducer_Classes$" at the beginning? – reza Dec 24 '12 at 13:38
  • You have to provide the package name before your class – Thomas Jungblut Dec 24 '12 at 13:59
  • It is defined in the default package (no packaging). In fact, I removed all the packaging to make it easier to debug – reza Dec 24 '12 at 14:03
  • Then print out the class's full name via java class methods and set it there. Also I think it was the 'libjars' property not 'tmpjar'. Sorry have no source code on my tablet :/ – Thomas Jungblut Dec 24 '12 at 14:59
  • thanks thomas. could you please elaborate on what you mean by "print out the class's full name via java class methods and set it there" – reza Dec 24 '12 at 18:35
  • thanks again, I finally got it to work: I removed the .class from the end of conf.set() and put the package name back into it, and now it works :) conf.set("mapreduce.map.class", "org.mypackage.Mapper_Reducer_Classes$Mapper_Class"); conf.set("mapreduce.reduce.class", "org.mypackage.Mapper_Reducer_Classes$Reducer_Class"); – reza Dec 24 '12 at 19:03
  • You can get the full and correct naming of a class by using `Mapper_Reducer_Classes.class.getName()`. So you can always see what needs to be configured there. Glad it works for you now. – Thomas Jungblut Dec 25 '12 at 10:52
  • thanks. I tried to follow this and deleted my compiled classes and added the Mapper_Reducer_Classes.class.getName(). Unfortunately, that broke my code again and I am no longer able to fix it. I think, I have some issue in the way I am referencing the classes. I am back to the initial problem and would appreciate any idea. I created a more detailed explanation of the problem and posted it as a new post: http://stackoverflow.com/questions/14071496/dynamically-compiling-and-running-a-hadoop-job-from-another-java-file – reza Dec 28 '12 at 15:09

The problem is that the TaskTracker cannot see classes in your local JRE.

I figured it out this way (Maven project):

First, add this plugin to pom.xml; it will build your application jar file including all the dependency jars:

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                </execution>
            </executions>
            <configuration>
                <filters>
                    <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>META-INF/*.SF</exclude>
                            <exclude>META-INF/*.DSA</exclude>
                            <exclude>META-INF/*.RSA</exclude>
                        </excludes>
                    </filter>
                </filters>
                <finalName>sample</finalName>
                <!-- 
                <finalName>uber-${artifactId}-${version}</finalName>
                -->
            </configuration>
        </plugin>
    </plugins>
</build>

In the Java source code, add these lines; they will pick up your sample.jar, which is built to target/sample.jar by the <finalName> tag in the pom.xml above:

      Configuration config = new Configuration();
      config.set("fs.default.name", "hdfs://ip:port");
      config.set("mapred.job.tracker", "hdfs://ip:port");

      JobConf job = new JobConf(config);
      job.setJar("target/sample.jar");

This way, your TaskTrackers can refer to the classes you wrote, and the ClassNotFoundException will not happen.
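
For completeness, here is a rough, self-contained sketch (not from the original answer) of what a full submission along these lines might look like, using the old `mapred` API that `JobConf` belongs to. The host/port values, package, and class names are placeholders, and the mapper/reducer are set by name via the (assumed) old-API property keys so they do not need to be on the client's compile-time classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitSampleJob {
        public static void main(String[] args) throws Exception {
            Configuration config = new Configuration();
            config.set("fs.default.name", "hdfs://ip:port");   // placeholder NameNode address
            config.set("mapred.job.tracker", "ip:port");        // placeholder JobTracker address

            JobConf job = new JobConf(config);
            job.setJar("target/sample.jar");                     // the shaded jar built by the pom above

            // hypothetical mapper/reducer bundled inside sample.jar, referenced by name
            // ("mapred.mapper.class"/"mapred.reducer.class" are the old-API property names)
            job.set("mapred.mapper.class", "org.mypackage.MyMapper");
            job.set("mapred.reducer.class", "org.mypackage.MyReducer");

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.setInputPaths(job, new Path("project_test/input"));
            FileOutputFormat.setOutputPath(job, new Path("project_test/output"));

            JobClient.runJob(job);   // submits the job and blocks until it finishes
        }
    }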

Adrian Seungjin Lee
  • This is the best answer. You may not want to use a shaded jar that contains EVERYTHING needed for the hadoop job and keep all of that stuff on the classpath for the external java program. There may be jar conflicts or other issues. Referencing the shaded jar via path allows it to be abstracted from the external program and be sent to the hadoop cluster via built-in APIs. You can build a different jar that is used by the external program containing only the specific dependencies needed for that program. – Galuvian Jan 06 '14 at 19:58

You only need a reference to the Class object for the class that will be dynamically created. Use `Class.forName("foo.Mapper")` instead of `foo.Mapper.class`.
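
A minimal sketch of that idea against the new `mapreduce` API (the class names passed in are placeholders; the classes only have to be resolvable at run time, e.g. from the job jar):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class DynamicJobSetup {
        // Sets the mapper and reducer on a Job from class names known only at run time.
        public static void setClassesByName(Job job, String mapperName, String reducerName)
                throws ClassNotFoundException {
            job.setMapperClass(Class.forName(mapperName).asSubclass(Mapper.class));
            job.setReducerClass(Class.forName(reducerName).asSubclass(Reducer.class));
        }
    }

It would be called with something like `setClassesByName(job, "foo.Mapper", "foo.Reducer");` once the class names have been received at run time.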

Sean Owen