21

I have been trying to call a MapReduce job from a simple Java program in the same package. I tried to reference the MapReduce jar file in my Java program and call it using the runJar(String args[]) method, also passing the input and output paths for the MapReduce job, but the program didn't work.


How do I run such a program where I just pass the input, output and jar path to its main method? Is it possible to run a MapReduce job (jar) through it? I want to do this because I want to run several MapReduce jobs one after another, where my Java program will call each such job by referring to its jar file. If this is possible, I might as well just use a simple servlet to do the calling and refer to its output files for the graph purpose.


/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */

/**
 *
 * @author root
 */
import org.apache.hadoop.util.RunJar;
import java.util.*;

public class callOther {

    public static void main(String args[]) throws Throwable
    {
        // arguments for RunJar: the jar path first, then the job's own arguments (input, output)
        List<String> arg = new ArrayList<String>();

        String output = "/root/Desktop/output";

        arg.add("/root/NetBeansProjects/wordTool/dist/wordTool.jar");

        arg.add("/root/Desktop/input");
        arg.add(output);

        RunJar.main(arg.toArray(new String[0]));
    }
}
Ravi Trivedi

6 Answers

32

Oh please don't do it with RunJar, the Java API is very good.

See how you can start a job from normal code:

// create a configuration
Configuration conf = new Configuration();
// create a new job based on the configuration
Job job = new Job(conf);
// here you have to put your mapper class
job.setMapperClass(Mapper.class);
// here you have to put your reducer class
job.setReducerClass(Reducer.class);
// here you have to set the jar which is containing your 
// map/reduce class, so you can use the mapper class
job.setJarByClass(Mapper.class);
// key/value of your reducer output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// this is setting the format of your input, can be TextInputFormat
job.setInputFormatClass(SequenceFileInputFormat.class);
// same with output
job.setOutputFormatClass(TextOutputFormat.class);
// here you can set the path of your input
SequenceFileInputFormat.addInputPath(job, new Path("files/toMap/"));
// this deletes possible output paths to prevent job failures
FileSystem fs = FileSystem.get(conf);
Path out = new Path("files/out/processed/");
fs.delete(out, true);
// finally set the empty out path
TextOutputFormat.setOutputPath(job, out);

// this waits until the job completes and prints debug out to STDOUT or whatever
// has been configured in your log4j properties.
job.waitForCompletion(true);
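
Since the question also asks about running several jobs one after another, here is a minimal sketch of chaining two jobs with the same API. FirstMapper, FirstReducer, SecondMapper, SecondReducer and all paths are placeholders to be replaced with your own classes and directories:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobChain {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // first job: reads the raw input and writes to an intermediate directory
        Job first = new Job(conf, "first job");
        first.setJarByClass(JobChain.class);
        first.setMapperClass(FirstMapper.class);    // placeholder: your first Mapper
        first.setReducerClass(FirstReducer.class);  // placeholder: your first Reducer
        FileInputFormat.addInputPath(first, new Path("files/input/"));
        FileOutputFormat.setOutputPath(first, new Path("files/intermediate/"));

        // stop the chain if the first job fails
        if (!first.waitForCompletion(true)) {
            System.exit(1);
        }

        // second job: consumes the output of the first job
        Job second = new Job(conf, "second job");
        second.setJarByClass(JobChain.class);
        second.setMapperClass(SecondMapper.class);    // placeholder: your second Mapper
        second.setReducerClass(SecondReducer.class);  // placeholder: your second Reducer
        FileInputFormat.addInputPath(second, new Path("files/intermediate/"));
        FileOutputFormat.setOutputPath(second, new Path("files/final/"));

        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}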

If you are using an external cluster, you have to put the following information into your configuration:

// this should be like defined in your mapred-site.xml
conf.set("mapred.job.tracker", "jobtracker.com:50001"); 
// like defined in core-site.xml
conf.set("fs.default.name", "hdfs://namenode.com:9000");

This should be no problem when the hadoop-core.jar is on your application container's classpath. But I think you should put some kind of progress indicator on your web page, because it may take minutes to hours to complete a Hadoop job ;)
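
Because of that, you might prefer not to block the servlet request thread for the whole run. A minimal sketch (assuming the `job` object configured above) submits without blocking and polls the progress later:

// submit without blocking the current thread (same job setup as above)
job.submit();

// later, e.g. from a status request or a background thread, poll the progress
while (!job.isComplete()) {
    System.out.printf("map %.0f%% reduce %.0f%%%n",
            job.mapProgress() * 100, job.reduceProgress() * 100);
    Thread.sleep(5000);
}
System.out.println(job.isSuccessful() ? "job finished" : "job failed");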

For YARN (Hadoop 2 and later)

For YARN, the following configurations need to be set.

// this should be like defined in your yarn-site.xml
conf.set("yarn.resourcemanager.address", "yarn-manager.com:50001"); 

// framework is now "yarn", should be defined like this in mapred-site.xml
conf.set("mapreduce.framework.name", "yarn");

// like defined in core-site.xml
conf.set("fs.default.name", "hdfs://namenode.com:9000");
Thomas Jungblut
  • @ThomasJungblut - Well, I didn't understand you. From what I have understood from the code above, you have implemented a MapReduce job where you give it the input and output files for the input and output formats you have chosen. And it's NOT using only the Java API you talked about but the Hadoop library as well. What I wanted to ask is: if I have a basic Java/servlet program and a wordcount MapReduce job, how can I start the wordcount job from this Java program without importing any of the Hadoop classes? If I am wrong in understanding your code above then do correct me. Thanks. – Ravi Trivedi Mar 24 '12 at 11:34
  • 1
    Ah okay, so you have a simple jar on a application server and want to just start it? Why can't you put the hadoop jar into it? – Thomas Jungblut Mar 24 '12 at 11:39
  • Well, could you explain the code you have put above? What exactly I am trying to do is call a MapReduce job from my web application. On click events in a web page I would want MapReduce jobs to run in the background and then get their results in the form of a graph. Leaving the graph part aside, if I were to use a servlet to call a MapReduce job and get back the result, how could I do that? – Ravi Trivedi Mar 25 '12 at 04:41
  • 1
    Hey, may I ask, do you need to set some parameters for mapred in the configuration? How does hadoop-core.jar pick up the settings? I am trying to do this, but I failed. Thank you! – lucky_start_izumi Mar 28 '12 at 16:57
  • The configuration *.xml must be in the classpath. – Thomas Jungblut Mar 28 '12 at 17:07
  • @ThomasJungblut : I tried the way you have mentioned above to run a hadoop based MapReduce job from java program. I included the hadoop-core.jar. But then also I kept getting NoClassDefFoundError. It went away only when I included all the lib jars which are present inside the hadoop installation directory along with hadoop-core and hadoop-client jars. This surely is not the way to do it. What am I missing? – Kartikeya Sinha Jul 23 '14 at 06:55
  • @KartikeyaSinha I don't know, depends on what class is missing. – Thomas Jungblut Jul 23 '14 at 07:13
  • Do we really have to put all the configuration xml files (hadoop-site.xml, hdfs-site.xml, hbase-site.xml (am also dealing with some HBase stuff)) in the classpath? I have to run my project on a machine which is not in the cluster and I will be getting cluster information from the JSON. Can't we set all the required information using `Configuration.set()`? – Mahesha999 May 30 '16 at 12:08
  • @Mahesha999 you can. And you can also load the configuration from an arbitrary path. – Thomas Jungblut May 30 '16 at 12:40
  • So you mean I can do this by setting desired params using Configuration.set() and not having xml files on machine running the code? – Mahesha999 May 30 '16 at 13:14
  • hi again, I know this is bad way to ask for help...but can you just give me a small hint about whats going wrong [here](http://stackoverflow.com/q/37548316/1317018), if you have come across bulk loading data in hbase? – Mahesha999 Jun 06 '16 at 07:15
  • `fs.default.name` is present in `core-site.xml` not `hdfs-site.xml` – Manoj Suthar Mar 27 '18 at 04:05
7

Calling a MapReduce job from a Java web application (Servlet)

You can call a MapReduce job from a web application using the Java API. Here is a small example of calling a MapReduce job from a servlet. The steps are given below:

Step 1: First create a MapReduce driver servlet class. Also develop your map and reduce functions. Here goes a sample code snippet:

CallJobFromServlet.java

public class CallJobFromServlet extends HttpServlet {

    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {

        Configuration conf = new Configuration();
        Job job = new Job(conf, "CallJobFromServlet"); // the second argument is only the job name
        job.setJarByClass(CallJobFromServlet.class);
        job.setJobName("Job Name");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Map.class);       // replace Map.class with your Mapper class
        job.setNumReduceTasks(30);
        job.setReducerClass(Reducer.class);  // replace Reducer.class with your Reducer class
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Job input path
        FileInputFormat.addInputPath(job, new Path("hdfs://localhost:54310/user/hduser/input/"));
        // Job output path
        FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:54310/user/hduser/output"));

        try {
            job.waitForCompletion(true);
        } catch (InterruptedException | ClassNotFoundException e) {
            // waitForCompletion throws checked exceptions that doPost cannot declare
            throw new ServletException(e);
        }
    }
}

Step 2: Place all the related jar files (Hadoop and application-specific jars) inside the lib folder of the web server (e.g. Tomcat). This is mandatory for accessing the Hadoop configuration (the Hadoop 'conf' folder has the configuration xml files, i.e. core-site.xml, hdfs-site.xml etc.). Just copy the jars from the Hadoop lib folder to the web server (Tomcat) lib directory. The list of jar names is as follows:

1.  commons-beanutils-1.7.0.jar
2.  commons-beanutils-core-1.8.0.jar
3.  commons-cli-1.2.jar
4.  commons-collections-3.2.1.jar
5.  commons-configuration-1.6.jar
6.  commons-httpclient-3.0.1.jar
7.  commons-io-2.1.jar
8.  commons-lang-2.4.jar
9.  commons-logging-1.1.1.jar
10. hadoop-client-1.0.4.jar
11. hadoop-core-1.0.4.jar
12. jackson-core-asl-1.8.8.jar
13. jackson-mapper-asl-1.8.8.jar
14. jersey-core-1.8.jar

Step 3: Deploy your web application into the web server (in the 'webapps' folder for Tomcat).

Step 4: Create a JSP file and link the servlet class (CallJobFromServlet.java) in the form's action attribute. Here goes a sample code snippet:

Index.jsp

<form id="trigger_hadoop" name="trigger_hadoop" action="./CallJobFromServlet ">
      <span class="back">Trigger Hadoop Job from Web Page </span> 
      <input type="submit" name="submit" value="Trigger Job" />      
</form>
1

Another way, for jobs already implemented in the Hadoop examples (this also requires the Hadoop jars being on the classpath): just call the static main function of the desired job class with the appropriate String[] of arguments, as in the sketch below.
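
For example, a minimal sketch that starts the WordCount job shipped in the Hadoop examples jar (the jar must be on the classpath; the input/output paths are placeholders):

import org.apache.hadoop.examples.WordCount;

public class RunWordCount {
    public static void main(String[] args) throws Exception {
        // the example's main() parses its own args (input dir, output dir)
        // and calls System.exit() when the job finishes
        WordCount.main(new String[] { "/user/hduser/input", "/user/hduser/output" });
    }
}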

faridasabry
1

Because map and reduce run on different machines, all your referenced classes and jars must move from machine to machine.

If you have a packaged jar and run it on your desktop, @ThomasJungblut's answer is OK. But if you run in Eclipse (right-click your class and Run), it doesn't work.

Instead of:

job.setJarByClass(Mapper.class);

Use:

job.setJar("build/libs/hdfs-javac-1.0.jar");

At the same time, your jar's manifest must include the Main-Class property, which is your main class.

Gradle users can put these lines in build.gradle:

jar {
    manifest {
        attributes("Main-Class": mainClassName)
    }
}
Mogsdad
Jiang Libo
0

You can do it this way:

public class Test {

    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic Hadoop options and then calls YourJob.run(args)
        int res = ToolRunner.run(new Configuration(), new YourJob(), args);
        System.exit(res);
    }
}
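
For completeness, a minimal sketch of what YourJob could look like as a Tool implementation (YourMapper, YourReducer and the argument handling are placeholders):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;

public class YourJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // the Configuration passed to ToolRunner.run(...) is available via getConf()
        Job job = new Job(getConf(), "your job");
        job.setJarByClass(YourJob.class);
        job.setMapperClass(YourMapper.class);    // placeholder: your Mapper class
        job.setReducerClass(YourReducer.class);  // placeholder: your Reducer class
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}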
Mogsdad
techlearner
  • The question was: how to run mapreduce jobs from existing Java code. Not from Command Line. This does not answer the question. – Doron Gold Oct 15 '14 at 08:30
0

I can't think of many ways you can do this without involving the hadoop-core library (or indeed, as @ThomasJungblut said, why you would want to).

But if you absolutely must, you could set up an Oozie server with a workflow for your job, and then use the Oozie webservice interface to submit the workflow to Hadoop.
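
If you go that route, submitting the workflow from Java could look roughly like this: a minimal sketch using the Oozie client API, where the host, port, paths and the extra property names are placeholders to be adapted to your workflow definition:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = client.createConfiguration();
        // HDFS path of the deployed workflow.xml (placeholder)
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode.com:9000/user/hduser/workflow");
        // properties referenced from the workflow definition (placeholders)
        conf.setProperty("jobTracker", "jobtracker.com:50001");
        conf.setProperty("nameNode", "hdfs://namenode.com:9000");

        // submits and starts the workflow; the returned id can be used to poll the status
        String jobId = client.run(conf);
        System.out.println("Workflow job submitted: " + jobId);
    }
}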

Again, this seems like a lot of work for something that could just be resolved using Thomas's answer (include the hadoop-core jar and use his code snippet).

Chris White