
I'm trying to turn a CSV file into sequence files so that I can train and run a classifier across the data. I have a Java job file that I compile and then jar into the Mahout job jar. When I try to hadoop jar my job in the Mahout jar, I get a java.lang.ClassNotFoundException: org.apache.mahout.math.VectorWritable. I'm not sure why, because if I look in the Mahout jar, that class is indeed present.

Here are the steps I'm following:

# get a new copy of the mahout jar
rm iris.jar
cp /home/stephen/home/libs/mahout-distribution-0.7/core/target/mahout-core-0.7-job.jar iris.jar
javac -cp :/home/stephen/home/libs/hadoop-1.0.4/hadoop-core-1.0.4.jar:/home/stephen/home/libs/mahout-distribution-0.7/core/target/mahout-core-0.7-job.jar -d bin/ src/edu/iris/seq/CsvToSequenceFile.java
jar ufv iris.jar -C bin .
hadoop jar iris.jar edu.iris.seq.CsvToSequenceFile iris-data iris-seq
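
As a sanity check (this verification step is my addition, not one of the original build steps), the jar contents can be listed with the standard jar tool to confirm the class is really in there:

# sanity check: VectorWritable should appear in the jar listing
jar tf iris.jar | grep VectorWritable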

and this is what my java file looks like

package edu.iris.seq;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.math.VectorWritable;

public class CsvToSequenceFile {

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {

        String inputPath = args[0];
        String outputPath = args[1];

        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJobName("Csv to SequenceFile");
        job.setJarByClass(Mapper.class);

        // identity map, no reduce
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setNumReduceTasks(0);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(VectorWritable.class);

        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setInputFormatClass(TextInputFormat.class);

        TextInputFormat.addInputPath(job, new Path(inputPath));
        SequenceFileOutputFormat.setOutputPath(job, new Path(outputPath));

        // submit and wait for completion
        job.waitForCompletion(true);
    }
}

Here is the error from the command line:

12/10/30 10:43:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/10/30 10:43:33 INFO input.FileInputFormat: Total input paths to process : 1
12/10/30 10:43:33 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/30 10:43:33 WARN snappy.LoadSnappy: Snappy native library not loaded
12/10/30 10:43:34 INFO mapred.JobClient: Running job: job_201210300947_0005
12/10/30 10:43:35 INFO mapred.JobClient:  map 0% reduce 0%
12/10/30 10:43:50 INFO mapred.JobClient: Task Id : attempt_201210300947_0005_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.math.VectorWritable
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:899)
    at org.apache.hadoop.mapred.JobConf.getOutputValueClass(JobConf.java:929)
    at org.apache.hadoop.mapreduce.JobContext.getOutputValueClass(JobContext.java:145)
    at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:61)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:628)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:753)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.math.VectorWritable
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:867)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:891)
    ... 11 more

Any ideas how to fix this, or am I even approaching this correctly? I'm new to Hadoop and Mahout, so if I'm doing something the hard way, let me know. Thanks!

wangburger

3 Answers

This is a very common problem, and almost certainly an issue with the way you are specifying your classpath in the hadoop command.

The way hadoop works, when you give the "hadoop" command, it ships your job to a tasktracker to execute. So it's important to keep in mind that your job runs in a separate JVM, with its own classpath. Part of what you are doing with the "hadoop" command is specifying the classpath that should be used there.

If you are using maven as a build system, I strongly recommend building a "fat jar", using the shade plugin. This will build a jar that contains all your necessary dependencies, and you won't have to worry about classpath issues when you add dependencies to your hadoop job, because you are shipping out a single jar.
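
For reference, a minimal sketch of the shade plugin section in pom.xml might look like this (default shading behavior assumed; the original answer gives no configuration, so treat this as illustrative):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <!-- bind shading to the package phase so "mvn package" produces the fat jar -->
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>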

If you don't want to go this route, have a look at this article, which describes your problem and some potential solutions. In particular, this should work for you:

Include the JAR in the “-libjars” command line option of the hadoop jar … command.
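
Applied to the command from the question, that would look something like the sketch below. Note that -libjars is handled by GenericOptionsParser, so it only takes effect if the job is run through ToolRunner; the first WARN line in the log above is hinting at exactly this. The jar path is taken from the javac line in the question.

# hypothetical invocation; the job must implement Tool for -libjars to be parsed
hadoop jar iris.jar edu.iris.seq.CsvToSequenceFile \
  -libjars /home/stephen/home/libs/mahout-distribution-0.7/core/target/mahout-core-0.7-job.jar \
  iris-data iris-seq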

Paul Sanwald
  • All of the classes the job needs are in the single mahout jar, so I didn't think I needed to specify this, but even putting that jar in the classpath via -libjars didn't change anything. – wangburger Oct 30 '12 at 14:08
  • you mean iris.jar contains everything you need? I just looked at the original command again, and it seems like you are doing some crazy stuff to your jar. you can inspect it with jar -tvf iris.jar, and it will show you the contents. but the exception that you are seeing means that the VectorWritable class is not on the classpath, and that is the root of the issue – Paul Sanwald Oct 30 '12 at 14:22
  • Yeah, the iris.jar is the mahout-core-0.7-job.jar renamed and then my class is added to it. I did that based on the answer here: http://stackoverflow.com/questions/11479600/how-do-i-build-run-this-simple-mahout-program-without-getting-exceptions. And examining the iris.jar shows both my class and VectorWritable. And just for good measure, I put the mahout-core-0.7-job.jar on the classpath, which also contains VectorWritable – wangburger Oct 30 '12 at 14:25

Try specifying the classpath explicitly: instead of hadoop jar iris.jar edu.iris.seq.CsvToSequenceFile iris-data iris-seq, try something like java -cp ...

David Soroko
  • Are you saying get rid of the hadoop jar command entirely and execute the code via java? – wangburger Oct 30 '12 at 14:17
  • Oops, I misread your command line. No, what I mean is that you need to specify the location of the jars explicitly, something like `hadoop jar /path/to/yourjar/iris.jar ...` – David Soroko Oct 30 '12 at 15:11

Create a jar with dependencies when you build your (map/reduce) jar.

With Maven, you can add the snippet below to your pom.xml and build with mvn clean package assembly:single. This will create the jar with dependencies in the target folder, and the created jar will be named something like <artifactId>-1.0-SNAPSHOT-jar-with-dependencies.jar.

<build>
  <plugins>
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
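
Once built, the job would be launched the same way as before; the artifact name here is hypothetical, since it depends on your project's artifactId:

# hypothetical: substitute your actual artifactId in the jar name
hadoop jar target/iris-1.0-SNAPSHOT-jar-with-dependencies.jar edu.iris.seq.CsvToSequenceFile iris-data iris-seq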

Hopefully this answers your question.

Rajiv