Converting PDF file to text on HDFS (JAVA)

Question

I have written a PdfInputFormat class that extends FileInputFormat. It returns a PdfRecordReader object, which does all of the PDF-to-text conversion. I am getting an error when I run the job.

I am creating the jar in Eclipse:

Tool: Eclipse. Method of exporting: Export > create jar.

I select the option to package the required libraries into the jar.

I am executing the jar using the following command:

hadoop jar /home/tcs/converter.jar com.amal.pdf.PdfInputDriver /user/tcs/wordcountfile.pdf /user/convert

After running this I get the following exception:

17/06/09 09:26:51 WARN mapred.LocalJobRunner: job_local1466878685_0001
java.lang.Exception: java.lang.NoClassDefFoundError: org/apache/fontbox/cmap/CMapParser
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549)
Caused by: java.lang.NoClassDefFoundError: org/apache/fontbox/cmap/CMapParser
at org.apache.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:548)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:383)
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:372)
at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:61)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:552)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:248)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:207)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
at com.amal.pdf.PdfRecordReader.initialize(PdfRecordReader.java:43)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.fontbox.cmap.CMapParser
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 21 more
17/06/09 09:26:52 INFO mapreduce.Job: Job job_local1466878685_0001 failed with state FAILED due to: NA
17/06/09 09:26:52 INFO mapreduce.Job: Counters: 0
false

Here is the code:

PdfRecordReader class:
package com.amal.pdf;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfRecordReader extends RecordReader<Object, Object> {
    private String[] lines = null;
    private LongWritable key = null;
    private Text value = null;
    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        final Path file = split.getPath();
        /*
         * Open the file and extract the text of the whole PDF with PDFBox.
         * The split offsets are not used; the file is parsed in one piece.
         */
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(file);
        PDDocument pdf = null;
        String parsedText = null;
        try {
            pdf = PDDocument.load(fileIn);
            PDFTextStripper stripper = new PDFTextStripper();
            // getting exception because of this line
            parsedText = stripper.getText(pdf);
        } finally {
            // Release the document and the HDFS stream once the text is extracted.
            if (pdf != null) {
                pdf.close();
            }
            fileIn.close();
        }
        this.lines = parsedText.split("\n");
    }
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // The key is the 1-based number of the line being emitted, so the
        // current key value is also the array index of the next line.
        int next = (key == null) ? 0 : (int) key.get();
        if (lines == null || next >= lines.length) {
            return false;
        }
        key = new LongWritable(next + 1);
        value = new Text(lines[next]);
        return true;
    }
    @Override
    public LongWritable getCurrentKey() throws IOException,
            InterruptedException {
        return key;
    }
    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }
    @Override
    public float getProgress() throws IOException, InterruptedException {
        return 0;
    }
    @Override
    public void close() throws IOException {
    }
}

Note: Since this is for a Hadoop environment, Eclipse will not create a runnable JAR for this project. Is there any way to export this project as a runnable JAR?

I need help understanding what I am doing wrong.

Make sure that fontbox is in your project, with the same version as pdfbox. The latest version is 2.0.6. – Tilman Hausherr, Jun 09 '17 at 07:23

score 0 · Answer 1 · answered Jun 11 '17 at 08:08

The error occurs because Hadoop could not find the class org.apache.fontbox.cmap.CMapParser at run time. It belongs to FontBox, an external library that PDFBox calls into (see PDFont.parseCmap in the stack trace).

The dependent jar was not packaged with the jar you passed to the hadoop command, so its classes were not on the task's classpath. When a job is submitted, the job jar is distributed to the nodes where the data lies in the HDFS cluster, and a dependency that does not travel with it cannot be found there.
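
One way to confirm this diagnosis is to ask the JVM directly whether the class is visible. The following is a minimal sketch using only the standard library; ClasspathCheck and its two helper methods are hypothetical names chosen for illustration and are not part of Hadoop or PDFBox.

```java
import java.security.CodeSource;

/** Small diagnostic: is a class visible to the current class loader, and from where? */
final class ClasspathCheck {

    /** Returns true if the named class can be found, without running its static initializers. */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns the jar or directory the class was loaded from, or null if it is unknown. */
    static String locationOf(String className) {
        try {
            Class<?> c = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null ? null : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the stack trace when no argument is given.
        String name = args.length > 0 ? args[0] : "org.apache.fontbox.cmap.CMapParser";
        System.out.println(name + " loadable: " + isLoadable(name)
                + ", location: " + locationOf(name));
    }
}
```

If a call such as ClasspathCheck.isLoadable("org.apache.fontbox.cmap.CMapParser") is made inside the same JVM as the failing task (for example at the start of initialize in PdfRecordReader) and returns false, that would be consistent with the stack trace in the question.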

There are two solutions you can follow:

1) Pass the external jars to the hadoop command with -libjars:

hadoop jar /home/tcs/converter.jar com.amal.pdf.PdfInputDriver -libjars <path to external jars comma separated> /user/tcs/wordcountfile.pdf /user/convert

-libjars is one of Hadoop's generic options, so it only takes effect if the driver parses generic options, for example by running through ToolRunner.

2) Or use the Maven Shade plugin to build an uber jar that includes all dependent libraries inside your own jar.
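
Whichever option is used, it can help to verify that a given jar really contains the missing class as a top-level entry. A jar that only nests the FontBox jar inside itself will not have that entry. The sketch below uses only java.util.jar; JarContents, entryNameFor, and containsClass are hypothetical names for illustration.

```java
import java.io.IOException;
import java.util.jar.JarFile;

/** Checks whether a jar file holds the .class entry for a given class name. */
final class JarContents {

    /** Converts a class name such as a.b.C into its jar entry path a/b/C.class. */
    static String entryNameFor(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns true if the jar at jarPath has a top-level entry for className. */
    static boolean containsClass(String jarPath, String className) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryNameFor(className)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: JarContents <path-to-jar> <fully.qualified.ClassName>");
            return;
        }
        System.out.println(containsClass(args[0], args[1]));
    }
}
```

Run against /home/tcs/converter.jar with org.apache.fontbox.cmap.CMapParser as the class name, this prints true only when the FontBox classes have been merged into the jar itself, which is what an uber jar is expected to do.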

answered Jun 11 '17 at 08:08

Ramesh Maharjan

Thank you so much. – shubham Jun 12 '17 at 04:13
Did you get it solved? – Ramesh Maharjan Jun 12 '17 at 04:44
Yes, it is solved. – shubham Jun 12 '17 at 06:53
An acceptance should help me too and an upvote when you would be eligible :) – Ramesh Maharjan Jun 12 '17 at 06:59
@shubham press the green checkmark. – Tilman Hausherr Jun 12 '17 at 13:42
