How to extract text from a PDF file with Apache PDFBox

Question

I would like to extract text from a given PDF file with Apache PDFBox.

I wrote this code:

PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File(filepath);

PDFParser parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(5);
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);

However, I got the following error:

Exception in thread "main" java.lang.NullPointerException
at org.apache.fontbox.afm.AFMParser.main(AFMParser.java:304)

I added pdfbox-1.8.5.jar and fontbox-1.8.5.jar to the class path.

Edit

I added System.out.println("program starts"); to the beginning of the program.

When I ran it, I got the same error as above, and "program starts" did not appear in the console.

Thus, I think I have a problem with the class path or something.

Thank you.

Probably your PDF file is not completely valid and makes PDFBox stumble. You might want to supply the PDF for inspection. — mkl, May 23 '14 at 06:27
Are you sure you start the correct `main()` method? The exception looks like you start the `main()` of `org.apache.fontbox.afm.AFMParser` which looks like PDFBox code, not your code. — mkl, May 23 '14 at 09:24
You're right. I reset the run configuration and now the program works. Thank you very much, mkl. — Benben, May 23 '14 at 11:12
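
When the very first print statement never shows up, as described in the edit above, it can help to confirm which class is actually being launched and what is on the class path. The following is only a minimal diagnostic sketch using plain JDK calls; `ClasspathCheck` and `isOnClasspath` are hypothetical names chosen for illustration, and the two class names in `main` are taken from the stack trace and imports quoted in this thread.

```java
class ClasspathCheck {
    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // If this line does not appear, a different main() is being started
        System.out.println("program starts");
        System.out.println("class path: " + System.getProperty("java.class.path"));
        for (String name : new String[] {
                "org.apache.pdfbox.pdmodel.PDDocument",
                "org.apache.fontbox.afm.AFMParser"}) {
            System.out.println(name + " found: " + isOnClasspath(name));
        }
    }
}
```

If "program starts" is missing from the output, the run configuration is probably pointing at another class, which is what turned out to be the case here.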

score 53 · Answer 1 · Matthias Braun · edited Mar 02 '18 at 10:03

Using PDFBox 2.0.7, this is how I get the text of a PDF:

static String getText(File pdfFile) throws IOException {
    // try-with-resources closes the document even if text extraction fails
    try (PDDocument doc = PDDocument.load(pdfFile)) {
        return new PDFTextStripper().getText(doc);
    }
}

Call it like this:

try {
    String text = getText(new File("/home/me/test.pdf"));
    System.out.println("Text in PDF: " + text);
} catch (IOException e) {
    e.printStackTrace();
}

Since user oivemaria asked in the comments:

You can use PDFBox in your application by adding it to your dependencies in build.gradle:

dependencies {
  // Gradle 7 and later removed the `compile` configuration; `implementation` replaces it
  implementation group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.7'
}

Here's more on dependency management using Gradle.

If you want to keep the PDF's format in the parsed text, give PDFLayoutTextStripper a try.

answered Aug 06 '16 at 17:13 by Matthias Braun

This is better than the accepted answer. I used the same to get the resource as InputStream to load the file from `src\resources` folder. You can also use maven dependency from m2repo https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox – Lucky Nov 30 '16 at 13:44
The PDDocument needs to be closed after use. – DKMDebugin Feb 05 '21 at 15:52

score 35 · Accepted Answer · edited May 17 '19 at 11:04

I ran your code and it worked properly. Maybe your problem is related to the file path you passed to File. I put my PDF on the C drive and hard-coded the file path. Here is my code:

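import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// PDFBox 2.0.x: PDFParser takes an org.apache.pdfbox.io.RandomAccessRead.
// (PDFBox 1.8.x accepted new PDFParser(new FileInputStream(file)) instead.)
public class PDFReader {
    public static void main(String[] args) throws IOException {
        File file = new File("C:/my.pdf");
        PDFParser parser = new PDFParser(new RandomAccessFile(file, "r"));
        parser.parse();
        // Closing the PDDocument also releases the parsed COSDocument and the file
        try (PDDocument pdDoc = parser.getPDDocument()) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            // Page numbers are 1-based; the end page is inclusive
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(5);
            String parsedText = pdfStripper.getText(pdDoc);
            System.out.println(parsedText);
        }
    }
}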

answered May 22 '14 at 18:53 by Emad · edited by centic

It works fine when the PDF file comes from the computer, but when I try to read it from the SD card on Android I get an error like "java.lang.ClassNotFoundException: Didn't find class "java.awt.print.Printable" on path: DexPathList[[zip file "/data/app/com.geeklabs.pdfreader-1/base.apk"],nativeLibraryDirectories=[/vendor/lib, /system/lib]]" – Shailendra Madda Apr 24 '15 at 07:30
I also get "java.lang.NoClassDefFoundError: org.pdfbox.pdmodel.PDDocument" even though I added the libs to the build path – Shailendra Madda Apr 24 '15 at 08:10
How is PDFBox used? I'm new to this and have no idea where to begin. I've downloaded the jar file, but double-clicking it doesn't work. – oivemaria Jul 25 '15 at 03:25
With pdfbox 2.0.5 this code does not compile with error: java.io.FileInputStream cannot be cast to org.apache.pdfbox.io.RandomAccessRead – Asu May 01 '17 at 15:23
I get the error "The constructor PDFParser(FileInputStream) is undefined", and casting to org.apache.pdfbox.io.RandomAccessRead gives an error as well. – Walid Bousseta Jul 12 '18 at 23:35
This answer does not work with the current version of the library, see the answer by @Matthias – betaman Aug 27 '18 at 09:54
Wrap the stream with org.apache.pdfbox.io.RandomAccessBufferedFileInputStream. – qxo Nov 17 '18 at 07:13

score 6 · Answer 3 · answered Nov 27 '16 at 14:31

PDFBox 2.0.3 also has a command-line tool.

Download the jar file, then run:
java -jar pdfbox-app-2.0.3.jar ExtractText [OPTIONS] <inputfile> [output-text-file]

Options:
  -password  <password>        : Password to decrypt document
  -encoding  <output encoding> : UTF-8 (default) or ISO-8859-1, UTF-16BE, UTF-16LE, etc.
  -console                     : Send text to console instead of file
  -html                        : Output in HTML format instead of raw text
  -sort                        : Sort the text before writing
  -ignoreBeads                 : Disables the separation by beads
  -debug                       : Enables debug output about the time consumption of every stage
  -startPage <number>          : The first page to start extraction(1 based)
  -endPage <number>            : The last page to extract(inclusive)
  <inputfile>                  : The PDF document to use
  [output-text-file]           : The file to write the text to
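
If the tool needs to be started from a Java program rather than a shell, the same command line can be assembled with `ProcessBuilder`. This is only a sketch under a few assumptions: `java` is on the PATH, the jar sits in the working directory, and `input.pdf` is a placeholder file name. `ExtractTextCommand` and `build` are hypothetical names; only options from the list above are used.

```java
import java.util.ArrayList;
import java.util.List;

class ExtractTextCommand {
    /** Builds an ExtractText command line that prints a page range to the console. */
    static List<String> build(String appJar, String pdf, int startPage, int endPage) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(appJar);
        cmd.add("ExtractText");
        cmd.add("-console");                    // text goes to stdout instead of a file
        cmd.add("-startPage");
        cmd.add(String.valueOf(startPage));     // 1-based
        cmd.add("-endPage");
        cmd.add(String.valueOf(endPage));       // inclusive
        cmd.add(pdf);
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = build("pdfbox-app-2.0.3.jar", "input.pdf", 1, 5);
        // inheritIO() forwards the tool's output to this process's console
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        System.exit(process.waitFor());
    }
}
```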

score 2 · Answer 4 · answered Jun 04 '18 at 10:16

Maven dependency:

    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.9</version>
    </dependency>

Then the function to get the PDF text as a String:

private static String readPDF(File pdf) throws InvalidPasswordException, IOException {
    // try-with-resources closes the document when the method returns
    try (PDDocument document = PDDocument.load(pdf)) {
        // Encrypted documents are skipped; the caller gets null
        if (document.isEncrypted()) {
            return null;
        }

        PDFTextStripper tStripper = new PDFTextStripper();
        String pdfFileInText = tStripper.getText(document);

        // Split on line breaks (not on all whitespace) and normalize them to "\n"
        StringBuilder sb = new StringBuilder();
        for (String line : pdfFileInText.split("\\r?\\n")) {
            System.out.println(line);
            sb.append(line).append('\n');
        }
        return sb.toString();
    }
}
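
The line-splitting step in this answer does not depend on PDFBox, so it can be tried on plain strings. A small sketch, assuming nothing beyond the JDK; `LineSplit` and `toLines` are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;

class LineSplit {
    /** Splits text on "\n" or "\r\n"; trailing empty lines are dropped by String.split. */
    static List<String> toLines(String text) {
        return Arrays.asList(text.split("\\r?\\n"));
    }

    public static void main(String[] args) {
        List<String> lines = toLines("first line\r\nsecond line\nthird line");
        for (String line : lines) {
            System.out.println("[" + line + "]");
        }
    }
}
```

Note that the regex matches line breaks only, so spaces and tabs inside a line are left untouched.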

score 0 · Answer 5 · answered Sep 14 '17 at 05:46

This works fine with PDFBox 2.0.6 for extracting data from a PDF file that has text content:

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFTextExtractor {
    public static void main(String[] args) {
        // Arguments: file path, page number (1-based), start text, end text
        System.out.println(readParaFromPDF("C:\\sample1.pdf", 3, "Enter Start Text Here", "Enter Ending Text Here"));
    }

    public static String readParaFromPDF(String pdfPath, int pageNo, String startIdentifier, String endIdentifier) {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(new File(pdfPath))) {
            if (document.isEncrypted()) {
                return "";
            }
            // Restrict extraction to the requested page
            PDFTextStripper tStripper = new PDFTextStripper();
            tStripper.setStartPage(pageNo);
            tStripper.setEndPage(pageNo);
            String pdfFileInText = tStripper.getText(document);

            int startIndex = pdfFileInText.indexOf(startIdentifier);
            // Look for the end marker only after the start marker
            int endIndex = startIndex < 0
                    ? -1
                    : pdfFileInText.indexOf(endIdentifier, startIndex + startIdentifier.length());
            if (endIndex < 0) {
                return "No ParaGraph Found";
            }
            // The result includes both markers
            return pdfFileInText.substring(startIndex, endIndex + endIdentifier.length());
        } catch (IOException e) {
            return "No ParaGraph Found";
        }
    }
}
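
The marker search in this answer can be checked without a PDF at hand by isolating it as a pure string function. The following is a sketch only; `TextBetween` and `between` are hypothetical names, and returning null for a missing marker is an assumption made here for illustration (the answer's code returns a message string instead).

```java
class TextBetween {
    /**
     * Returns the text from the first occurrence of start up to and including
     * the next occurrence of end, or null if either marker is missing.
     */
    static String between(String text, String start, String end) {
        int startIndex = text.indexOf(start);
        if (startIndex < 0) {
            return null;
        }
        // Search for the end marker only after the start marker
        int endIndex = text.indexOf(end, startIndex + start.length());
        if (endIndex < 0) {
            return null;
        }
        return text.substring(startIndex, endIndex + end.length());
    }

    public static void main(String[] args) {
        String page = "Header\nAbstract: PDFBox extracts text. End of abstract.\nFooter";
        System.out.println(between(page, "Abstract:", "End of abstract."));
    }
}
```

Searching for the end marker from the position after the start marker avoids a failure when the end text also appears earlier on the page.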
