
I need to convert a PDF to plain text (it's the "statement of votes" from our county registrar). The files are big (around 2,000 pages) and mostly contain tables. Once I get the text, I'm going to parse it with a program I'm writing and load the data into a database. I've tried the 'Save as text' function in Adobe Reader, but it is not as precise as I'd like, especially at delimiting the table data into CSV. So, any recommendations for tools or Java libraries that would do the trick?

Gary Kephart
  • I have a feeling the table data might cause you some headaches... – Knobloch Feb 24 '09 at 21:15
  • Yes. Also the table headers and page headers. Although consistent throughout a document, they are not consistent between different documents. One document per election, and it seems like they keep changing the format each election. – Gary Kephart Feb 24 '09 at 21:24

7 Answers


Two options:

  1. iText - it seems the PdfTextExtractor class can do what you want.

  2. Apache PDFBox claims "PDF to text extraction" as its top feature. There's an ExtractText command line tool specifically for this (source code), based on its PDFTextStripper class. And there's a PDFBox Text Extraction Guide, too!
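The ExtractText tool mentioned above can also be driven from your own code. A minimal sketch of building and printing the command line, assuming a pdfbox-app jar on disk (the jar name and file paths here are placeholders to swap for your own, and the tool's arguments should be checked against the PDFBox version you download):

```java
import java.util.Arrays;
import java.util.List;

class PdfBoxExtract {
    // Builds the argument list for PDFBox's ExtractText command-line tool.
    // The jar name and file paths are placeholder assumptions.
    static List<String> extractTextCommand(String jar, String pdf, String txt) {
        return Arrays.asList("java", "-jar", jar, "ExtractText", pdf, txt);
    }

    public static void main(String[] args) {
        // Print the command; to actually run it, pass the list to
        // new ProcessBuilder(cmd).inheritIO().start().
        List<String> cmd = extractTextCommand(
                "pdfbox-app-2.0.27.jar", "statement-of-votes.pdf", "statement-of-votes.txt");
        System.out.println(String.join(" ", cmd));
        // prints: java -jar pdfbox-app-2.0.27.jar ExtractText statement-of-votes.pdf statement-of-votes.txt
    }
}
```

Wrapping the tool this way keeps the extraction step scriptable alongside the parsing program the question describes.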

Michael Myers
  • iText can do some reading, I think but there may be better tools (PDFBox as you mentioned, perhaps) to achieve that... – Knobloch Feb 24 '09 at 21:14
  • OK, just tried this out. It worked pretty well on the table data; however, the column headers were messed up, probably because they are vertically aligned text. – Gary Kephart Feb 24 '09 at 23:22
  • The reference to PDFBox, though at a different URL now, was still quite useful to me tonight! :-) – Arjan Aug 27 '12 at 20:02

Given the title of the question: Apache Tika worked very well for me to extract plain text from PDF. I've not used it to get text from tables though.

For PDF it's actually using PDFBox. But besides PDF, it does the same for other formats like Microsoft Word (doc and docx), Excel and PowerPoint, OpenOffice.org/LibreOffice ODT, HTML, XML, and many more. Its AutoDetectParser makes fetching text from any input easy.

And if one needs to process the resulting text (like by passing it to Mahout for classification), one can use ParsingReader to get the result into a Reader while a background thread extracts it. Finally, while extracting the content, it also fills in the metadata it finds:

import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.UncheckedIOException;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ParsingReader;

// logger: any SLF4J-style logger
public Reader getPlainTextReader(final InputStream is) {
    try {
        Detector detector = new DefaultDetector();
        Parser parser = new AutoDetectParser(detector);
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        Metadata metadata = new Metadata();

        // ParsingReader parses in a background thread; the metadata is
        // filled in as the content is extracted.
        Reader reader = new ParsingReader(parser, is, metadata, context);

        for (String name : metadata.names()) {
            for (String value : metadata.getValues(name)) {
                logger.debug("Document {}: {}", name, value);
            }
        }

        return reader;

    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}
Arjan

PDFTextStream is our Java + .NET library for extracting content from PDF documents; you might give it a shot. It also provides some rudimentary table-data extraction utilities, which sit on top of PDFTextStream's table detection capabilities. It's by no means a general solution (though we're working on one of those, too!), but if the tabular data is clearly defined (e.g. rows and columns bounded by lines), you may find what's there now sufficient.

cemerick

I have always found the xpdf tools very useful.

We successfully use its PDF-to-text conversion to convert PDF business documents for use in EDI. The option to preserve the layout works well at keeping things positioned consistently for parsing in a program.
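Once layout-preserving extraction has produced positionally faithful text, the column gaps can be turned into CSV by splitting on runs of spaces. A sketch of that idea, assuming two or more consecutive spaces mark a column boundary (a threshold you would tune against your own documents):

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class LayoutToCsv {
    // Split one line of layout-preserved text into CSV fields, treating runs
    // of two or more spaces as column gaps. The two-space threshold is an
    // assumption to tune per document.
    static String toCsvLine(String layoutLine) {
        return Arrays.stream(layoutLine.trim().split(" {2,}"))
                .map(f -> '"' + f.replace("\"", "\"\"") + '"')
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine("Precinct 101    1,234    56"));
        // prints: "Precinct 101","1,234","56"
    }
}
```

Quoting every field keeps embedded commas in vote totals like 1,234 from breaking the CSV.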

Jarod Elliott
  • This worked well for me. The -layout flag helped keep the tables in a usable format in the text file. – Tim Perry Jul 07 '10 at 23:14

Use a text (line) printer to print to file.

dirkgently

I use iText and I've been really happy with it. I've used xmlpdf before, and iText is far superior in my opinion.

SacramentoJoe

Without knowing the layout of the pages in your PDF it is difficult to say.

I would suggest downloading and trying both iText and PDFBox. You will find text-extraction examples for both on their websites; you should have an extractor running in under 30 minutes, assuming you know your way around Java.

Start with PDFBox, as its text extraction abilities are better than iText's.

Someone else has mentioned xpdf, and that may be useful for you. It's a C library with some command-line tools built around it. It has a number of text extractors, and you may be able to format the output easily enough. Again, it really depends on your page layout.

Steve Claridge