PDF text extraction using iText

Question

We are doing research in information extraction, and we would like to use iText.

We are in the process of exploring iText. According to the literature we have reviewed, iText is the best tool to use. Is it possible to extract text from a PDF line by line with iText? I have read a related question here on Stack Overflow, but it only covers reading the text, not extracting it. Can anyone help me with my problem? Thank you.

I'm not completely clear on what you are doing. Reading text and extracting text are generally the same thing. iText won't save the text to a file for you but once you have the text you should be able to do that fairly easily. iText does a really great job of extracting text as long as it is actually text (not outlines or bitmaps). When searching this site also look for `iTextSharp` which is the .Net port of iText. It has more questions/answers and the code is almost completely the same for C#. — Chris Haas, Jan 11 '12 at 19:01

score 21 · Answer 1 · edited Jul 21 '16 at 20:36

As Theodore said, you can extract text from a PDF and, as Chris pointed out,

as long as it is actually text (not outlines or bitmaps)

The best thing to do is buy Bruno Lowagie's book iText in Action. In the second edition, chapter 15 covers extracting text.

You can also look at his site for examples: http://itextpdf.com/examples/iia.php?id=279

You can parse the PDF to create a plain text file. Here is a code example:

/*
 * This class is part of the book "iText in Action - 2nd Edition"
 * written by Bruno Lowagie (ISBN: 9781935182610)
 * For more info, go to: http://itextpdf.com/examples/
 * This example only works with the AGPL version of iText.
 */

package part4.chapter15;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class ExtractPageContent {

    /** The original PDF that will be parsed. */
    public static final String PREFACE = "resources/pdfs/preface.pdf";
    /** The resulting text file. */
    public static final String RESULT = "results/part4/chapter15/preface.txt";

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException
     */
    public void parsePdf(String pdf, String txt) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));
        TextExtractionStrategy strategy;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            out.println(strategy.getResultantText());
        }
        reader.close();
        out.flush();
        out.close();
    }

    /**
     * Main method.
     * @param    args    no arguments needed
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        new ExtractPageContent().parsePdf(PREFACE, RESULT);
    }
}

Note the license:

This example only works with the AGPL version of iText.

The other examples on the site show how to leave out parts of the text or how to extract only parts of the PDF.
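Since the question asks for the text line by line, here is a minimal sketch of one way to get there from the page text. It assumes the string returned by getResultantText() separates lines with line break characters, which may not hold for every PDF. The class name PageLinesSketch, the method splitIntoLines, and the sample string are made up for illustration and are not part of iText.

```java
import java.util.ArrayList;
import java.util.List;

public class PageLinesSketch {

    /** Splits one page's extracted text into its non-blank lines. */
    public static List<String> splitIntoLines(String pageText) {
        List<String> lines = new ArrayList<>();
        // \R matches any line break sequence (\n, \r\n, \r, ...).
        for (String line : pageText.split("\\R")) {
            if (!line.trim().isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for strategy.getResultantText(); not real extractor output.
        String pageText = "Preface\n\nThis is the first line.\nThis is the second line.\n";
        List<String> lines = splitIntoLines(pageText);
        for (int i = 0; i < lines.size(); i++) {
            System.out.println((i + 1) + ": " + lines.get(i));
        }
    }
}
```

In the example above you would call splitIntoLines(strategy.getResultantText()) inside the page loop instead of printing the whole page at once.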

Hope it helps.

The program shown works like a charm, without a single modification! Thanks to Bruno and thanks to you for pointing it out. — gsl, Aug 18 '20 at 16:10

score 3 · Answer 2 · answered Jan 12 '12 at 10:10

iText allows you to do that, but there is no guarantee about the granularity of the text blocks; that depends on the actual PDF renderers used to produce your documents.

It's quite possible that each word or even each letter has its own text block. Nor do the blocks need to be in reading order; for reliable results you may have to reorder text blocks based on their coordinates. You may also have to calculate whether you need to insert spaces between text blocks.
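To make the reordering idea concrete, here is a rough sketch in plain Java with no iText calls. It assumes you have already collected each text block with its left and right x coordinates and its baseline y. The Chunk class, the toLines method, and the two thresholds are hypothetical names and values chosen for illustration; real documents will need tuning.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ChunkReorderSketch {

    /** Hypothetical text block: its text, left and right x, and baseline y. */
    public static final class Chunk {
        final String text;
        final float startX, endX, baselineY;

        public Chunk(String text, float startX, float endX, float baselineY) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.baselineY = baselineY;
        }
    }

    /**
     * Rebuilds lines from chunks that may arrive in any order.
     * Chunks whose baseline is within yTolerance of a line's first chunk join that line;
     * a horizontal gap wider than spaceGap gets a space inserted.
     */
    public static List<String> toLines(List<Chunk> chunks, float yTolerance, float spaceGap) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y grows upward, so sort by descending y to go from top to bottom.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.baselineY));

        // Group chunks with nearly equal baselines into the same line.
        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk c : sorted) {
            List<Chunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).baselineY - c.baselineY) <= yTolerance) {
                last.add(c);
            } else {
                lines.add(new ArrayList<>(Arrays.asList(c)));
            }
        }

        // Within each line, order left to right and insert spaces at wide gaps.
        List<String> result = new ArrayList<>();
        for (List<Chunk> line : lines) {
            line.sort(Comparator.comparingDouble((Chunk c) -> c.startX));
            StringBuilder sb = new StringBuilder();
            Chunk prev = null;
            for (Chunk c : line) {
                if (prev != null && c.startX - prev.endX > spaceGap) {
                    sb.append(' ');
                }
                sb.append(c.text);
                prev = c;
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up chunks, deliberately out of order and split mid-word.
        List<Chunk> chunks = Arrays.asList(
            new Chunk("world", 40f, 65f, 700f),
            new Chunk("Second", 10f, 45f, 680f),
            new Chunk("Hel", 10f, 22f, 700f),
            new Chunk("lo", 22f, 32f, 700f),
            new Chunk("line", 50f, 68f, 680f));
        toLines(chunks, 2f, 3f).forEach(System.out::println);
        // prints "Hello world" then "Second line"
    }
}
```

Grouping by baseline first and sorting by x afterwards avoids a single comparator with a tolerance, which would not give a consistent ordering.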

score 1 · Answer 3 · answered Sep 01 '22 at 12:00

In newer versions of iText (7 and later):

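import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;
import com.itextpdf.kernel.pdf.canvas.parser.listener.SimpleTextExtractionStrategy;

public static void main(String[] args) throws Exception {
    try (var document = new PdfDocument(new PdfReader("your.pdf"))) {
        // Page numbers are 1-based, so the loop must include the last page (<=).
        for (int i = 1; i <= document.getNumberOfPages(); i++) {
            // A fresh strategy per page keeps text from earlier pages out of the result.
            var strategy = new SimpleTextExtractionStrategy();
            String text = PdfTextExtractor.getTextFromPage(document.getPage(i), strategy);
            System.out.println(text);
        }
    }
}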
