How to read conditional text from PDF?

Question

I want to read a particular section of a PDF. How is that possible? For example, for the PDF at the URL in the code below, suppose I want to get only the Part 1 data.

    URL url = new URL("https://www.uscis.gov/sites/default/files/files/form/i-129.pdf");

    InputStream is = url.openStream();
    BufferedInputStream fileParse = new BufferedInputStream(is);
    PDDocument document = null;
    document = PDDocument.load(fileParse);
    String pdfContent = new PDFTextStripper().getText(document);

    System.out.println(pdfContent);

Your example file is a hybrid AcroForm / XFA form. This gives you the choice of using either text extraction plus AcroForm value retrieval, or XFA XML parsing. So, are you interested only in PDFs with alternative XFA streams? And are you also interested in form fill-ins, or only in the static content? — mkl, Aug 28 '19 at 11:19

geco17 · Answer 1 · 2019-08-25T09:03:35.193

In your specific case you can set the start and end pages of the stripper such that you don't get the full document each time, then use some simple string operations to get what you need.

Here is a complete, more generic working example based on your code.

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    import java.io.BufferedInputStream;
    import java.io.InputStream;
    import java.net.URL;

    public class App {
        public static void main(String... args) throws Exception {
            String path = "..."; // replace with whatever path you need
            String startDelimiter = "..."; // replace with wherever the start is
            String endDelimiter = "..."; // replace with wherever the end is
            URL url = new URL(path);
            // try-with-resources closes the stream and the document even if parsing fails
            try (InputStream is = new BufferedInputStream(url.openStream());
                 PDDocument document = PDDocument.load(is)) {
                PDFTextStripper stripper = new PDFTextStripper();
                // set the page range if you know roughly where the section is, to avoid stripping the whole PDF
                stripper.setStartPage(1);
                stripper.setEndPage(3);
                // get the content
                String content = stripper.getText(document);
                int start = content.indexOf(startDelimiter);
                // look for the end delimiter only after the start delimiter
                int end = start < 0 ? -1 : content.indexOf(endDelimiter, start + startDelimiter.length());
                if (end < 0) {
                    // indexOf returns -1 when a delimiter is missing; substring would throw on that
                    System.err.println("Delimiters not found in the selected pages");
                    return;
                }
                System.out.println(content.substring(start, end));
            }
        }
    }
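
The "simple string operations" can also be pulled out into a small helper so they can be tried without a PDF at hand. The following is only a sketch: `SectionExtractor` and `extractBetween` are made-up names (not part of PDFBox), and the sample text in `main` is invented for illustration. It returns an empty `Optional` when either delimiter is missing or when the end delimiter only appears before the start delimiter.

```java
import java.util.Optional;

// Hypothetical helper, not part of PDFBox: it only works on text that was already extracted.
final class SectionExtractor {

    private SectionExtractor() {
    }

    /**
     * Returns the text from the first occurrence of startDelimiter up to (but not
     * including) the next occurrence of endDelimiter, or empty if either is missing.
     */
    static Optional<String> extractBetween(String content, String startDelimiter, String endDelimiter) {
        int start = content.indexOf(startDelimiter);
        if (start < 0) {
            return Optional.empty();
        }
        // Search for the end delimiter only after the start delimiter.
        int end = content.indexOf(endDelimiter, start + startDelimiter.length());
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(content.substring(start, end));
    }

    public static void main(String[] args) {
        // Invented sample text standing in for the output of PDFTextStripper.getText(...).
        String sample = "Cover page\nPart 1. First section body\nPart 2. Second section body\n";
        String section = extractBetween(sample, "Part 1.", "Part 2.").orElse("(section not found)");
        System.out.println(section);
    }
}
```

In the example above, `content` from the stripper would be passed in place of `sample`.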

If, on the other hand, you don't know where in the document the section is, you can, with a bit of work, search the document page by page to find the start and end pages first. See this similar question.
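
One way to do that search, as a rough sketch: extract the text of each page separately (for example by calling `getText` once per page with `setStartPage(n)` and `setEndPage(n)` set to the same page), then scan the per-page strings for the delimiters. The class `PageRangeFinder` and its method are hypothetical names, and the code assumes the page texts are already collected in a list. It will not find a delimiter that is split across a page boundary.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of PDFBox: operates on per-page text extracted beforehand.
final class PageRangeFinder {

    private PageRangeFinder() {
    }

    /**
     * Returns {startPage, endPage} (1-based, like setStartPage/setEndPage), where startPage
     * is the first page containing startDelimiter and endPage is the first page at or after
     * it that contains endDelimiter. Empty if no such pair of pages exists.
     */
    static Optional<int[]> findPageRange(List<String> pageTexts, String startDelimiter, String endDelimiter) {
        int startPage = -1;
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            if (startPage < 0) {
                int startIdx = text.indexOf(startDelimiter);
                if (startIdx < 0) {
                    continue; // start not seen yet, keep scanning
                }
                startPage = i + 1;
                // On the start page, the end delimiter only counts if it follows the start.
                if (text.indexOf(endDelimiter, startIdx + startDelimiter.length()) >= 0) {
                    return Optional.of(new int[] {startPage, startPage});
                }
            } else if (text.contains(endDelimiter)) {
                return Optional.of(new int[] {startPage, i + 1});
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Invented page texts standing in for per-page stripper output.
        List<String> pages = Arrays.asList("Cover", "Part 1. Begins here", "continued", "Part 2. Next section");
        findPageRange(pages, "Part 1.", "Part 2.")
                .ifPresent(range -> System.out.println("pages " + range[0] + " to " + range[1]));
    }
}
```

The resulting pair can then be fed to `setStartPage` and `setEndPage` before stripping the text for the section itself.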

Will it be possible to read only the label name wherever a field is available? — hrishikesh basak, Sep 04 '19 at 11:25
