Reading text in pdf file and split into multiple pdf files

Question

I have a consolidated PDF file in which each page carries an ID and a page number in the form "Page X of Y". I need to split this one PDF into multiple PDF files based on the "Page X of Y" text. I am building a proof of concept with iText, but I am struggling to read the "Page X of Y" text to identify the page numbers at which to split the file. Could anyone shed some light on how to implement this in Java?

I tried the code below:

    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.io.RandomAccessFile;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public static void main(String[] args) {
        File file = new File("C:\\basics\\outbound\\FPPStmts.pdf");
        try {
            // PDFBox 2.0.8 require org.apache.pdfbox.io.RandomAccessRead
            RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r");
            PDFParser parser = new PDFParser(randomAccessFile);
            parser.parse();
            COSDocument cosDoc = parser.getDocument();
            PDDocument pdDoc = new PDDocument(cosDoc);
            try {
                PDFTextStripper pdfStripper = new PDFTextStripper();
                // Extract the text of pages 1 and 2 only
                pdfStripper.setStartPage(1);
                pdfStripper.setEndPage(2);
                String parsedText = pdfStripper.getText(pdDoc);
                System.out.println(parsedText);
            } finally {
                // Release the file handle even if extraction fails
                pdDoc.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

This prints blank text even though my PDF contains data.

Possible duplicate of [How to extract text from a PDF file with Apache PDFBox](https://stackoverflow.com/questions/23813727/how-to-extract-text-from-a-pdf-file-with-apache-pdfbox) — Stef, Feb 26 '19 at 10:52
Use the latest PDFBox version (2.0.13) and read the FAQ: https://pdfbox.apache.org/2.0/faq.html#how-come-i-am-not-getting-any-text-from-the-pdf-document — Tilman Hausherr, Feb 26 '19 at 12:08
*"PDFBox 2.0.8 require org.apache.pdfbox.io.RandomAccessRead"* - actually the recommended way to read a PDF is by using `PDDocument.load`, not by using `PDFParser` directly, so you normally don't have to care what arguments `PDFParser` requires... — mkl, Feb 26 '19 at 12:10
Concerning @Stef's referenced question: ignore the accepted answer at first (as it is very pre-2.0.0-ish) and look at the newer answers. If their respective code also only extracts blank text, please share the PDF in question to allow reproducing the issue. — mkl, Feb 26 '19 at 12:15
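
Once per-page text extraction works, the splitting decision itself needs no PDF library. The following is a minimal sketch, assuming the text of each page has already been extracted into a list (for example with `PDFTextStripper`, setting the start and end page to the same number) and assuming that a new sub-document begins wherever the marker reads "Page 1 of Y". The class name `PageRangeFinder` and the method `findRanges` are hypothetical names chosen for illustration, and the sample strings are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageRangeFinder {

    // Matches a marker such as "Page 2 of 5"; group 1 is X, group 2 is Y.
    private static final Pattern PAGE_X_OF_Y =
            Pattern.compile("Page\\s+(\\d+)\\s+of\\s+(\\d+)", Pattern.CASE_INSENSITIVE);

    /**
     * pageTexts.get(i) is the extracted text of physical page i + 1.
     * Returns 1-based inclusive {start, end} ranges, one per sub-document.
     * A new range starts at every page whose marker reads "Page 1 of Y";
     * pages without a recognizable marker stay in the current range.
     */
    static List<int[]> findRanges(List<String> pageTexts) {
        List<int[]> ranges = new ArrayList<>();
        int start = 1;
        for (int i = 0; i < pageTexts.size(); i++) {
            int pageNo = i + 1;
            Matcher m = PAGE_X_OF_Y.matcher(pageTexts.get(i));
            boolean startsNewDoc = m.find() && Integer.parseInt(m.group(1)) == 1;
            if (startsNewDoc && pageNo > start) {
                // Close the previous sub-document just before this page.
                ranges.add(new int[] {start, pageNo - 1});
                start = pageNo;
            }
        }
        if (!pageTexts.isEmpty()) {
            // The last sub-document runs to the final page.
            ranges.add(new int[] {start, pageTexts.size()});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Made-up page texts standing in for real extraction results.
        List<String> texts = Arrays.asList(
                "ID 1001 Page 1 of 2",
                "ID 1001 Page 2 of 2",
                "ID 1002 Page 1 of 1",
                "ID 1003 Page 1 of 3",
                "ID 1003 Page 2 of 3",
                "ID 1003 Page 3 of 3");
        for (int[] r : findRanges(texts)) {
            System.out.println(r[0] + "-" + r[1]);
        }
    }
}
```

Each resulting range can then be written to its own file, for instance by copying those pages into a new `PDDocument` or by using PDFBox's `Splitter`. If the extracted text is still blank, the ranges cannot be computed this way, and the FAQ entry linked in the comments is the place to start.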

0 Answers