
I have a consolidated PDF file in which every page carries an ID and a page number of the form "Page X of Y". I need to split this one PDF into multiple PDF files based on the "Page X of Y" text. I am trying to build a POC using iText, but I am struggling to read the "Page X of Y" text to identify the page numbers I need for splitting the file. Could someone shed some light on how to implement this in Java?

I tried the code below:

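    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.io.RandomAccessFile;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    // The class name is arbitrary; only main() comes from my original code.
    public class ExtractText {
        public static void main(String[] args) {
            File file = new File("C:\\basics\\outbound\\FPPStmts.pdf");
            try {
                // PDFBox 2.0.8 require org.apache.pdfbox.io.RandomAccessRead
                RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r");
                PDFParser parser = new PDFParser(randomAccessFile);
                parser.parse();
                COSDocument cosDoc = parser.getDocument();
                // try-with-resources closes the document once the text is extracted
                try (PDDocument pdDoc = new PDDocument(cosDoc)) {
                    PDFTextStripper pdfStripper = new PDFTextStripper();
                    // Extract text from pages 1 and 2 only
                    pdfStripper.setStartPage(1);
                    pdfStripper.setEndPage(2);
                    String parsedText = pdfStripper.getText(pdDoc);
                    System.out.println(parsedText);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }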

This prints only blank text, even though my PDF contains data.

Tilman Hausherr
Venkata Ramireddy CH
  • Possible duplicate of [How to extract text from a PDF file with Apache PDFBox](https://stackoverflow.com/questions/23813727/how-to-extract-text-from-a-pdf-file-with-apache-pdfbox) – Stef Feb 26 '19 at 10:52
  • Your code isn't itext, it is PDFBox. – Tilman Hausherr Feb 26 '19 at 12:07
  • Use the latest PDFBox version (2.0.13) and read the FAQ: https://pdfbox.apache.org/2.0/faq.html#how-come-i-am-not-getting-any-text-from-the-pdf-document – Tilman Hausherr Feb 26 '19 at 12:08
  • *"PDFBox 2.0.8 require org.apache.pdfbox.io.RandomAccessRead"* - actually the recommended way to read a PDF is by using `PDDocument.load`, not by using `PDFParser` directly, so you normally don't have to care what arguments `PDFParser` requires... – mkl Feb 26 '19 at 12:10
  • Concerning @Stef's referenced question: ignore the accepted answer at first (as it is very pre-2.0.0-ish) and look at the newer answers. If their respective code also only extracts blank text, please share the PDF in question to allow reproducing the issue. – mkl Feb 26 '19 at 12:15
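
Once text extraction works, the splitting decision itself does not need any PDF library. The following is a minimal sketch, not a tested solution for the PDF in question: it assumes the text of each page has already been extracted into one string per page (for example with `PDFTextStripper`, setting start and end page to the same number), and that every sub-document begins on a page labelled "Page 1 of Y". The class name `PageGrouper`, the method `groupPages` and the sample strings are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageGrouper {
    // Matches "Page X of Y", tolerating extra whitespace and case differences.
    private static final Pattern PAGE_X_OF_Y =
            Pattern.compile("Page\\s+(\\d{1,9})\\s+of\\s+(\\d{1,9})", Pattern.CASE_INSENSITIVE);

    /**
     * Returns 1-based, inclusive {firstPage, lastPage} ranges, one per sub-document.
     * A new range starts whenever a page is labelled "Page 1 of Y"; pages without
     * a recognizable label stay in the range that is currently open.
     */
    static List<int[]> groupPages(List<String> pageTexts) {
        List<int[]> ranges = new ArrayList<>();
        int start = 1;
        for (int i = 0; i < pageTexts.size(); i++) {
            int pageNo = i + 1;
            Matcher m = PAGE_X_OF_Y.matcher(pageTexts.get(i));
            boolean startsNewDoc = m.find() && Integer.parseInt(m.group(1)) == 1;
            if (startsNewDoc && pageNo > start) {
                // Close the previous sub-document just before this page.
                ranges.add(new int[] {start, pageNo - 1});
                start = pageNo;
            }
        }
        if (!pageTexts.isEmpty()) {
            // Close the last open sub-document.
            ranges.add(new int[] {start, pageTexts.size()});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Hypothetical per-page text, standing in for real extraction output.
        List<String> texts = Arrays.asList(
                "ID 1001 Page 1 of 2",
                "ID 1001 Page 2 of 2",
                "ID 1002 Page 1 of 1");
        for (int[] r : groupPages(texts)) {
            System.out.println("pages " + r[0] + "-" + r[1]);
        }
    }
}
```

Each resulting range could then be written to its own file, for instance with PDFBox's `PageExtractor` from `org.apache.pdfbox.multipdf`. If the extracted page text is blank, as in the question, no label is found and the whole file ends up in a single range, so the blank-text problem has to be solved first.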
