How can I determine the number of Lines in a Page in Apache PDFBox in Java?

I need to split each page into three separate pages so that I can compute some statistics on each part. To do that, I first need to determine how many lines the page has. Then I need to go through the lines and write as many of them as I need to a new page.

I am wondering whether this is possible with PDFBox. (I am completely new to this library and need to figure it out quickly.)
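
Setting the PDF-specific difficulties aside, the bookkeeping described above (take the text lines of one page and distribute them over three new pages) can be sketched on plain strings. This is only a minimal sketch, assuming the lines of a page have already been extracted into a `List<String>`; `LineSplitter` and `splitIntoParts` are hypothetical names, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: distributes the text lines of one
// page over a fixed number of consecutive parts of nearly equal size.
final class LineSplitter {

    static List<List<String>> splitIntoParts(List<String> lines, int parts) {
        if (parts <= 0) {
            throw new IllegalArgumentException("parts must be positive");
        }
        List<List<String>> result = new ArrayList<>();
        int base = lines.size() / parts;   // minimum number of lines per part
        int extra = lines.size() % parts;  // the first 'extra' parts get one more line
        int from = 0;
        for (int p = 0; p < parts; p++) {
            int size = base + (p < extra ? 1 : 0);
            // Copy the sublist so each part is independent of the input list.
            result.add(new ArrayList<>(lines.subList(from, from + size)));
            from += size;
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the lines of one extracted page.
        List<String> page = Arrays.asList(
                "line 1", "line 2", "line 3", "line 4", "line 5", "line 6", "line 7");
        List<List<String>> parts = splitIntoParts(page, 3);
        for (int p = 0; p < parts.size(); p++) {
            System.out.println("Part " + (p + 1) + ": " + parts.get(p));
        }
    }
}
```

Keeping the parts consecutive preserves the reading order of the extracted text; whether that order matches the visual order on the page depends on the PDF.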

Suo6613
  • Your question is a bit confusing: do you just need to know the text line count on a page (very easy with PDFTextStripper), or are you also asking how to split a page into three pages? The latter will be difficult. Lines in a PDF need not be in sequence, and they usually aren't. PDF is not HTML. – Tilman Hausherr May 25 '15 at 17:38
  • What do you mean by "lines"? Lines of text or vector elements? – David van Driessche May 25 '15 at 17:38
  • I mean lines of text. I don't want to make it complex at this point. I just need to split a page into three different pages. For example, if I have a document that contains 3 pages, I would need to create a new document that contains 9 pages (with the exact same text). Is that doable? – Suo6613 May 25 '15 at 18:01
  • @Tilman Hausherr: how can I count the number of text lines using `PDFTextStripper`? I cannot find any method responsible for that! – Suo6613 May 25 '15 at 18:35
  • First of all, do the documents you want to analyze contain easy to determine lines? Especially in case of scanned (and ocr'ed) pages with not properly positioned originals, lines might be difficult to find. Inlined formulas can make line recognition difficult, too. – mkl May 25 '15 at 19:50
  • To count the number of lines, see the answer by Luis (set the start and end page). However, splitting the page in three is almost impossible. PDF doesn't even have the concept of a text line. Most PDFs output a few characters or words at a time, and often not in the sequence that you see on the screen. You would have to analyze the content stream and then, for each little partial sequence, decide which of the three pages it should go on. It might be easier if all PDF files come from the same producer. – Tilman Hausherr May 25 '15 at 20:24
  • Splitting is indeed difficult, but the OP might make use of a technique akin to the one used in [this answer](http://stackoverflow.com/a/29078954/1729265) for iText: he might put the original page into an XObject and paint clipped parts of that XObject onto the target pages. – mkl May 26 '15 at 04:11
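
Both suggestions above (deciding which of the three target pages a piece of text belongs on, or clipping the original page into three regions) come down to the same geometry. The following is a minimal sketch using only `java.awt.geom`, assuming PDF user-space coordinates with the origin at the lower left and three equal-height horizontal bands. `PageBands`, `bands`, and `bandIndexFor` are hypothetical names, and wiring the rectangles into PDFBox (for example as crop or clip regions) is deliberately left out.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox. In every rectangle here, (x, y) is
// taken to be the LOWER-left corner, as in PDF user space (an assumption of
// this sketch; java.awt itself usually treats y as the upper edge).
final class PageBands {

    // Returns 'count' equal-height horizontal bands of the page box, top to bottom.
    static Rectangle2D.Double[] bands(Rectangle2D.Double pageBox, int count) {
        Rectangle2D.Double[] result = new Rectangle2D.Double[count];
        double bandHeight = pageBox.height / count;
        for (int i = 0; i < count; i++) {
            // Lower edge of band i; band 0 is the topmost band.
            double lowerY = pageBox.y + pageBox.height - (i + 1) * bandHeight;
            result[i] = new Rectangle2D.Double(pageBox.x, lowerY, pageBox.width, bandHeight);
        }
        return result;
    }

    // Maps the y coordinate of a text fragment to the index of the band it falls in.
    static int bandIndexFor(Rectangle2D.Double pageBox, int count, double y) {
        double distanceFromTop = pageBox.y + pageBox.height - y;
        int index = (int) Math.floor(distanceFromTop / (pageBox.height / count));
        // Clamp so that coordinates on or outside the page edges still get a band.
        return Math.max(0, Math.min(count - 1, index));
    }

    public static void main(String[] args) {
        // US Letter page in points, used purely as sample input.
        Rectangle2D.Double page = new Rectangle2D.Double(0, 0, 612, 792);
        for (Rectangle2D.Double band : bands(page, 3)) {
            System.out.println(band);
        }
        System.out.println("y=300 falls in band " + bandIndexFor(page, 3, 300));
    }
}
```

A fragment that straddles a band boundary still has to be assigned to one side or the other, which is part of why the comments above call splitting difficult.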

1 Answer

Check out this example that I made for you; I hope it helps:

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

import java.io.File;
import java.io.IOException;

/**
 * Created by ljcp on 5/25/15.
 */
public class TestReadLinePdf {

    public static void main(String[] args) throws IOException {
        File pdfFile = new File("/Users/ljcp/Desktop/test2.pdf");
        PDDocument pdDocument = PDDocument.load(pdfFile);
        try {
            // One stripper is enough; it is restricted to a single page per iteration.
            PDFTextStripper stripper = new PDFTextStripper();
            int pageCount = pdDocument.getNumberOfPages();
            for (int i = 1; i <= pageCount; i++) { // page numbers are 1-based
                stripper.setStartPage(i);
                stripper.setEndPage(i);
                // Replace any literal "visiblespace" in the extracted text with a space.
                String text = stripper.getText(pdDocument).replaceAll("visiblespace", " ");

                // Split on line breaks, tolerating \r\n as well as \n.
                String[] lines = text.split("\\r?\\n");
                System.out.println("Page Number " + i + " lines " + lines.length);
            }
        } finally {
            // Always release the document, even if text extraction fails.
            pdDocument.close();
        }
    }
}
```
Luis
  • Your answer doesn't count the lines per page, it counts them for all pages. – Tilman Hausherr May 25 '15 at 20:26
  • For that you need to run the text stripper page by page: loop over the pages and set both the start page and the end page to 1 to read the first page, to 2 for the second, and so on. – Luis May 25 '15 at 20:33
  • I know that. I'm just suggesting that you update your answer so that you would answer the first part of the question. – Tilman Hausherr May 25 '15 at 22:35
  • Thank you so much. Just one question: what does the following line do? `String text = stripper.getText(pdDocument).replaceAll("visiblespace", " ");` Why do we need that? – Suo6613 May 26 '15 at 14:29
  • @Suo6613 Did this answer solve your problem? If so, please accept it. – Artjom B. Jul 20 '15 at 19:08
  • Since I had overridden some of the functions, I could not use the solution. I finally had to switch to another idea rather than splitting the page into three different pages. – Suo6613 Jul 21 '15 at 19:24
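
On the question about the `replaceAll` call: `String.replaceAll` treats its first argument as a regular expression, so that line replaces every occurrence of the literal text `visiblespace` with a single space and leaves everything else untouched. The counting itself happens in the `split` step. Below is a minimal sketch of that step on a plain string, with no PDFBox involved. `LineCounter` and its methods are hypothetical names, and whether blank lines should count is an assumption that depends on the statistics wanted.

```java
// Hypothetical helper, not part of PDFBox: counts lines in already extracted text.
final class LineCounter {

    // Accepts Windows (\r\n), old Mac (\r), and Unix (\n) line breaks.
    private static final String LINE_BREAK = "\\r\\n|\\r|\\n";

    // Counts all lines, including blank lines before or between text.
    // String.split drops trailing empty strings, so a final line break adds nothing.
    static int countLines(String text) {
        if (text.isEmpty()) {
            return 0; // "".split(...) would otherwise yield one empty element
        }
        return text.split(LINE_BREAK).length;
    }

    // Counts only the lines that contain at least one non-whitespace character.
    static int countNonBlankLines(String text) {
        int count = 0;
        for (String line : text.split(LINE_BREAK)) {
            if (!line.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for text returned by a text stripper for one page.
        String pageText = "First line\r\n\r\nThirdvisiblespaceline\n";
        String cleaned = pageText.replaceAll("visiblespace", " ");
        System.out.println("All lines:       " + countLines(cleaned));
        System.out.println("Non-blank lines: " + countNonBlankLines(cleaned));
    }
}
```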