PDFBox getCharactersByArticle() returns the vector for the last page only

Question

I am trying to get text details such as coordinates, width and height using the following code (adapted from an existing solution), but the output contains only the text from the last page.

Code

public static void main( String[] args ) throws IOException    {
        String fileName = "apache.pdf";

        PDFParser parser = new PDFParser(new FileInputStream(fileName));
        parser.parse();

        StringWriter outString = new StringWriter();

        CustomPDFTextStripper stripper = new CustomPDFTextStripper();
        stripper.writeText(parser.getPDDocument(), outString);

        Vector<List<TextPosition>> vectorlistoftps = stripper.getCharactersByArticle();

        for (int i = 0; i < vectorlistoftps.size(); i++) {
            List<TextPosition> tplist = vectorlistoftps.get(i);
            for (int j = 0; j < tplist.size(); j++) {
                TextPosition text = tplist.get(j);
                System.out.println(" String "
                        + "[x: " + text.getXDirAdj() + ", y: "
                        + text.getY() + ", height:" + text.getHeightDir()
                        + ", space: " + text.getWidthOfSpace() + ", width: "
                        + text.getWidthDirAdj() + ", yScale: " + text.getYScale() + "]"
                        + text.getCharacter() +" Font "+ text.getFont().getBaseFont() + " PageNUm "+ (i+1));
            }
        }
}

CustomPDFTextStripper class:

class CustomPDFTextStripper extends PDFTextStripper
{
    //Vector<Vector<List<TextPosition>>> data = new Vector<Vector<List<TextPosition>>>();
    public CustomPDFTextStripper() throws IOException {
        super();
    }

    public Vector<List<TextPosition>> getCharactersByArticle(){
       // data.add(charactersByArticle);
        return charactersByArticle;
    }
}

I tried to add the vectors to a list, but when the stripper runs it iterates through all the pages, and only the last page's details remain in the charactersByArticle vector, so that is what gets returned. How do I get the info for all pages?

did you try `stripper.setStartPage()` and `stripper.setEndPage()`? — Tilman Hausherr, May 21 '18 at 19:56
Hi @TilmanHausherr, I tried that, but I got text info only for the page I set in **stripper.setEndPage()**. So I temporarily fixed it by iterating over the pages of the PDF and changing the value passed to **setEndPage()** on each iteration. I'm looking for a better solution than this. Thank you. — ksa, May 22 '18 at 07:28

score 0 · Accepted Answer · answered Jun 08 '18 at 15:26

Temporary Fix:

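I changed the main method so that each iteration restricts the stripper to the current page and then reads the text info for that page. Not a good idea though, since the stripper is run once per page.

    // pageCount was not defined in the original snippet; it is assumed to come from the document.
    PDDocument document = parser.getPDDocument();
    int pageCount = document.getNumberOfPages();
    for (int page = 1; page <= pageCount; page++) {
        // Page numbers are 1-based; limit the stripper to a single page.
        stripper.setStartPage(page);
        stripper.setEndPage(page);
        stripper.writeText(document, outString);
        // After writeText(), charactersByArticle holds the articles of this page only.
        Vector<List<TextPosition>> vectorlistoftps = stripper.getCharactersByArticle();
        PDPage thisPage = stripper.getCurrentPage();
        for (int i = 0; i < vectorlistoftps.size(); i++) {
            List<TextPosition> tplist = vectorlistoftps.get(i);
            // Use tplist (one article of the current page) here.
        }
    }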
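
For anyone wondering why only the last page survives, here is a minimal plain-Java model of the behaviour described in the question. It is a sketch, not PDFBox code: `ModelStripper`, `PageCollectingStripper`, `endPage()` and the "page as a string" input are all hypothetical names made up for illustration. The assumption is that the stripper reuses one buffer and resets it at the start of every page, so a subclass has to copy the buffer at the end of each page if it wants to keep all of them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a text stripper; this is NOT the PDFBox API.
class ModelStripper {
    // Shared buffer that is reset for every page (plays the role of charactersByArticle).
    protected final List<String> currentPageChars = new ArrayList<>();

    // Each string in 'pages' stands for the text of one page.
    void process(List<String> pages) {
        for (String page : pages) {
            currentPageChars.clear();               // contents of the previous page are lost here
            for (char c : page.toCharArray()) {
                currentPageChars.add(String.valueOf(c));
            }
            endPage();                              // per-page hook for subclasses
        }
    }

    // Called after each page has been processed; does nothing by default.
    protected void endPage() { }

    List<String> getCurrentPageChars() {
        return currentPageChars;
    }
}

// Subclass that snapshots the shared buffer after every page.
class PageCollectingStripper extends ModelStripper {
    private final List<List<String>> pages = new ArrayList<>();

    @Override
    protected void endPage() {
        // Copy the buffer, because the parent clears and reuses it for the next page.
        pages.add(new ArrayList<>(currentPageChars));
    }

    List<List<String>> getPages() {
        return pages;
    }
}
```

After `process(...)` finishes, `getCurrentPageChars()` holds only the last page, while `getPages()` has one entry per page. Whether PDFBox offers an equivalent per-page hook that can be overridden in `CustomPDFTextStripper` depends on the version in use, so treat that mapping as an assumption to check against the PDFTextStripper source.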
