How to retrieve font type and style attributes from a PDF using PDFBox

satish john
  • Possible duplicate? http://stackoverflow.com/questions/6939583/how-to-extract-font-styles-of-text-contents-using-pdfbox – Kim Jun 04 '12 at 12:22
  • Kim, thanks for the reply. I tried this and got java.util.EmptyStackException at java.util.Stack.peek(Stack.java:85) at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:601) at pdf22box.main(pdf22box.java:13) – satish john Jun 05 '12 at 04:23
  • However, I am able to get the text from the PDF. – satish john Jun 05 '12 at 04:24
  • I get the following result after trying getFonts. Could you help me understand the content? {TT1=org.apache.pdfbox.pdmodel.font.PDTrueTypeFont@74b2002f, TT2=org.apache.pdfbox.pdmodel.font.PDTrueTypeFont@522a4983} {TT4=org.apache.pdfbox.pdmodel.font.PDTrueTypeFont@79f6f296, TT3=org.apache.pdfbox.pdmodel.font.PDTrueTypeFont@43b09468} – satish john Jun 05 '12 at 05:09
  • What I see are the objects and their addresses. I guess you need to read out the contents of those objects (i.e. by using their properties, such as the name of the font). – Kim Jun 05 '12 at 12:14

1 Answer

If you want to get the font of a single character in the PDF document, you can call textPosition.getFont().getFontDescriptor().getFontName(), where textPosition is an instance of the class TextPosition.

During text extraction, each character of a PDF document is represented by a TextPosition object.

You can get the TextPosition objects of a PDF document by overriding the processTextPosition(TextPosition t) method of PDFTextStripper or with the getCharactersByArticle() method of PDFTextStripper.

For the latter, extend the PDFTextStripper class to expose the protected getCharactersByArticle() method, like this:

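import java.io.IOException;
import java.util.List;
import java.util.Vector;

// PDFBox 1.x package names
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class MyPDFTextStripper extends PDFTextStripper {

    public MyPDFTextStripper() throws IOException {
        super();
    }

    // Public wrapper around the protected getCharactersByArticle().
    public Vector<List<TextPosition>> myGetCharactersByArticle() {
        return getCharactersByArticle();
    }
}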

... to get the list of TextPositions for a single page use:

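MyPDFTextStripper stripper = new MyPDFTextStripper();
PDDocument doc = PDDocument.load(new File(filename));
// setStartPage/setEndPage are 1-based, hence the +1 for a 0-based pageNr.
stripper.setStartPage(pageNr + 1);
stripper.setEndPage(pageNr + 1);
// getText() has to run first: it fills the per-article character lists.
stripper.getText(doc);
Vector<List<TextPosition>> list = stripper.myGetCharactersByArticle();
// Call doc.close() once you are done with the TextPosition objects.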

... and finally to get the font for a single character just type:

textPosition.getFont().getFontDescriptor().getFontName()
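
The string returned by getFontName() often carries the style only implicitly, for example "ABCDEF+Arial-BoldMT". The following is a minimal sketch, not part of PDFBox, of how such a name could be split into a family and bold/italic flags. The class FontNameStyle and its methods are hypothetical helpers. It assumes a non-null name, that an embedded subset is tagged with six uppercase letters followed by "+", and that the style shows up as "Bold", "Italic" or "Oblique" in the name. That last part is only a naming convention, so treat the result as a heuristic and prefer the font descriptor's own weight and italic-angle fields where they are set.

```java
import java.util.Locale;

// Hypothetical helper: derives style hints from a PDF font name.
final class FontNameStyle {
    final String family;
    final boolean bold;
    final boolean italic;

    private FontNameStyle(String family, boolean bold, boolean italic) {
        this.family = family;
        this.bold = bold;
        this.italic = italic;
    }

    // Drops a subset tag such as "ABCDEF+" (six uppercase letters, then '+').
    static String stripSubsetPrefix(String fontName) {
        if (fontName.length() > 7 && fontName.charAt(6) == '+') {
            for (int i = 0; i < 6; i++) {
                char c = fontName.charAt(i);
                if (c < 'A' || c > 'Z') {
                    return fontName; // not a subset tag, keep the name as is
                }
            }
            return fontName.substring(7);
        }
        return fontName;
    }

    // Heuristic: looks for style keywords in the name (assumption, not a rule).
    static FontNameStyle parse(String fontName) {
        String name = stripSubsetPrefix(fontName);
        String lower = name.toLowerCase(Locale.ROOT);
        boolean bold = lower.contains("bold");
        boolean italic = lower.contains("italic") || lower.contains("oblique");

        // The family is taken to be everything before the first '-' or ','.
        int dash = name.indexOf('-');
        int comma = name.indexOf(',');
        int cut = dash < 0 ? comma : (comma < 0 ? dash : Math.min(dash, comma));
        String family = cut < 0 ? name : name.substring(0, cut);

        return new FontNameStyle(family, bold, italic);
    }

    public static void main(String[] args) {
        FontNameStyle style = parse("ABCDEF+Arial-BoldMT");
        System.out.println(style.family + " bold=" + style.bold + " italic=" + style.italic);
    }
}
```

You would pass it the name obtained above, e.g. FontNameStyle.parse(textPosition.getFont().getFontDescriptor().getFontName()).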