I have been trying to extract the attributes (font, font size, color, etc.) of each word in a PDF document using the iText library. I can extract the text from every page, but not the attributes, and I haven't found anything in the library that provides them. Please help.

srjit
  • Possible duplicate of [How to check that all used fonts are embedded in PDF with Java iText?](http://stackoverflow.com/questions/4646130/how-to-check-that-all-used-fonts-are-embedded-in-pdf-with-java-itext) – Dave Jarvis Oct 04 '16 at 20:11

1 Answer

I'm not a Java person, so I can't give you working code, but hopefully I can get you 95% of the way there.

First, you'll need to create a class that implements the interface com.itextpdf.text.pdf.parser.TextExtractionStrategy.

Then you can pass an instance of this class as the third parameter to:

PdfTextExtractor.getTextFromPage(PdfReader reader, int pageNumber, TextExtractionStrategy strategy)

One of the methods of that interface is renderText, which gets called for every text chunk that is processed. Each call receives a TextRenderInfo, which has a method called getFont that should give you what you're looking for. Store what it returns in a buffer of some sort; after getTextFromPage returns, you can inspect that buffer to see each font.

If you want to see an example of implementing that interface, look up the code for SimpleTextExtractionStrategy online. Otherwise, here's a C# version that does pretty much what you're looking for.
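
For Java readers, the callback-and-buffer flow can be sketched without iText on the classpath. This is only an illustration of the pattern, not iText code: ChunkInfo, ChunkListener, FontCollector, and FontBufferSketch are hypothetical stand-ins for TextRenderInfo, the renderText callback of TextExtractionStrategy, your own strategy class, and PdfTextExtractor.getTextFromPage respectively, and the font names in main are made-up sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for iText's TextRenderInfo: one text chunk and its font name.
final class ChunkInfo {
    final String text;
    final String fontName;

    ChunkInfo(String text, String fontName) {
        this.text = text;
        this.fontName = fontName;
    }
}

// Hypothetical stand-in for the renderText callback of TextExtractionStrategy.
interface ChunkListener {
    void renderText(ChunkInfo info);
}

// The listener described in the answer: it buffers every chunk it is handed.
final class FontCollector implements ChunkListener {
    private final List<ChunkInfo> buffer = new ArrayList<>();

    @Override
    public void renderText(ChunkInfo info) {
        buffer.add(info); // called once per chunk by the extractor
    }

    List<ChunkInfo> chunks() {
        return buffer;
    }

    // Distinct font names in order of first appearance.
    Set<String> distinctFonts() {
        Set<String> fonts = new LinkedHashSet<>();
        for (ChunkInfo chunk : buffer) {
            fonts.add(chunk.fontName);
        }
        return fonts;
    }
}

final class FontBufferSketch {
    // Hypothetical stand-in for PdfTextExtractor.getTextFromPage: it walks the
    // page content, invokes the callback for each chunk, and returns the text.
    static String extract(List<ChunkInfo> page, ChunkListener listener) {
        StringBuilder text = new StringBuilder();
        for (ChunkInfo chunk : page) {
            listener.renderText(chunk);
            text.append(chunk.text);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for a parsed PDF page.
        List<ChunkInfo> page = Arrays.asList(
                new ChunkInfo("Hello ", "Helvetica"),
                new ChunkInfo("world", "Times-Bold"),
                new ChunkInfo("!", "Helvetica"));

        FontCollector collector = new FontCollector();
        String text = extract(page, collector);

        // After extraction finishes, inspect the buffer.
        System.out.println(text);
        for (ChunkInfo chunk : collector.chunks()) {
            System.out.println(chunk.fontName + " -> \"" + chunk.text + "\"");
        }
        System.out.println(collector.distinctFonts());
    }
}
```

With iText itself, the collector would instead implement TextExtractionStrategy, read the font via getFont inside renderText, and be passed as the third argument to getTextFromPage.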

Chris Haas
  • Hi, thanks a ton for the reply. I didn't know about the concept of callback functions in Java (like 'renderText' here). I could get the font names by calling getFullFontName() on the object returned by the getFont() method, as mentioned in the documentation of "Document Font". :-) – srjit Feb 06 '12 at 19:08