
Is it possible to get a page's title via iText?

  • PdfTextExtractor returns all the text on the page, but I don't know which line is the title. The title may also span more than one line.
  • I don't know the coordinates of the title, so I can't use RegionTextRenderFilter.
  • I could analyze the font size and take the line(s) with the largest font, but TextRenderInfo doesn't expose its graphics state publicly (the field is declared private final GraphicsState gs).
  • Any other ideas?
Lazy

1 Answer


Pages within a PDF don't have titles. They just have text that happens to be bold or set in a large font and that sits nearer the top of the page than other text. It sounds like you know this already; I just want to be clear about it.

See my post here which shows how to get font information by subclassing ITextExtractionStrategy. My sample targets iTextSharp, which is the .NET port of iText, but the two match pretty much feature for feature. The biggest difference is that Java uses getXXX and setXXX whereas .NET just uses XXX for both. Otherwise everything should port just fine.

The moral of the story is that you are going to have to write some arbitrary rules defining what you think of as a "title" and then parse based on those rules.
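As a rough illustration of what such rules might look like, here is a minimal sketch in plain Java. It assumes you have already collected the page's text chunks together with an estimated font size and a baseline y coordinate, for example from a custom extraction strategy. The `Chunk` class, the `guessTitle` method, and the 0.5 point tolerance are all hypothetical names and values chosen for this example; none of them are part of iText. The rule used is deliberately simple: treat every chunk set in the largest font on the page as part of the title and read those chunks from top to bottom.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TitleGuesser {

    /** Hypothetical holder for one extracted piece of text; not an iText type. */
    public static final class Chunk {
        final String text;
        final float fontSize; // estimated font size in points (assumed to be supplied by the caller)
        final float y;        // baseline y coordinate; in PDF space, larger y is higher on the page

        public Chunk(String text, float fontSize, float y) {
            this.text = text;
            this.fontSize = fontSize;
            this.y = y;
        }
    }

    /**
     * Arbitrary rule: the "title" is all text set in the largest font on the page,
     * joined from top to bottom. Returns an empty string for an empty page.
     */
    public static String guessTitle(List<Chunk> chunks) {
        // Find the largest font size used on the page.
        float max = 0f;
        for (Chunk c : chunks) {
            max = Math.max(max, c.fontSize);
        }

        // Keep chunks whose size is within a small tolerance of the maximum.
        final float tolerance = 0.5f; // assumed value; tune for your documents
        List<Chunk> picked = new ArrayList<>();
        for (Chunk c : chunks) {
            if (max - c.fontSize <= tolerance) {
                picked.add(c);
            }
        }

        // Sort by descending y so that multi-line titles read top to bottom.
        picked.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed());

        StringBuilder title = new StringBuilder();
        for (Chunk c : picked) {
            if (title.length() > 0) {
                title.append(' ');
            }
            title.append(c.text);
        }
        return title.toString();
    }

    public static void main(String[] args) {
        List<Chunk> page = new ArrayList<>();
        page.add(new Chunk("Body text of the first paragraph.", 12f, 600f));
        page.add(new Chunk("2012", 24f, 670f));          // second title line
        page.add(new Chunk("Annual Report", 24f, 700f)); // first title line
        System.out.println(guessTitle(page));
    }
}
```

A rule like this will misfire on pages where the largest text is a pull quote or a drop cap, so in practice you would probably combine it with other checks, such as requiring the text to sit in the upper part of the page.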

Chris Haas