I'm trying to automatically extract important keywords from a PDF file. I can already get the text out of the PDF document, but now I need to know which font size and font family these keywords have.

This is the code I have so far:

Main

public static void main(String[] args) throws IOException {
    String src = "SEM_081145.pdf";

    PdfReader reader = new PdfReader(src);

    SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();

    PrintWriter out = new PrintWriter(new FileOutputStream(src + ".txt"));
    Rectangle rect = new Rectangle(70, 80, 490, 580);
    RenderFilter filter = new RegionTextRenderFilter(rect);

    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        // strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        out.println(PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy));
    }
    out.flush();
    out.close();
}

I have also implemented a TextExtractionStrategy, SemTextExtractionStrategy, which looks like this:

public class SemTextExtractionStrategy implements TextExtractionStrategy {

    private String text;

    @Override
    public void beginTextBlock() {
    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        text = renderInfo.getText();

        System.out.println(renderInfo.getFont().getFontType());

        System.out.print(text);
    }

    @Override
    public void endTextBlock() {
    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
    }

    @Override
    public String getResultantText() {
        return text;
    }
}

I can get the font type, but there is no method to get the font size. Is there another way to get the font size of the current text segment?

Or are there other libraries that can extract the font size from text segments? I have already looked into PDFBox and PDFTextStream. The PDF Shareware Library from Aspose would do the job perfectly, but it is very expensive and I need to use an open source project.

Alexis Pigeon
Prine

4 Answers

Thanks to Alexis I could convert his C# solution into Java code:

text = renderInfo.getText();

Vector curBaseline = renderInfo.getBaseline().getStartPoint();
Vector topRight = renderInfo.getAscentLine().getEndPoint();

Rectangle rect = new Rectangle(curBaseline.get(0), curBaseline.get(1), topRight.get(0), topRight.get(1));
float curFontSize = rect.getHeight();
Prine

I had some trouble using Alexis' and Prine's solution, since it doesn't deal with rotated text correctly. So this is what I do (sorry, in Scala):

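val x0 = info.getAscentLine.getEndPoint
val x1 = info.getBaseline.getStartPoint
val x2 = info.getBaseline.getEndPoint
// Squared area of the parallelogram spanned by the baseline and the vector to the ascent line
val length1 = x2.subtract(x1).cross(x1.subtract(x0)).lengthSquared
// Squared length of the baseline
val length2 = x2.subtract(x1).lengthSquared
// length1 / length2 is the squared distance from the ascent line to the baseline, so take the square root
if (length2 == 0) 0.0 else math.sqrt(length1 / length2)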
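
For readers who prefer Java, the same idea can be sketched without any PDF library: the text height is the perpendicular distance from a point on the ascent line to the baseline, which stays the same however the text is rotated. The class and method names below are made up for illustration. With iText, the coordinates would presumably come from `get(0)` and `get(1)` on the start and end points of `getBaseline()` and `getAscentLine()`.

```java
/** Rotation-safe text height from plain coordinates (hypothetical helper, not part of iText). */
final class RotatedTextHeight {

    /**
     * Perpendicular distance from the point (px, py) to the line through
     * (x1, y1) and (x2, y2). Returns 0 if the line has zero length.
     */
    static double distanceToLine(double px, double py,
                                 double x1, double y1,
                                 double x2, double y2) {
        double dx = x2 - x1;
        double dy = y2 - y1;
        double baselineLength = Math.hypot(dx, dy);
        if (baselineLength == 0) {
            return 0;
        }
        // The 2D cross product is the area of the parallelogram spanned by
        // the baseline and the vector from the baseline start to the point.
        double cross = dx * (py - y1) - dy * (px - x1);
        // Area divided by the base length gives the height.
        return Math.abs(cross) / baselineLength;
    }

    public static void main(String[] args) {
        // Horizontal text: baseline (0,0) -> (50,0), ascent line ends at (50,10).
        System.out.println(distanceToLine(50, 10, 0, 0, 50, 0));
        // Same text rotated by 90 degrees: baseline (0,0) -> (0,50), ascent line ends at (-10,50).
        System.out.println(distanceToLine(-10, 50, 0, 0, 0, 50));
    }
}
```

Both calls in `main` describe the same glyph box, once upright and once rotated, and yield the same height. A plain difference of y coordinates would report 0 for the rotated case.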
Wilfred Springer

You can adapt the code provided in this answer, in particular this code snippet:

Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Single curFontSize = rect.Height;

This answer is in C#, but the API is so similar that the conversion to Java should be straightforward.

Alexis Pigeon
  • Thx, gonna try it out later and post the java code for others ;) – Prine Jun 06 '12 at 12:23
  • It is working! Gonna post my Java solution as one answer. Thanks again! – Prine Jun 06 '12 at 15:50
  • A question about this calculation. Should we use the base line or the descent line here? If I use descent line, the resulting numbers seem to better match the "font size" shown by other applications (such as the OS X Preview PDF annotation tool). – Thilo Dec 10 '14 at 07:48

If you want the exact font size, use the following code in your renderText:

float fontsize = renderInfo.getAscentLine().getStartPoint().get(1)
     - renderInfo.getDescentLine().getStartPoint().get(1);

Modify this as indicated in the other answers for rotated text.
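
One possible way to make the ascent-to-descent measurement independent of rotation is to take the straight-line distance between the two start points instead of subtracting only their y coordinates. The sketch below uses plain coordinates and a made-up helper name. With iText, the inputs would presumably be `get(0)` and `get(1)` of `getAscentLine().getStartPoint()` and `getDescentLine().getStartPoint()`. Note that this measures the height of the glyph box in page coordinates, which may not equal the nominal font size for every font.

```java
/** Ascent-to-descent height from plain coordinates (hypothetical helper, not part of iText). */
final class AscentDescentHeight {

    /** Euclidean distance between the ascent-line and descent-line start points. */
    static double height(double ascentX, double ascentY,
                         double descentX, double descentY) {
        return Math.hypot(ascentX - descentX, ascentY - descentY);
    }

    public static void main(String[] args) {
        // Upright text: the two start points differ only in y.
        System.out.println(height(100, 712, 100, 697));
        // Text rotated by 90 degrees: the two start points differ only in x,
        // so a y-only subtraction would give 0 here.
        System.out.println(height(88, 200, 103, 200));
    }
}
```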

KimvdLinde