I have been trying to extract text from a PDF, and thanks to iText I can extract the whole text. However, I am also trying to detect the fonts of the headings, and with this information I plan to extract only the text between two specific headings. For example, in a scientific paper I want to extract only the "introduction" part. To do this I followed the approach from the following question:

Getting Text fonts from a pdf file using iText

However, it seems to report the same font type for all words, which is not correct when I check manually (copying and pasting into a Word document lets me see the different fonts). Here is the code that I wrote.

PdfReader reader = new PdfReader(pdf);
PrintWriter out = new PrintWriter(new OutputStreamWriter(new FileOutputStream(txt), "UTF-8"));
SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();

for (int j = 1; j <= reader.getNumberOfPages(); j++) {
    out.println(PdfTextExtractor.getTextFromPage(reader, j, semTextExtractionStrategy));
}

out.flush();
out.close();

And here is the class that I created for the extraction strategy.

public class SemTextExtractionStrategy implements TextExtractionStrategy {

    private String text;

    @Override
    public void beginTextBlock() {
    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        text = renderInfo.getText();

        System.out.println(renderInfo.getFont().getFontType());
        System.out.println(renderInfo.getFont().getFullFontName());
        System.out.println(text);
    }

    @Override
    public void endTextBlock() {
    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
    }

    @Override
    public String getResultantText() {
        return text;
    }
}

public static void main(String args[]) {
    trial credentials = new trial();
}}

With this code I get output like the following. Every chunk is reported with font type 4.

...
4                             --> font type
[[Ljava.lang.String;@4371767c --> font getFullFontName() ---> it must be HelveticaNeue-Bold
INTRODUCTION                  --> original text

4
[[Ljava.lang.String;@4e19e97b --> it must be AGaramond-Regular
We

4
[[Ljava.lang.String;@72fb24c  --> it must be AGaramond-Regular
have

...

Community
mlee_jordan
1 Answer

When you get to know Java better, you'll learn that outputs like yours

[[Ljava.lang.String;@4371767c --> font getFullFontName() ---> it must be HelveticaNeue-Bold
[[Ljava.lang.String;@4e19e97b --> it must be AGaramond-Regular
[[Ljava.lang.String;@72fb24c  --> it must be AGaramond-Regular

are typical String representations of arrays of arrays of Strings.

Thus, to inspect the values, you should start by iterating over the array returned by getFullFontName(). Each entry is again an array, so you should iterate over those, too; the entries therein are Strings and, therefore, the elements you want to print out.

If you want to know what this array of arrays of Strings contains, you'll also learn to appreciate the benefits of looking at the code, or at least the JavaDocs, of third-party libraries. In the case of your line

System.out.println(renderInfo.getFont().getFullFontName());

you find this description of the method getFullFontName in BaseFont.java:

/** Gets the full name of the font. If it is a True Type font
 * each array element will have {Platform ID, Platform Encoding ID,
 * Language ID, font name}. The interpretation of this values can be
 * found in the Open Type specification, chapter 2, in the 'name' table.<br>
 * For the other fonts the array has a single element with {"", "", "",
 * font name}.
 * @return the full name of the font
 */
public abstract String[][] getFullFontName();
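
As a rough illustration of that iteration, here is a minimal sketch in plain Java. It does not call iText: the `sample` array is a hand-built stand-in for whatever `getFullFontName()` returns, and the class name `FontNameDump` and the helper `fontNames` are made up for this example. It assumes, going by the Javadoc quoted above, that the font name is the last element of each inner array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FontNameDump {

    // Collects the last element of each inner array. Assumption (from the
    // quoted Javadoc): that element is the font name.
    static List<String> fontNames(String[][] fullFontName) {
        List<String> names = new ArrayList<>();
        for (String[] entry : fullFontName) {
            if (entry.length > 0) {
                names.add(entry[entry.length - 1]);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for renderInfo.getFont().getFullFontName()
        String[][] sample = { { "", "", "", "HelveticaNeue-Bold" } };

        // Printing the array itself only shows its type and identity hash,
        // which is the kind of output seen in the question.
        System.out.println(sample);

        // Arrays.deepToString walks the nested arrays and prints the Strings.
        System.out.println(Arrays.deepToString(sample));

        // Just the names, one per entry.
        System.out.println(fontNames(sample)); // prints [HelveticaNeue-Bold]
    }
}
```

In the extraction strategy, the same helper could be fed the return value of `getFullFontName()` instead of the hand-built array.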

Take a look at the FontFactoryExample example to get an idea of the information stored in this two-dimensional array: font_factory.pdf

You may want to use the getPostscriptFontName() method instead.

Bruno Lowagie
mkl
  • Thank you @mkl. By iterating over the arrays I can access the names. However, I now also want to retrieve the font size for each piece of text. I can see the font size under "gs" when I debug, but I could not reach it through "renderInfo". – mlee_jordan Nov 07 '14 at 18:25
  • I'm afraid the font size is not officially available. Using reflection you can access it via that gs member, though. – mkl Nov 07 '14 at 18:58
  • By applying this solution stackoverflow.com/questions/10879336/… it seems I managed to get the font size for each piece of text. However, I observed that even within the same part of the text (e.g. the introduction of an article) the font sizes are not stable. Is that possible in a PDF, or might the solution that I use be wrong? Thanks in advance. @mkl – mlee_jordan Nov 10 '14 at 20:18
  • *font sizes are not stable* - That is because the [solution you refer to](http://stackoverflow.com/a/10896457/1729265) actually returns the ascent, not the font size. – mkl Nov 11 '14 at 10:02
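
For readers unfamiliar with the reflection route mentioned in the comments, here is a minimal sketch of the general technique in plain Java. It deliberately does not touch iText: `StandInRenderInfo` and `StandInGraphicsState` are hypothetical stand-ins, and the field name `fontSize` is an assumption made for this example. Only the `gs` member name comes from the comments above. Check the actual field names in the iText version you use before adapting it.

```java
import java.lang.reflect.Field;

final class ReflectionSketch {

    // Hypothetical stand-ins, NOT iText classes. They only mimic the idea of
    // a private "gs" member that holds a font size.
    static final class StandInGraphicsState {
        private final float fontSize; // assumed field name
        StandInGraphicsState(float fontSize) { this.fontSize = fontSize; }
    }

    static final class StandInRenderInfo {
        private final StandInGraphicsState gs;
        StandInRenderInfo(StandInGraphicsState gs) { this.gs = gs; }
    }

    // Reads a field declared directly on the target's class, ignoring its
    // access modifier.
    static Object readPrivateField(Object target, String fieldName)
            throws ReflectiveOperationException {
        Field field = target.getClass().getDeclaredField(fieldName);
        field.setAccessible(true);
        return field.get(target);
    }

    // Follows renderInfo -> gs -> fontSize via reflection.
    static float fontSizeOf(StandInRenderInfo info) throws ReflectiveOperationException {
        Object gs = readPrivateField(info, "gs");
        return (Float) readPrivateField(gs, "fontSize");
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        StandInRenderInfo info = new StandInRenderInfo(new StandInGraphicsState(9.5f));
        System.out.println(fontSizeOf(info));
    }
}
```

Reflection of this kind depends on private implementation details, so it can break without warning when the library is upgraded.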