I have been trying to extract text from a PDF, and thanks to iText I can extract the whole text. However, I am now trying to detect the fonts used for headings, so that I can extract only the text between two specific headings. For example, in a scientific paper I want to extract only the "Introduction" section. To do this, I followed the approach from this question:
Getting Text fonts from a pdf file using iText
However, it seems to report the same font type for every word, which is not correct when I check manually (copying and pasting into a Word document shows the different fonts). Here is the code that I wrote:
PdfReader reader = new PdfReader(pdf);
PrintWriter out = new PrintWriter(new OutputStreamWriter(new FileOutputStream(txt), "UTF-8"));
SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();
// Run the custom strategy over every page (iText page numbers start at 1)
for (int j = 1; j <= reader.getNumberOfPages(); j++) {
    out.println(PdfTextExtractor.getTextFromPage(reader, j, semTextExtractionStrategy));
}
out.flush();
out.close();
And here is the class that I created for the extraction strategy:
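import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class SemTextExtractionStrategy implements TextExtractionStrategy {

    // Holds only the most recently rendered chunk
    private String text;

    @Override
    public void beginTextBlock() {
    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        text = renderInfo.getText();
        // Debug output: font type, font name, then the chunk text
        System.out.println(renderInfo.getFont().getFontType());
        System.out.println(renderInfo.getFont().getFullFontName());
        System.out.println(text);
    }

    @Override
    public void endTextBlock() {
    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
    }

    @Override
    public String getResultantText() {
        return text;
    }
}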
public static void main(String args[]) {
trial credentials = new trial();
}}
Running this code produces output like the following. Every chunk is reported with font type 4:
...
4 --> font type
[[Ljava.lang.String;@4371767c --> font getFullFontName() ---> it must be HelveticaNeue-Bold
INTRODUCTION --> original text
4
[[Ljava.lang.String;@4e19e97b --> it must be AGaramond-Regular
We
4
[[Ljava.lang.String;@72fb24c --> it must be AGaramond-Regular
have
...
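For reference, the `[[Ljava.lang.String;@...` lines are how Java prints a `String[][]`, so `getFullFontName()` seems to return an array rather than a single name. As far as I can tell, 4 is also the value of `BaseFont.FONT_TYPE_DOCUMENT`, which would explain why every chunk reports the same type. Below is a minimal, iText-free sketch of what I am ultimately after. It rests on assumptions I have not confirmed for my file: that each row of the array has the form {platform ID, encoding ID, language ID, font name}, as the `BaseFont` Javadoc describes; that embedded subset fonts carry a six-letter tag such as `ABCDEF+`; and that a heading can be recognized by its font name alone and arrives as a single chunk. `FontNameSketch`, `readableName`, `stripSubsetPrefix` and `textBetweenHeadings` are hypothetical names, and the chunk list is made-up sample data modelled on the output above.

```java
import java.util.Arrays;
import java.util.List;

public class FontNameSketch {

    // Assumption: each row is {platform ID, encoding ID, language ID, font name},
    // so the readable name is the last element of the first row.
    public static String readableName(String[][] fullFontName) {
        if (fullFontName == null || fullFontName.length == 0) {
            return "";
        }
        String[] row = fullFontName[0];
        if (row == null || row.length == 0) {
            return "";
        }
        return stripSubsetPrefix(row[row.length - 1]);
    }

    // Assumption: subset fonts are named like "ABCDEF+HelveticaNeue-Bold";
    // drop the six-letter tag and the '+' when that pattern is present.
    public static String stripSubsetPrefix(String name) {
        if (name != null && name.matches("[A-Z]{6}\\+.+")) {
            return name.substring(7);
        }
        return name;
    }

    // Each chunk is {text, fontName}. Collects the text that follows the heading
    // `startHeading` (compared case-insensitively) until the next chunk that
    // uses the heading font.
    public static String textBetweenHeadings(List<String[]> chunks,
                                             String headingFont,
                                             String startHeading) {
        StringBuilder section = new StringBuilder();
        boolean inside = false;
        for (String[] chunk : chunks) {
            boolean isHeading = headingFont.equals(chunk[1]);
            if (isHeading) {
                if (inside) {
                    break; // the next heading ends the section
                }
                inside = startHeading.equalsIgnoreCase(chunk[0].trim());
                continue;
            }
            if (inside) {
                if (section.length() > 0) {
                    section.append(' ');
                }
                section.append(chunk[0]);
            }
        }
        return section.toString();
    }

    public static void main(String[] args) {
        String[][] raw = {{"", "", "", "ABCDEF+HelveticaNeue-Bold"}};
        System.out.println(raw);               // array reference, not the name
        System.out.println(readableName(raw)); // HelveticaNeue-Bold

        // Made-up sample chunks: {text, font name}
        List<String[]> chunks = Arrays.asList(
                new String[] {"ABSTRACT", "HelveticaNeue-Bold"},
                new String[] {"Some abstract.", "AGaramond-Regular"},
                new String[] {"INTRODUCTION", "HelveticaNeue-Bold"},
                new String[] {"We", "AGaramond-Regular"},
                new String[] {"have", "AGaramond-Regular"},
                new String[] {"METHODS", "HelveticaNeue-Bold"});
        System.out.println(
                textBetweenHeadings(chunks, "HelveticaNeue-Bold", "introduction")); // We have
    }
}
```

In the real strategy, `renderText` would presumably append one {text, font name} pair per call instead of overwriting a single field. `BaseFont` also has a `getPostscriptFontName()` method that returns a plain `String`, which may be simpler than unpacking the array.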