I am trying to extract content from a PDF with iText; specifically, I want to extract all the text that is set in italics. I found the post "get italic lines from a pdf" very helpful and used a similar strategy.
However, for my PDFs I cannot extract any italic text. When I printed the font names, I found that none of them are regular font names; they look like 'AdvPS_TTR', 'FFHNAB+AdvGulliv-I', and 'PFIIDC+AdvOTce3d9a73'. None of them contains the word "italic", so my name check never matches. My questions: what are these fonts, and how can I tell whether one of them is italic?
In case the code is needed, I have pasted it below.
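import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.DocumentFont;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

// Extraction strategy that keeps only text chunks whose font looks italic.
public class ItalicWordExtraction extends SimpleTextExtractionStrategy {
    @Override
    public void renderText(TextRenderInfo renderInfo) {
        DocumentFont font = renderInfo.getFont();
        String[][] familyFontNamesArray = font.getFamilyFontName();
        for (String[] familyFontNames : familyFontNamesArray) {
            for (String familyFontName : familyFontNames) {
                // Debug output: print every font name the PDF reports.
                System.out.println(familyFontName);
                if (familyFontName.toLowerCase().contains("italic")) {
                    // Keep the chunk only if the descriptor also reports a negative italic angle.
                    if (font.getFontDescriptor(BaseFont.ITALICANGLE, 0) < 0) {
                        super.renderText(renderInfo);
                    }
                    // Stop after the first match so a chunk is never appended twice.
                    return;
                }
            }
        }
    }
}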
Here is a sample file: http://www.megafileupload.com/2pHH/pdf2.pdf Look at the references section: all the journal names are in italics, and those are what I want to extract.
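
One thing I noticed while looking at the names: in the PDF format, an embedded font subset gets a tag of six uppercase letters followed by '+', so 'FFHNAB+AdvGulliv-I' is presumably a subset of a font called 'AdvGulliv-I'. Below is a minimal sketch of a name-based check I could fall back on. The class FontNameHeuristics and its methods are my own hypothetical helpers, not part of iText, and treating suffixes such as '-I' as "italic" is only a guess about the naming convention; a name like 'PFIIDC+AdvOTce3d9a73' gives no hint at all, so this cannot be the whole answer.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of iText): guesses italic style from a PDF font name. */
final class FontNameHeuristics {
    // A subset font name starts with a tag of exactly six uppercase letters and '+'.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    private FontNameHeuristics() {
    }

    /** Removes the subset tag, e.g. "FFHNAB+AdvGulliv-I" becomes "AdvGulliv-I". */
    static String stripSubsetPrefix(String fontName) {
        return SUBSET_TAG.matcher(fontName).replaceFirst("");
    }

    /** A guess only: true if the name carries a common italic marker (assumed convention). */
    static boolean looksItalic(String fontName) {
        String name = stripSubsetPrefix(fontName).toLowerCase(Locale.ROOT);
        return name.contains("italic")
                || name.contains("oblique")
                || name.endsWith("-i")
                || name.endsWith("-it")
                || name.endsWith("-bi");
    }
}
```

Inside renderText, this would replace the contains("italic") test with FontNameHeuristics.looksItalic(familyFontName), but I am not sure a name heuristic is reliable for these fonts.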