
I am trying to extract all the italic text from a PDF using iText. I found the post "get italic lines from a pdf" very helpful and used a similar strategy.

However, for my PDFs I cannot extract any italic text. When I printed the font names, none of them were regular font names; they look like 'AdvPS_TTR', 'FFHNAB+AdvGulliv-I', and 'PFIIDC+AdvOTce3d9a73'. Since none of them contains the word "italic", my name check never matches. So my questions are: what are these fonts, and how can I tell whether one of them is italic?

In case the code is needed, here it is:

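import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.DocumentFont;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class ItalicWordExtraction extends SimpleTextExtractionStrategy {

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        DocumentFont font = renderInfo.getFont();
        // Each entry is an array of name fields; the font name is among them.
        String[][] familyFontNamesArray = font.getFamilyFontName();
        for (String[] familyFontNames : familyFontNamesArray) {
            for (String familyFontName : familyFontNames) {
                System.out.println(familyFontName);
                if (familyFontName.toLowerCase().contains("italic")) {
                    // Keep the chunk only if the descriptor also reports a slant.
                    if (font.getFontDescriptor(BaseFont.ITALICANGLE, 0) < 0) {
                        super.renderText(renderInfo);
                    }
                    // Stop after the first match so the chunk is never added twice.
                    return;
                }
            }
        }
    }
}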

Here is a sample file: http://www.megafileupload.com/2pHH/pdf2.pdf. In the references section all the journal names are italic; those are what I want to extract.
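
On the font names listed above: in a name such as 'FFHNAB+AdvGulliv-I', the six uppercase letters followed by '+' are the tag that the PDF specification prescribes for font subsets, so the underlying name is 'AdvGulliv-I'. Whether the trailing '-I' really means italic is only a guess based on common naming habits, and names such as 'AdvPS_TTR' or 'AdvOTce3d9a73' reveal nothing about style. Below is a minimal sketch of a name-based check under those assumptions. The class `FontNameHeuristic`, its methods, and the list of markers are hypothetical and not part of iText.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText: guesses italic style from a font name.
final class FontNameHeuristic {

    // A subset tag is six uppercase letters followed by '+', e.g. "FFHNAB+".
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    // Assumed markers: "italic" or "oblique" anywhere in the name, or a short
    // suffix such as "-I", "-It", "-Ital", "-BI" after a separator at the end.
    private static final Pattern ITALIC_MARKER =
            Pattern.compile("(italic|oblique|[-_,](i|it|ital|bi)$)");

    // Returns the font name without a leading subset tag, if one is present.
    static String stripSubsetTag(String fontName) {
        return SUBSET_TAG.matcher(fontName).replaceFirst("");
    }

    // Returns true if the name (minus any subset tag) carries an assumed italic marker.
    static boolean looksItalic(String fontName) {
        String base = stripSubsetTag(fontName).toLowerCase(Locale.ROOT);
        return ITALIC_MARKER.matcher(base).find();
    }

    public static void main(String[] args) {
        String[] names = {"AdvPS_TTR", "FFHNAB+AdvGulliv-I", "PFIIDC+AdvOTce3d9a73"};
        for (String name : names) {
            System.out.println(name + " -> " + stripSubsetTag(name)
                    + ", looks italic: " + looksItalic(name));
        }
    }
}
```

In the extraction strategy, the `contains("italic")` test could be swapped for a call like `FontNameHeuristic.looksItalic(familyFontName)`. This is a heuristic only: a font can be italic without saying so in its name, and a '-I' suffix is not guaranteed to mean italic.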

  • I also found this page, but it did not help enough: http://itextpdf.com/sandbox/parse/ParseCustom – 1a1a11a May 26 '15 at 04:27
  • I also tried `getTextRenderMode()`; it returns 0 for every chunk. – 1a1a11a May 26 '15 at 04:50
  • In addition, `font.getFontDescriptor(BaseFont.ITALICANGLE, 0)` returns 0.0 for every font. – 1a1a11a May 26 '15 at 04:52
  • Thanks, Bruno! I did find this in your book, and after reading the page, which contains some comments, I fully understand it now. @BrunoLowagie – 1a1a11a May 26 '15 at 11:56
  • By the way, I have a question about text extraction. I compared PDFBox and iText and found that iText is much faster, but its output sometimes lacks the space between words, which seldom happens with PDFBox. Is there a parameter I can tune to improve this? @BrunoLowagie – 1a1a11a May 26 '15 at 15:41
  • You can create your own `TextExtractionStrategy` by subclassing `LocationTextExtractionStrategy` and overriding the `isChunkAtWordBoundary()` method. This method calculates the distance between two text chunks and compares it to the width of the space character in the font that is encountered. – Bruno Lowagie May 26 '15 at 16:01
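
The comparison described in the last comment can be sketched without iText as follows. This is only an illustration of the idea: a gap between two chunks counts as a word boundary when it exceeds some fraction of the space width. The class `WordBoundarySketch`, its methods, and the `factor` parameter are hypothetical; in a real strategy the positions and the space width would come from the text chunks that iText passes to the overridden method.

```java
// Hypothetical illustration, not iText code: decides where to insert spaces
// between text chunks that lie on the same line.
final class WordBoundarySketch {

    // prevEnd and nextStart are x positions on the line; spaceWidth is the
    // width of the space character in the current font; factor is the tunable
    // fraction of that width above which a gap counts as a word boundary.
    static boolean isAtWordBoundary(float prevEnd, float nextStart,
                                    float spaceWidth, float factor) {
        float gap = nextStart - prevEnd;
        // A wide forward gap or a large backward jump separates two words.
        return gap > spaceWidth * factor || gap < -spaceWidth;
    }

    // Joins chunks left to right, inserting a space at each detected boundary.
    static String join(String[] texts, float[] starts, float[] ends,
                       float spaceWidth, float factor) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < texts.length; i++) {
            if (i > 0 && isAtWordBoundary(ends[i - 1], starts[i], spaceWidth, factor)) {
                sb.append(' ');
            }
            sb.append(texts[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] texts = {"Hel", "lo", "world"};
        float[] starts = {0f, 15f, 32f};
        float[] ends = {15f, 25f, 60f};
        // A smaller factor inserts spaces more readily.
        System.out.println(join(texts, starts, ends, 5f, 0.5f));
        System.out.println(join(texts, starts, ends, 5f, 2.0f));
    }
}
```

Lowering `factor` makes the sketch insert spaces more eagerly, which is the kind of tuning the question about missing spaces asks for; setting it too low would split words whose letters are widely kerned.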
