
I can use PDFBox to extract the outlines from a PDF, but it works for some PDFs and not for others.

Every PDF has outlines, and when I open a PDF in a PDF reader, I can click an outline to jump to a certain page.

Screenshots: a PDF whose outlines I can get, and a PDF whose outlines I can't get.

Here is my code:

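import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDDocumentOutline;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineNode;

public static void main(String[] args) {
    // filePath is a String defined elsewhere (the path to the PDF).
    try (PDDocument document = PDDocument.load(new File(filePath))) {
        PDDocumentOutline outline = document.getDocumentCatalog().getDocumentOutline();
        if (outline == null) {
            // getDocumentOutline() returns null when the PDF has no outline tree.
            System.out.println("This document has no outlines.");
            return;
        }
        getOutlines(document, outline, "");
    } catch (InvalidPasswordException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

public static void getOutlines(PDDocument document, PDOutlineNode bookmark, String indentation) throws IOException {
    PDOutlineItem current = bookmark.getFirstChild();
    while (current != null) {
        // findDestinationPage() returns null if the item does not point to a page.
        PDPage currentPage = current.findDestinationPage(document);
        String pageNumber = currentPage == null
                ? "(no page)"
                : String.valueOf(document.getPages().indexOf(currentPage) + 1);
        System.out.println(indentation + current.getTitle() + "-------->" + pageNumber);
        // Recurse into the children, one indentation level deeper.
        getOutlines(document, current, indentation + "    ");
        current = current.getNextSibling();
    }
}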
th000
  • If I understand you correctly, your code works for some files and not for others. Thus, it would be helpful if you shared an example file for which it does not work. – mkl Nov 22 '18 at 07:08
  • "Every PDF has outlines" - no. – Tilman Hausherr Nov 22 '18 at 07:22
  • @mkl [the PDF has outlines but I can't parse it](https://drive.google.com/open?id=1LgLPXJAi-e6lRaylvw6Rqna1Kcj7yHmB), and I want to understand the structure of a PDF file, as described in this [link](https://stackoverflow.com/questions/88582/structure-of-a-pdf-file) – th000 Nov 22 '18 at 07:27
  • @TilmanHausherr I mean that the PDFs I want to parse have outlines. – th000 Nov 22 '18 at 07:28
  • Your link requires individual permission. – Tilman Hausherr Nov 22 '18 at 07:32
  • @TilmanHausherr You need to sign in to your account to view and download the PDF. – th000 Nov 22 '18 at 07:35
  • That file does not have outlines. – Tilman Hausherr Nov 22 '18 at 07:55
  • @TilmanHausherr [Try this](https://drive.google.com/file/d/1LgLPXJAi-e6lRaylvw6Rqna1Kcj7yHmB/view?usp=sharing). When I open the PDF, I can click an outline to go to a certain page (on the PDF page `IV`). – th000 Nov 22 '18 at 07:56
  • That is the same file. It does not have outlines. "Outlines" is the panel with text on the left in Adobe Reader that is sometimes visible and sometimes must be clicked to become visible. Your file has a table of contents on page 8 (shown as "IV" on the page). That is not an outline as in your screenshot. That's just a bunch of link annotations. – Tilman Hausherr Nov 22 '18 at 08:03
  • @TilmanHausherr Is there a method to get these `link annotations`, just like the outlines? – th000 Nov 22 '18 at 08:06
  • Try the `PrintURLs.java` example from the source code download. Although this is about external links, you can adjust it for internal links. – Tilman Hausherr Nov 22 '18 at 08:52
  • I can only confirm @Tilman's analysis. And just a word of warning: it is usually much more difficult to retrieve correctly structured information from TOCs in the page contents than from actual outlines. As you are trying to retrieve such data from many documents, be ready for some incorrect extractions and add a feature to your app for manually correcting the extracted data. – mkl Nov 22 '18 at 09:27
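
The suggestion above (adapting a link-annotation walk to internal links) could look roughly like the following. This is only a sketch, assuming PDFBox 2.x on the classpath; the class name `InternalLinks`, the method `listInternalLinks`, and the output format are made up for illustration and are not taken from `PrintURLs.java`. It visits every page, keeps the link annotations whose target is a page inside the same document, and reports 1-based source and target page numbers.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionGoTo;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.destination.PDDestination;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.destination.PDNamedDestination;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.destination.PDPageDestination;

// Hypothetical helper class, not part of PDFBox or its examples.
public class InternalLinks {

    // Returns one "page X -> page Y" entry per internal link (1-based page numbers).
    static List<String> listInternalLinks(PDDocument document) throws IOException {
        List<String> result = new ArrayList<>();
        int sourcePage = 0;
        for (PDPage page : document.getPages()) {
            sourcePage++;
            for (PDAnnotation annotation : page.getAnnotations()) {
                if (!(annotation instanceof PDAnnotationLink)) {
                    continue; // not a link annotation
                }
                PDAnnotationLink link = (PDAnnotationLink) annotation;
                // A link stores its target either directly or inside a GoTo action.
                PDDestination destination = link.getDestination();
                PDAction action = link.getAction();
                if (destination == null && action instanceof PDActionGoTo) {
                    destination = ((PDActionGoTo) action).getDestination();
                }
                // Named destinations have to be resolved through the catalog first.
                if (destination instanceof PDNamedDestination) {
                    destination = document.getDocumentCatalog()
                            .findNamedDestinationPage((PDNamedDestination) destination);
                }
                if (destination instanceof PDPageDestination) {
                    // retrievePageNumber() is 0-based and returns -1 if the page is unknown.
                    int target = ((PDPageDestination) destination).retrievePageNumber();
                    if (target >= 0) {
                        result.add("page " + sourcePage + " -> page " + (target + 1));
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path to the PDF file.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            for (String entry : listInternalLinks(document)) {
                System.out.println(entry);
            }
        }
    }
}
```

Unlike outline items, link annotations carry no title. To recover the visible table-of-contents text you would additionally have to extract the page text under each link's rectangle (`link.getRectangle()`), which is where the incorrect extractions mentioned above tend to creep in.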

0 Answers