
I've been working on a VB.NET project to dynamically create report packs in PDF format using a SQL database and a number of input PDF templates. To cut a long story short, due to the way that Business Objects creates the input files it will be much more efficient to allow input of compiled PDF reports rather than individual report template pages. In order for this to work however, we would need to split the input PDF files into sections using the Bookmarks created by BOBJ. We are not sure how many pages will be in the range of each bookmark but require a consistent naming convention of the split files so that the next part of the process can pick the correct templates up and merge them in the required combinations.

The second part of this process is designed and working well using a .NET library called PDFsharp. I have used the samples on their website to write some code which splits an input PDF file into one file per page, but I do not understand how to split it using the bookmarks.

If I could understand how to parse the PDF and read in the metadata for the bookmarks (the name of each bookmark and its start and end pages), then I think I could finish it.

An example of the input PDF format is here: https://drive.google.com/open?id=0B0GZGW6CFCI-UWY2WGRvV0dQSWZSNnNOWlp4R21zbFVPZDBn

There are 5 bookmarks (TID01, TID02 ...) and 6 pages. Section TID04 would have two pages output.

The file names I would need would be in the format of "ExamplePDF_TID01.pdf"
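
To make the target concrete, here is a minimal sketch of the range and naming logic only, with no PDFsharp calls. It assumes the bookmark titles and their zero-based start pages have already been read from the PDF by some means, and that a section ends one page before the next bookmark starts. `Section`, `BuildSections` and `OutputFileName` are hypothetical names for illustration, and the start pages in `Main` are an assumption chosen so that TID04 covers two of the six pages.

```vbnet
Imports System
Imports System.Collections.Generic

Module BookmarkRanges

    ' Hypothetical holder for one split section (not a PDFsharp type).
    Public Class Section
        Public Property Name As String
        Public Property FirstPage As Integer   ' zero-based, inclusive
        Public Property LastPage As Integer    ' zero-based, inclusive
    End Class

    ' Turns bookmark start pages into inclusive page ranges.
    ' Assumes titles and startPages are parallel lists sorted by start page.
    Public Function BuildSections(titles As IList(Of String),
                                  startPages As IList(Of Integer),
                                  pageCount As Integer) As List(Of Section)
        Dim sections As New List(Of Section)
        For i As Integer = 0 To titles.Count - 1
            ' A section ends one page before the next bookmark starts;
            ' the last section runs to the end of the document.
            Dim lastPage As Integer = If(i < titles.Count - 1, startPages(i + 1) - 1, pageCount - 1)
            sections.Add(New Section With {.Name = titles(i), .FirstPage = startPages(i), .LastPage = lastPage})
        Next
        Return sections
    End Function

    ' Builds a name in the required format, e.g. "ExamplePDF_TID01.pdf".
    Public Function OutputFileName(inputBaseName As String, bookmarkTitle As String) As String
        Return inputBaseName & "_" & bookmarkTitle & ".pdf"
    End Function

    Sub Main()
        Dim titles As String() = {"TID01", "TID02", "TID03", "TID04", "TID05"}
        ' Assumed start pages (not read from the example file).
        Dim starts As Integer() = {0, 1, 2, 3, 5}
        For Each s As Section In BuildSections(titles, starts, 6)
            Console.WriteLine("{0}: pages {1}-{2}", OutputFileName("ExamplePDF", s.Name), s.FirstPage, s.LastPage)
        Next
    End Sub

End Module
```

With ranges like these, each output document would just need the pages from `FirstPage` to `LastPage` added in a loop. The piece I am missing is how to get the titles and start pages out of the PDF in the first place.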

Any help to move forward would be greatly appreciated. Looking at the wiki for the project, it seems that it isn't very active any more, and whilst other people have asked questions about this in the past, there aren't any answers that I can find.

Code to Split by Page:

Imports System.IO
Imports PdfSharp.Pdf
Imports PdfSharp.Pdf.IO

Sub Splitfiles()

    'inputdir = folder path containing input files
    Dim inputdir As String = "O:\Transformation\Standardisation\Input PDFs"
    Dim outputdir As String = "O:\Transformation\Standardisation\Input PDFs\output\"

    'Only pick up PDF files; anything else in the folder would fail to open
    For Each filename As String In Directory.GetFiles(inputdir, "*.pdf")
        'Import mode is needed so that pages can be copied into another document
        Dim importdoc As PdfDocument = PdfReader.Open(filename, PdfSharp.Pdf.IO.PdfDocumentOpenMode.Import)
        'File name without folder or extension; expand this to find CC ID
        Dim ccid As String = Path.GetFileNameWithoutExtension(filename)

        For pageid As Integer = 0 To importdoc.PageCount - 1
            'One output document per input page (pageid is zero-based)
            Dim outputdoc As New PdfDocument()
            outputdoc.AddPage(importdoc.Pages(pageid))
            outputdoc.Save(outputdir & ccid & "_" & pageid & ".pdf")
        Next
    Next
End Sub

And the code I started to split by bookmark but couldn't finish:

Sub SplitPDFByBookmark()

    Dim inputfile As String = "O:\Transformation\Standardisation\Input PDFs\Business Sub Area Report - Project Management - FY16_FP02 - 17062016_0709.PDF"
    Dim outputdir As String = "O:\Transformation\Standardisation\Input PDFs\output\"
    Dim pdfpage As PdfPage
    Dim outputfilename As String
    Dim importdoc As PdfDocument = PdfReader.Open(inputfile, PdfSharp.Pdf.IO.PdfDocumentOpenMode.Import)
    Dim pageid As Integer = 0

    For Each bookmark In importdoc.Outlines
        Dim outputdoc As PdfDocument = New PdfDocument
        'Unfinished: this is where I need the page(s) that the bookmark points to
        pdfpage = importdoc.Pages(importdoc.Outlines.)
        outputdoc.AddPage(pdfpage)
        outputfilename = outputdir & "OutputFile_" & pageid & ".pdf"
        outputdoc.Save(outputfilename)
        pageid = pageid + 1
    Next

End Sub

Thanks in advance for your help!

user2916488
  • This post also seems to hint at an answer. Unfortunately it is written in C#, which I don't know, but I will have a go at converting it. http://stackoverflow.com/questions/9884414/how-to-read-pdf-bookmarks-programmatically/37890735#37890735 – user2916488 Sep 16 '16 at 12:36
  • I don't have time for a full answer, but last time I looked at this I believe the `document.Outlines` field was not populated. You will want to use lower-level document access as shown in [my answer](http://stackoverflow.com/questions/9884414/how-to-read-pdf-bookmarks-programmatically/37890735?noredirect=1#comment66377931_37890735). This solution requires some jumping through hoops to get a page number and will need to be extended to suit your needs. The PDF spec and PDFSharp source code are great resources. Good luck! – 0xcaff Sep 16 '16 at 18:01
  • Thank you, have now run into the pdf version 6 issue too so may start looking for alternatives to future proof the solution. – user2916488 Sep 18 '16 at 07:07
  • It isn't that hard to add support for xref (iref?) streams. You could fork PDFSharp. The PDF Spec is your friend. – 0xcaff Sep 19 '16 at 00:16
