How to sort bookmarks in PyPDF2 / How to fix broken PDFs

Question

My question is similar to Change order of pdf bookmarks using PyPdf2, except that I need to sort the bookmarks in the destination PDF.

The following code "works" in that it creates a new PDF with sorted bookmarks, BUT their destinations are NOT clickable, because their actions are null when I view their properties in Adobe Acrobat Reader.

import PyPDF2
from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("/Users/ME/Documents/in.pdf")
writer = PdfWriter()

outlines = reader.outlines
result = {}

for o in outlines:
    if isinstance(o, PyPDF2.generic.Destination):  # no sub-bookmarks
        result.update({o['/Title']: o})

# sorted() returns a new list, so the result has to be kept
result = dict(sorted(result.items(), key=lambda item: item[0]))

for pageNum in range(reader.numPages):
    writer.addPage(reader.getPage(pageNum))

newPath = '/Users/ME/Documents/out.pdf'
resultPdf = open(newPath, 'wb')

for k,v in result.items():
    writer.add_bookmark_dict(v)

writer.write(resultPdf)
resultPdf.close()

How can I adjust the code above so that the bookmarks are clickable?

thanks, could you share a working example of sorting bookmarks as described above? — mellow-yellow, Jun 17 '22 at 16:51
thanks, but after reviewing the report.txt, and the possibility of update_info with it (and seeing how bookmarks somehow lose their "zoomed in" properties in the output), I don't see how this workflow would meet the requirement above; it's also not a PyPDF2 solution, although that's not a deal breaker (but would deserve a different Stack Overflow question and answer). — mellow-yellow, Jun 17 '22 at 22:16

mellow-yellow · Answer 1 · 2022-10-08T14:56:01.257

I solved this myself in two ways:

Old way: PyPDF2 (not recommended)
New way: pikepdf (recommended)

Both have these advantages:

they create a new PDF, effectively fixing many broken PDFs
the resulting PDF retains the zoom settings of the original

Both have these disadvantages:

if several root (i.e., top-level parent) bookmarks share the same title, only one of them survives in the resulting PDF (the children are untouched).
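
The loss of same-titled bookmarks follows from both scripts keying a dict by bookmark title, so a later bookmark overwrites an earlier one with the same title. Here is a minimal sketch of the effect and of one possible workaround, which is to sort a list of pairs instead of a dict. It uses plain strings as stand-ins for bookmark objects, and the names `items`, `by_title` and `sort_keeping_duplicates` are hypothetical, not part of either script:

```python
# Stand-ins for top-level bookmarks: (title, payload) pairs.
items = [("Notes", "page 3"), ("Intro", "page 1"), ("Notes", "page 9")]

# Keying a dict by title keeps only the last entry per title.
by_title = {}
for title, payload in items:
    by_title[title] = payload


def sort_keeping_duplicates(pairs):
    """Sort (title, payload) pairs by title without dropping duplicates."""
    # sorted() is stable, so equal titles keep their original relative order.
    return sorted(pairs, key=lambda pair: pair[0].lower())


kept = sort_keeping_duplicates(items)
print(len(by_title), len(kept))
```

Whether keeping duplicates is desirable depends on the PDF; the scripts below keep the dict-based approach.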

To install either one, copy and paste the code into a file named freesort.py somewhere on your computer, then open a command line (i.e., shell) and run python3 freesort.py /input/path/foo.pdf /output/path/foo.pdf. Alternatively, run chmod +x freesort.py (to make the file executable) and then ./freesort.py /input/path/foo.pdf /output/path/foo.pdf.

pikepdf:

#!/usr/bin/env python3

"""
freesort.py 2022-07-08 Sean W
Purpose: sort top-level bookmarks only (i.e., leave children alone)
Usage: freesort.py /input/path/foo.pdf /output/path/foo.pdf
Prereqs: pip3 install pikepdf
"""

from pikepdf import Pdf, OutlineItem
from re import compile, split
import sys

try:
    input_file  = sys.argv[1]
    output_file = sys.argv[2]
except IndexError as e:  # raised when fewer than two paths are given
    print(f"Error: {e}. Please check your paths.\nUsage: freesort.py /input/path/foo.pdf /output/path/foo.pdf")
    sys.exit(1)

pdf = Pdf.open(input_file, allow_overwriting_input=True)
bookmarks_unsorted = {}
bookmarks = {}

with pdf.open_outline() as outline:
    # extract
    for o in outline.root:
        bookmarks_unsorted.update({o.title: o})
    del outline.root[:]

    # sort (first parent only) - thanks to https://stackoverflow.com/a/37036428/1231693
    dre = compile(r'(\d+)')
    bookmarks = dict(sorted(bookmarks_unsorted.items(),
                            key=lambda l: [int(s) if s.isdigit() else s.lower() for s in split(dre, l[0])]))

    # create
    for key, val in bookmarks.items():
        outline.root.append(val)

pdf.save(output_file)
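
The sort key in the script above splits each title on runs of digits, so that "Chapter 10" sorts after "Chapter 2" rather than before it. The following standalone sketch mirrors that key on plain strings so it can be tried without a PDF. The names `DIGITS`, `natural_key` and `titles` are made up for illustration:

```python
import re

# The capturing group makes split() keep the digit runs in the result.
DIGITS = re.compile(r"(\d+)")


def natural_key(title):
    """Split a title into lowercased text chunks and integer chunks."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in DIGITS.split(title)]


titles = ["Chapter 10", "chapter 2", "Chapter 1", "Appendix"]
print(sorted(titles))                   # plain string ordering
print(sorted(titles, key=natural_key))  # natural, case-insensitive ordering
```

Because the split always starts with a (possibly empty) text chunk and then alternates with digit chunks, two keys compare text with text and numbers with numbers at each position.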

PyPDF2:

#!/usr/bin/env python3

"""
freesort.py 2022-06-21 Sean W
Purpose: sort top-level bookmarks only (i.e., leave children alone)
Usage: freesort.py /input/path/foo.pdf /output/path/foo.pdf
Prereqs: pip3 install PyPDF2
"""

import PyPDF2
from PyPDF2 import PdfReader, PdfWriter
import sys

try:
    input_file  = sys.argv[1]
    output_file = sys.argv[2]
except IndexError as e:  # raised when fewer than two paths are given
    print(f"Error: {e}. Please check your paths.\nUsage: freesort.py /input/path/foo.pdf /output/path/foo.pdf")
    sys.exit(1)

reader = PdfReader(input_file)
writer = PdfWriter()
parents_before = {}  # before sorting
parents_after = {}   # after sorting

outlines = reader.getOutlines()
for o in outlines:
    if isinstance(o, PyPDF2.generic.Destination):  # no sub-bookmarks
        parents_before.update({o['/Title']: outlines.index(o)})

parents_before = dict(sorted(parents_before.items()))

# copy content (this includes annotations)
for pageNum in range(reader.numPages):
    writer.addPage(reader.getPage(pageNum))

def add_item(outline_item, parent = None):
    fit = outline_item['/Type']
    if fit == '/XYZ':
        zoom = [outline_item['/Left'], outline_item['/Top'], outline_item['/Zoom']]
    else:
        zoom = [outline_item['/Top']]

    ref = writer.add_bookmark(str(outline_item["/Title"]),
                               reader.getDestinationPageNumber(outline_item),  # page num
                               parent,                                         # parent
                               (0, 0, 0),                                      # color
                               True,                                           # bold
                               False,                                          # italic
                               fit,
                               *zoom)

    return ref

# create parents first
for k, v in parents_before.items():
    parents_after[v] = add_item(outlines[v])

# now children
for o in outlines:
    if isinstance(o, list):  # children only
        i = outlines.index(o)
        for l in o:          # each child
            # a list of children directly follows its parent in reader's outlines
            add_item(l, parents_after[i - 1])

# save
with open(output_file, 'wb') as result_pdf:
    writer.write(result_pdf)
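
The add_item function above handles /XYZ and otherwise assumes a /Top entry, which per the PDF specification exists only for /FitH and /FitBH. A possible generalization is sketched below, with a plain dict standing in for a PyPDF2 Destination. The helper `fit_args` and the table `FIT_KEYS` are hypothetical names, and the sketch only computes the values; it has not been run against PyPDF2 itself:

```python
# Coordinates each PDF destination fit type takes, in order (per the PDF spec).
FIT_KEYS = {
    "/XYZ":   ("/Left", "/Top", "/Zoom"),
    "/Fit":   (),
    "/FitH":  ("/Top",),
    "/FitV":  ("/Left",),
    "/FitR":  ("/Left", "/Bottom", "/Right", "/Top"),
    "/FitB":  (),
    "/FitBH": ("/Top",),
    "/FitBV": ("/Left",),
}


def fit_args(dest):
    """Return (fit_type, coordinates) for a destination-like mapping."""
    fit = str(dest["/Type"])
    # Unknown fit types fall back to no coordinates; missing entries become None.
    keys = FIT_KEYS.get(fit, ())
    return fit, [dest.get(key) for key in keys]


print(fit_args({"/Type": "/FitH", "/Top": 700}))
```

The returned pair could then replace the fit and zoom values in add_item. Whether None is accepted for a missing coordinate depends on the PyPDF2 version, so treat that part as an assumption.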
