
Area of coding: PDF table of contents in Python 3 using PyPDF2

Problem: I need a program that can iterate through a variable (typed as a Union) that contains dictionaries interleaved with lists, where each list in turn contains more dictionaries:

[
    {},
    [{}, {}, {}],
    {},
    [{}, {}, {}],
    {},
    [{}, {}, {}]
]

This pattern repeats multiple times.

Expected output: The output should look like this:

1 Title Goes Here
   1.1 Title Goes Here
       1.1.1 Title Goes Here
       1.1.2 Title Goes Here
       1.1.3 Title Goes Here
   1.2 Title Goes Here
       1.2.1 Title Goes Here
       1.2.2 Title Goes Here
       1.2.3 Title Goes Here
   1.3 Title Goes Here
       1.3.1 Title Goes Here
       1.3.2 Title Goes Here
       1.3.3 Title Goes Here

2 Title Goes Here
   2.1 Title Goes Here
       2.1.1 Title Goes Here
       2.1.2 Title Goes Here
       2.1.3 Title Goes Here
   2.2 Title Goes Here
       2.2.1 Title Goes Here
       2.2.2 Title Goes Here
       2.2.3 Title Goes Here
   2.3 Title Goes Here
       2.3.1 Title Goes Here
       2.3.2 Title Goes Here
       2.3.3 Title Goes Here

Program:

import argparse as arp
from PyPDF2 import PdfFileReader

parser = arp.ArgumentParser()
parser.add_argument("-f", "--file", help="File to analyse")
arg = parser.parse_args()
filename = arg.file

def fileread():
    doc = PdfFileReader(filename)
    ToC = doc.getOutlines()

    # ToC: Union[List[Union[Destination, list]], {__eq__}] = doc.getOutlines()

    for elements in ToC:
        #print(elements)
        #print("\n")

        try:
            if elements is {}: # If the element is a dictionary just find the Title
                print(elements['/Title']) # TODO: This is just skipped 

            else: # If the element is a list go through and print out the titles
                for nest_dict in elements:
                    try:
                        print(nest_dict["/Title"])
                    except:
                        continue
        except:
            continue

fileread()

I'm testing this program on: Compilers - Principles, Techniques, and Tools-Pearson_Addison Wesley (2006).pdf

Any help is much appreciated.

JonSG
  • Since this is a nested tree, can you see a way that the tree kind of looks the same no matter where you are in it? Perhaps `fileread()` should be recursive. – JonSG Jan 23 '22 at 16:46
  • Does this answer your question? [How to check if a variable is a dictionary in Python?](https://stackoverflow.com/questions/25231989/how-to-check-if-a-variable-is-a-dictionary-in-python) – kaya3 Jan 24 '22 at 00:32

2 Answers


This line is not right:

        if elements is {}: # If the element is a dictionary just find the Title

It should instead read:

        if isinstance(elements, dict):
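
To see why the original test never matches: `is` compares object identity, and `{}` builds a brand-new dictionary every time it is evaluated, so `elements is {}` can never be true. A minimal sketch with plain dictionaries (the sample `entry` is made up for illustration and is not a real PyPDF2 object):

```python
# Hypothetical stand-in for one outline entry; not produced by PyPDF2.
entry = {"/Title": "Introduction"}

# `{}` creates a fresh dict object, so an identity check against it never succeeds.
identity_check = entry is {}
empty_identity_check = {} is {}

# isinstance() asks about the type, which is what the branch actually needs.
type_check = isinstance(entry, dict)
list_check = isinstance([entry], dict)

print(identity_check, empty_identity_check, type_check, list_check)
# prints: False False True False
```

Because the identity check is always false, every element falls into the `else` branch, and iterating over a dictionary there yields its keys (strings), which is why the top-level titles were silently skipped by the bare `except`.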
dshin

With the code below, I get the following output from your PDF file:

Output:

1 Introduction
1.1 Language Processors
1.1.1 Exercises for Section 1.1
1.2 The Structure of a Compiler
...
2 A Simple Syntax-Directed Translator
2.1 Introduction
2.2 Syntax Definition
2.2.1 Definition of Grammars
...

Python code:

import argparse as arp
from PyPDF2 import PdfFileReader

parser = arp.ArgumentParser()
parser.add_argument("-f", "--file", help="File to analyse")
arg = parser.parse_args()
filename = arg.file

def print_title(input_data):
    # A dictionary is a single outline entry: print its title if it has one.
    if isinstance(input_data, dict):
        title = input_data.get('/Title')
        if title is not None:
            print(title)
    else:
        # Anything else is a list of nested entries: recurse into each one.
        for nest_dict in input_data:
            print_title(nest_dict)

def fileread():
    doc = PdfFileReader(filename)
    ToC = doc.getOutlines()

    for elements in ToC:
        print_title(elements)

fileread()

I'm not an expert in Python, but I hope this helps. By the way, you can read more about recursion in Python here
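
If the titles in a given PDF do not already carry their section numbers, the numbering and indentation from the expected output could be generated during the same recursive walk. This is only a sketch under a few assumptions: the outline follows the layout from the question (a dictionary is one entry, and a list that follows it holds that entry's children), the helper names `number_outline` and `format_outline` are invented here, and the sample data is made up rather than read from a file. The indentation is a fixed three spaces per level, which differs slightly from the spacing shown in the question.

```python
def number_outline(items, prefix=()):
    """Return (number, title) pairs for a nested outline.

    Assumes a dict is one entry and a list following a dict holds
    that entry's children, as in the layout from the question.
    """
    rows = []
    counter = 0
    for item in items:
        if isinstance(item, dict):
            counter += 1
            number = prefix + (counter,)
            rows.append((".".join(map(str, number)), item.get("/Title", "")))
        else:
            # A list: its entries are children of the most recent dict.
            rows.extend(number_outline(item, prefix + (counter,)))
    return rows


def format_outline(items, indent="   "):
    """Indent each numbered title according to its nesting depth."""
    lines = []
    for number, title in number_outline(items):
        depth = number.count(".")
        lines.append(f"{indent * depth}{number} {title}")
    return lines


# Made-up sample data standing in for the result of doc.getOutlines().
sample = [
    {"/Title": "Introduction"},
    [
        {"/Title": "Language Processors"},
        [{"/Title": "Exercises for Section 1.1"}],
        {"/Title": "The Structure of a Compiler"},
    ],
    {"/Title": "A Simple Syntax-Directed Translator"},
    [{"/Title": "Introduction"}],
]

for line in format_outline(sample):
    print(line)
```

Returning the lines instead of printing inside the recursion keeps the walk easy to test. A list that appears before any dictionary would be numbered under `0`, so that case may need separate handling.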

mozello