Extract text from PDF (Table of Contents) Ignoring page and indexing numbers

Question

I am working on extracting text from a PDF and saving it to a .csv file. The image below shows the text I am trying to extract from the PDF:

Currently, I am able to extract the text but can't get rid of the numbers that indicate page numbers and indexing (i.e., the numbers at the start and end of each entry: 1, 5, 1.1, 5, 1.2, etc.). Below is my working code (I am using Python 3.5):

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO, BytesIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages = maxpages, password = password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()
    text = text.replace('\n\n', ' ').replace('\n',' ').replace('–',' ').replace('_',' ').replace('\t',' ').encode('ascii', errors='replace').decode('utf-8').replace("?","").replace("\x0c","").replace(".","").replace('\\',"").replace('/',"").replace('\r',"").replace("-"," ").replace(".......*"," ")
    text = " ".join(text.split())
    fp.close()
    device.close()
    retstr.close()
    return text

content = convert_pdf_to_txt('filename.pdf')

#print (content.encode('utf-8'))
s = StringIO(content)
with open('output.csv', 'w') as f:
    for line in s:
        f.write(line)

Thanks in advance for the help.

dalanicolai · Accepted Answer · 2019-09-16T11:49:51.497

The pdfminer documentation shows how to do this in section 2.4.

For the record I'll copy-paste the relevant code here.

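from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

password = ''  # empty string for PDFs that are not password-protected

# Open a PDF document.
with open('mypdf.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    document = PDFDocument(parser, password)
    # Get the outlines of the document.
    # get_outlines() raises PDFNoOutlines if the PDF has no outline.
    outlines = document.get_outlines()
    for (level, title, dest, a, se) in outlines:
        # Drop the first space-separated token (the index number) from each title.
        print(' '.join(title.split(' ')[1:]))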

The print statement was adapted to answer the question: it drops the first space-separated token of each title, which is the index number.
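If the entries come from plain extracted text rather than from the outline, a regex-based cleanup is one possible alternative. The following is only a sketch: it assumes one TOC entry per line (the question's code collapses newlines, so that step would need to be skipped), and the names `clean_toc_line` and `toc_to_csv` are made up for illustration.

```python
import csv
import io
import re

# Leading section index such as "1", "1.1" or "2.3.4." followed by whitespace.
LEADING_INDEX = re.compile(r'^\s*\d+(?:\.\d+)*\.?\s+')
# Trailing dot leaders and page number such as " ..... 5" or " 12".
TRAILING_PAGE = re.compile(r'[\s.]+\d+\s*$')


def clean_toc_line(line):
    """Return the heading text without its index and page number."""
    line = LEADING_INDEX.sub('', line)
    line = TRAILING_PAGE.sub('', line)
    return line.strip()


def toc_to_csv(lines):
    """Write one cleaned heading per CSV row and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    for line in lines:
        heading = clean_toc_line(line)
        if heading:
            writer.writerow([heading])
    return buffer.getvalue()


# Hypothetical sample entries in the shape described in the question.
sample = ['1 Introduction 5', '1.1 Background ....... 5', '1.2 Scope 6']
print(toc_to_csv(sample))
```

Note that this pattern cannot tell a page number from a heading that genuinely starts or ends with a number, so such headings would be truncated.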

score 1 · Answer 2 · answered Feb 26 '19 at 13:30

You can extract the TOC with mutool:

mutool show your.pdf outline > toc.txt

Then convert the contents of toc.txt to a CSV file.
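The conversion step could look roughly like the sketch below. The exact layout of the mutool outline output differs between versions, so this assumes tab-separated fields per line, with optional fold markers, an optionally quoted title, and a destination starting with `#`. The function names are hypothetical.

```python
import csv


def parse_outline_line(line):
    """Return the title from one outline line, or None if there is none.

    Assumed layout (not verified for every mutool version): tab-separated
    fields, possibly empty indentation fields or a fold marker first,
    then the title, then a destination such as "#page=5".
    """
    fields = [field.strip() for field in line.rstrip('\n').split('\t')]
    fields = [
        field for field in fields
        if field and field not in ('|', '+', '-') and not field.startswith('#')
    ]
    if not fields:
        return None
    return fields[0].strip('"')


def outline_to_csv(txt_path, csv_path):
    """Write one title per row from the mutool output file to a CSV file."""
    with open(txt_path, encoding='utf-8') as src, \
            open(csv_path, 'w', newline='', encoding='utf-8') as dst:
        writer = csv.writer(dst)
        for line in src:
            title = parse_outline_line(line)
            if title is not None:
                writer.writerow([title])
```

Calling `outline_to_csv('toc.txt', 'output.csv')` would then produce the CSV file; check a few lines of toc.txt first to confirm they match the assumed layout.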

I learned about mutool from this answer: Extract toc from pdf by mutool

Denny

Or to use mupdf (for which mutool provides the command line API) from python, you can use the excellent pymupdf library: [extract toc with pymupdf](https://pymupdf.readthedocs.io/en/latest/tutorial.html#working-with-outlines) – dalanicolai Sep 25 '21 at 20:45
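A minimal sketch of the PyMuPDF route mentioned in the comment above, assuming a PyMuPDF version where `Document.get_toc()` is available (older releases named it `getToC()`). Each TOC entry starts with the level, the title, and the page number. The helper names and the index-stripping regex are illustrative assumptions, not part of the library.

```python
import csv
import re

# Leading section index such as "1", "1.1" or "2.3.4." followed by whitespace.
LEADING_INDEX = re.compile(r'^\s*\d+(?:\.\d+)*\.?\s+')


def toc_titles(toc):
    """Return titles from [level, title, page, ...] entries, index removed."""
    return [LEADING_INDEX.sub('', entry[1]).strip() for entry in toc]


def write_toc_csv(pdf_path, csv_path):
    """Read the outline of pdf_path and write one title per CSV row."""
    import fitz  # PyMuPDF; imported here so toc_titles works without it

    doc = fitz.open(pdf_path)
    try:
        toc = doc.get_toc()
    finally:
        doc.close()
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for title in toc_titles(toc):
            writer.writerow([title])
```

Because the page number is a separate field in each entry, only the leading index has to be stripped from the title text.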