PyPDF2 hangs on processing

Question

I'm processing multiple PDF files using PyPDF2, but my script hangs somewhere. All I can see in my console is a "startxref on same line as offset" message, which I believe is only a warning, so the function should still reach the finally block and return an empty string.

Am I doing something wrong?

import sys

import PyPDF2


def decode_pdf(src_filename):
    out_str = ""
    try:
        # The with block closes the file even if parsing fails
        with open(str(src_filename), "rb") as f:
            read_pdf = PyPDF2.PdfFileReader(f)
            number_of_pages = read_pdf.getNumPages()
            for i in range(number_of_pages):
                page = read_pdf.getPage(i)
                out_str = out_str + " " + page.extractText()
        # Collapse the extracted text onto a single line
        out_str = ''.join(out_str.splitlines())
    except Exception:
        print("Exception on pdf")
        print(sys.exc_info())
        out_str = ""
    finally:
        # Returning from finally also suppresses any exception still propagating
        return out_str

I cannot reproduce any errors. This code works just fine for me. Can you update your post with the exact error you are getting? Is this error only occurring on large PDF files? — A Magoon, Aug 08 '17 at 18:28
Unable to reproduce with "some error" and "some file". If there is a single file that consistently produces that one error, share it so we can check. — Jongware, Jan 27 '18 at 00:05
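
A note on the warning quoted in the question, as a minimal sketch: assuming PyPDF2 emits the message through Python's standard warnings module, a warning is by default only reported, not raised, so the except branch never runs and the function simply continues. The parse function below is a hypothetical stand-in for the PyPDF2 calls, not part of any library.

```python
import warnings


def parse(emit_warning):
    """Hypothetical stand-in for a parser call that may emit a warning."""
    out_str = ""
    try:
        if emit_warning:
            warnings.warn("startxref on same line as offset")
        out_str = "text extracted after the warning"
    except Exception:
        # Reached only if something is actually raised
        out_str = ""
    return out_str


# With warnings merely recorded, execution continues past the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result_default = parse(True)

# With the "error" filter, the same warning is raised as an exception.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    result_strict = parse(True)

print(repr(result_default), len(caught))
print(repr(result_strict))
```

If that assumption holds, the message by itself does not explain the hang; the script is more likely stuck in a later call, such as text extraction on one particular file.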

Krishna · Answer 1 · 2018-01-27T00:13:53.017

I faced this issue too and couldn't solve it with PyPDF2. I solved it with pdfminer instead, using the example from here.

The relevant code is copied below:

from io import StringIO  # Python 3; on Python 2 use cStringIO.StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert(fname, pages=None):
    # An empty set makes PDFPage.get_pages process every page
    pagenums = set(pages) if pages else set()

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    # The with block closes the file even if a page fails to parse
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)
    converter.close()
    text = output.getvalue()
    output.close()
    return text

Call the function convert() as below. Page numbers are zero-based, so this extracts the sixth and eighth pages:

convert('myfile.pdf', pages=[5,7])
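
If a parser can still hang on a malformed file, one option is to run each extraction in a separate process with a time limit, so a stuck file is skipped instead of blocking the whole batch. The following is only a sketch of that idea using the standard subprocess module; run_with_timeout is a hypothetical helper name, and the two demo commands stand in for a real extraction script.

```python
import subprocess
import sys


def run_with_timeout(args, timeout_s):
    """Run a command and return its stdout, or "" on failure or timeout."""
    try:
        done = subprocess.run(
            args,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,  # the child is killed once this expires
            check=True,         # a non-zero exit status raises CalledProcessError
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return ""
    return done.stdout.decode("utf-8", errors="replace")


# A command that finishes well within its limit returns its output.
fast = run_with_timeout([sys.executable, "-c", "print('ok')"], 30)

# A command that would run for a minute is cut off after one second.
slow = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(60)"], 1
)

print(repr(fast.strip()), repr(slow))
```

In a real batch, the command would be something like a small worker script that prints the text of one PDF (for example, a hypothetical extract_one.py that calls convert() on its argument), invoked once per file.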

Could you quote the relevant parts of the linked resource in your answer? As-is, your answer is very susceptible to link rot (i.e. if the linked website goes down or changes, your answer is not useful). — mech, Jan 26 '18 at 23:46
