
I have a large number of PDFs with different structures, and I need to extract the text from them and find some key indicators.

I am using the pyPdf module, and when it does not retrieve any text from a PDF, I fall back to PDFMiner.

The problem is that for some files neither module works: no text is extracted from the PDF. Some of them are scanned or image-only PDFs, but others appear to have the same structure as the ones that can be parsed.

Here are the two functions I use; maybe I am missing something:

Using pyPdf

import pyPdf

def getPDFContent(path):
    content = ""
    # Open in binary mode; pyPdf reads the raw PDF bytes
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    for i in range(pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + " "
    # Replace non-breaking spaces (u"\xa0") and collapse runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").split())
    return content

mt = getPDFContent(filename).encode("ascii", "xmlcharrefreplace")

Using PDF Miner

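# Python 2, like the pyPdf version above
from StringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def getPDFContent(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0        # 0 means no page limit
    caching = True
    pagenos = set()     # empty set means all pages

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
        retstr.write("nextpage")   # marker written after each page
    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text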
Nick Dragosh
  • I generally use http://www.foolabs.com/xpdf/download.html to extract text from PDF files. However, if they contain only scanned images, you may need to use an OCR tool to extract the text. Please note that some PDF documents can be copyright protected, and in that case the extraction will fail. – Alex Dec 07 '15 at 09:26
  • It's not a fact that *all* text can *always* be extracted correctly from *all* possible PDFs. Please post a link to one of the PDFs your code fails on so we can check whether this cannot be done at all, or is just due to a shortcoming of pyPDF. You can also try copying the text with Adobe Reader, the canonical PDF reader and one of the very best at text extraction. If it fails as well, chances are extremely small that a non-Adobe product can succeed. – Jongware Dec 07 '15 at 11:38

1 Answer

pypdf has received a lot of updates in 2022. Text extraction in particular was improved a lot. You can extract text like this:

from pypdf import PdfReader

reader = PdfReader("arabic.pdf")
full_text = ""
for page in reader.pages:
    full_text += page.extract_text() + "\n"
print(full_text)
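
To narrow down which files (or pages) are image-only, it may help to record which pages come back empty. This is only a minimal sketch: the helper name `pages_without_text` and the `min_chars` threshold are hypothetical, and it assumes the per-page strings are collected beforehand, for example with `[p.extract_text() for p in reader.pages]`.

```python
def pages_without_text(page_texts, min_chars=1):
    """Return 0-based indexes of pages whose extracted text is effectively empty."""
    empty = []
    for index, text in enumerate(page_texts):
        # Treat None and whitespace-only results as "no text".
        stripped = (text or "").strip()
        if len(stripped) < min_chars:
            empty.append(index)
    return empty


# Made-up page texts for illustration only.
sample = ["Revenue 2015: 1,200", "   ", None, "Total"]
print(pages_without_text(sample))  # prints [1, 2]
```

Pages reported here are candidates for OCR, as suggested in the comments on the question.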

If a specific PDF is causing issues, please report a bug. You need to share the pypdf version and the file that causes the issue.

You might also be interested in https://pypdf.readthedocs.io/en/latest/user/extract-text.html#why-text-extraction-is-hard
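
If you still want to try several libraries in turn, as in the question, the fallback logic can be kept separate from the extractors themselves. The following is a rough sketch under assumptions: `extract_with_fallback` is a hypothetical helper, and each extractor is any function that takes a path and returns a string (for instance a wrapper around pypdf or PDFMiner).

```python
def extract_with_fallback(path, extractors):
    """Try each (name, function) pair in order.

    Returns (name, text) for the first extractor that yields non-empty text,
    or (None, "") if every extractor fails or returns nothing.
    """
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # A file one library cannot parse should not stop the chain.
            continue
        if text and text.strip():
            return name, text
    return None, ""


# Stand-in extractors for illustration; real ones would call a PDF library.
def always_empty(path):
    return ""

def fake_text(path):
    return "some text from " + path

print(extract_with_fallback("report.pdf", [("first", always_empty), ("second", fake_text)]))
```

A result of `(None, "")` means no library found text, which is what you would expect for a scanned document.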

Martin Thoma