0

Please do not use "tika" for an answer. I have already tried answers from this question:

How to extract text from a PDF file?

I have this PDF file, https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing , and I would like to copy the text.

import PyPDF2
pdfFileObject = open('C:\\Path\\To\\Local\\File\\Test_PDF.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    print(page.extractText())

The output is "Date Submitted: 2019-10-21 16:03:36.093 | Form Key: 5544" which is only part of the text. The next line of text starts with "Exhibit A to RFA...."

Jortega
  • 3,616
  • 1
  • 18
  • 21
  • Can you explain what you mean by `which is only part of the text`? The reader reads line by line and hence it's giving the right sequential output. – AzyCrw4282 Jul 30 '20 at 19:44
  • @ AzyCrw4282 I am trying to get all the text in the PDF not just the first line. – Jortega Jul 30 '20 at 19:56

2 Answers2

1

I have never used PYPDF2 myself so can't really input my knowledge to find out exactly what's going wrong. But the following from the documentation states the following about the function extractText()

Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

Here's an alternative way to get around this and also exaplains what maybe going wrong. I would also recommend using pdftotext. This has worked reliably for me many times; this answer will also prove helpful in that.

AzyCrw4282
  • 7,222
  • 5
  • 19
  • 35
1

Found a solution.

#pip install pdfminer.six
import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    '''Convert pdf content from a file path to text

    :path the file path
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()

    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()

                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)

                return retstr.getvalue()


if __name__ == "__main__":
    print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))
Jortega
  • 3,616
  • 1
  • 18
  • 21