I am trying to extract text page by page from a PDF and store the words of each page as a list inside an outer list, like this:
[['This', 'is', 'one', 'page'] , ['I', 'am', 'page', 'TWO'] , ['Three', 'that\'s', 'me'] , ['and', 'so', 'on'] , ['...']]
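To make the target structure concrete, here is a minimal sketch that assumes the text of each page is already available as a plain string. The helper name `pages_to_word_lists` and the sample strings are made up for illustration and are not part of any PDF library.

```python
def pages_to_word_lists(page_texts):
    """Split each page's text on whitespace, keeping one word list per page."""
    return [text.split() for text in page_texts]


# Hypothetical sample input: one string per page.
sample_pages = ["This is one page", "I am page TWO", "Three that's me"]
word_lists = pages_to_word_lists(sample_pages)
print(word_lists)
```

An empty page simply yields an empty inner list, so the outer list always has one entry per page.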
I used the extractText() method from the PyPDF2 package:
#!/usr/bin/python
from PyPDF2 import PdfFileReader
# open PDF
myPDFpath = 'test.pdf'
myPDF = PdfFileReader(open(myPDFpath, "rb"))
# initialize page list
pagelist = []
# grab all text from PDF per page and put into page list
for page in range(0, myPDF.getNumPages()):
    currentPage = myPDF.getPage(page)
    myText = currentPage.extractText()
    thispage = myText.split()
    pagelist.append(thispage)
The code above technically works, but extractText() is not reliable (as PyPDF2's own documentation admits) and produces output like:
[u'!"#$"%#&\'"()"', u'"!"#$"%#&\'"()"', u'"!"#$"%#&\'"()"', u'"!"#$"%#&\'"()"', u'"!"#$"%#&\'"()"', u'"!"#$"%#&\'"()"', u'"!"#$"%#&\'"()"', u'"!"#$"%#&\'"()"', u'"!"#$"%#&\'"()"',
So I was wondering: is there a more reliable way to extract text from a PDF file in Python?