
I am trying to extract text page by page from a PDF and store each page's words as a list inside an outer list, like this:

[['This', 'is', 'one', 'page'] , ['I', 'am', 'page', 'TWO'] , ['Three', 'that\'s', 'me'] , ['and', 'so', 'on'] , ['...']]

I used the extractText() method from the PyPDF2 package:

#!/usr/bin/python

from PyPDF2 import PdfFileReader

# open PDF
myPDFpath = 'test.pdf'
myPDF = PdfFileReader(open(myPDFpath, "rb"))

# initialize page list
pagelist = []

# grab all text from PDF per page and put into page list    
for page in range(0, myPDF.getNumPages()):
    currentPage = myPDF.getPage(page)
    myText = currentPage.extractText()
    thispage = myText.split()
    pagelist.append(thispage)

The above code technically works, but the method is not reliable (as its own documentation admits) and produces output like:

[u'!"#$"%#&\'"()"', u'"!"#$"%#&\'"()"', u'"!"#$"%#&\'"()"', u'"!"#$"%#&\'"()"', u'"!"#$"%#&\'"()"', u'"!"#$"%#&\'"()"', u'"!"#$"%#&\'"()"', u'"!"#$"%#&\'"()"', u'"!"#$"%#&\'"()"', 

So I was wondering if there is any other reliable way to parse text from a PDF file in Python?

birgit
  • You mean the phrase "This works well for some PDF files, but poorly for others, depending on the generator used."? It's true. Not *all* text in *all* PDFs can *always* be extracted. Post a link to your problematic PDF and we can tell if this is such one. – Jongware Sep 07 '15 at 06:33
  • The file I used is here: http://a.uguu.se/yefsbf_testdocx-pdf.pdf It was generated with the Print/Pdf... function in MS Word from a docx document. If there's a way to generate a pdf from a docx that does not raise these issues (but keeps pages etc intact) that'd be great – birgit Sep 07 '15 at 06:41
  • Quite surprising: my own tool does better than PyPDF2, but makes a mistake in decoding the font: `% -- Plain text dump ---------------- I"am"page"1.""I"am"page"1.""I"am"page"1.""` (etc.). The space character gets translated into `"`! Still, the same thing happens when copying the text with Adobe Acrobat, and that's the touchstone regarding being able to copy text. Examining the PDF shows we're both correct, and according to the embedded `/ToUnicode` the 'space' indeed translates to a double quote. – Jongware Sep 07 '15 at 10:01
  • @Jongware - interesting! Is your own tool available somewhere? Thanks – birgit Sep 07 '15 at 20:22
  • Sorry, it's way too crude for the general population. It can dump useful data per object, page, et cetera but interpreting that dump still needs thorough knowledge of the PDF specs. Back to your problem: (a) it seems pdfminer is not (yet?) there, but also (b) you happen to have a not-quite conforming PDF on your hands, which complicates things. – Jongware Sep 07 '15 at 20:49
  • Is there a reliable way to convert a .docx to a "conforming PDF"? – birgit Sep 07 '15 at 23:22
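
As a rough illustration of the `/ToUnicode` issue described in the comments (the space character decoding as `"`), here is a minimal post-processing sketch. It assumes that a page whose extracted text contains no whitespace at all but many double quotes has suffered exactly this mis-mapping. The function name `split_page_words` and the `min_quote_ratio` threshold are hypothetical, not part of PyPDF2 or any other library, and the heuristic will not help with PDFs that are broken in other ways.

```python
def split_page_words(text, min_quote_ratio=0.1):
    """Split extracted page text into a list of words.

    Assumption (heuristic, not a general fix): if the text has no
    whitespace but a high share of double quotes, the quotes are
    mis-decoded spaces and are turned back into spaces before splitting.
    """
    has_whitespace = any(ch.isspace() for ch in text)
    quote_ratio = text.count('"') / len(text) if text else 0.0
    if not has_whitespace and quote_ratio >= min_quote_ratio:
        text = text.replace('"', ' ')
    return text.split()


# Sample taken from the plain text dump quoted in the comment above
sample = 'I"am"page"1.""I"am"page"1.""I"am"page"1.""'
print(split_page_words(sample))
```

Text that already contains normal whitespace is left untouched, so genuine quotation marks survive on pages that were decoded correctly.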

1 Answer


Well, you could try this:

import PyPDF2

pages = []
pdf_file = 'test.pdf'  # replace with the path to your PDF
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
for page_number in range(number_of_pages):   # use xrange in Py2
    # Extract the text of this page, then split it on spaces as you require
    page = read_pdf.getPage(page_number).extractText().split(" ")
    pages.append(page)
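
One detail worth checking: `split(" ")` here behaves differently from the `split()` call in the question. The short sketch below uses a made-up sample string (not output from the PDF in question) to show the difference. Splitting on a single space keeps empty strings and embedded newlines, while splitting with no argument splits on any run of whitespace.

```python
# Hypothetical sample text with a double space, a newline and a trailing space
text = "I am  page\nTWO "

by_space = text.split(" ")   # splits on every single space character
by_whitespace = text.split() # splits on any run of whitespace

print(by_space)       # ['I', 'am', '', 'page\nTWO', '']
print(by_whitespace)  # ['I', 'am', 'page', 'TWO']
```

If the goal is a clean word list per page, as in the question, the no-argument form is probably the better fit.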
Anjali