Read pdf page by page

Question

I searched for this and could not find an answer in the two existing questions on the topic.

Basically, I want to iterate over the pages one at a time so that I can select only the pages that contain a certain text.

I have used pyPdf. It works for roughly 90% of my PDFs, but sometimes it fails to extract the text from a page.

I have used the below code:

import re
import pyPdf

extract = ""
pdf = pyPdf.PdfFileReader(open('filename.pdf', "rb"))
num_of_pages = pdf.getNumPages()
for p in range(num_of_pages):
  # fetch page p (zero-based) and pull out its text
  ex = pdf.getPage(p)
  ex = ex.extractText()
  # keep only the pages that contain the phrase
  if re.search(r"to be held (at|on)", ex.lower()):
    print 'yes'
    print ex, "\n"
    extract = extract + ex + "\n"

The above code works but sometimes some pages don't get extracted.

I also tried pdfminer, but I could not find out how to iterate over the PDF page by page with it. pdfminer returns the text of the entire PDF.

I used the below code:

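from cStringIO import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path):
  rsrcmgr = PDFResourceManager()
  retstr = StringIO()
  codec = 'utf-8'
  laparams = LAParams()
  device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
  fp = file(path, 'rb')
  interpreter = PDFPageInterpreter(rsrcmgr, device)
  password = ""
  maxpages = 0
  caching = True
  pagenos = set()

  # every page is rendered into the same buffer, retstr
  for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
    interpreter.process_page(page)

  # so this is the text of the whole document
  text = retstr.getvalue()

  fp.close()
  device.close()
  retstr.close()
  return text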

In the above code, the text of the PDF is produced by this for loop:

for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)

    text = retstr.getvalue()

How can I iterate over one page at a time here?

The pdfminer documentation is hard to understand, and there are several versions of the library.

So, are there other packages that can do this, or can pdfminer be used for it?

score 6 · Answer 1 · answered Nov 24 '18 at 01:54

Because retstr retains the text of every page processed so far, consider calling retstr.truncate(0) after each page. This clears the buffer; otherwise you print everything that has already been read each time:

import re
import pyPdf
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

path = "filename.pdf"
pdf = pyPdf.PdfFileReader(open(path, "rb"))
fp = file(path, 'rb')
num_of_pages = pdf.getNumPages()
extract = ""
for i in range(num_of_pages):
  inside = [i]
  pagenos=set(inside)
  rsrcmgr = PDFResourceManager()
  retstr = StringIO()
  codec = 'utf-8'
  laparams = LAParams()
  device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
  interpreter = PDFPageInterpreter(rsrcmgr, device)
  password = ""
  maxpages = 0
  caching = True
  text = ""
  for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)
    text = retstr.getvalue()
    retstr.truncate(0)
    text = text.decode("ascii","replace")
    if re.search(r"to be held (at|on)",text.lower()):
        print text
        extract = extract + text + "\n" 
        continue
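The effect of a buffer that is never cleared can be seen without any PDF at all. Below is a minimal sketch, assuming Python 3's io.StringIO rather than the Python 2 cStringIO used above; the two strings in `pages` are made-up stand-ins for what pdfminer would write into the buffer. Note that truncate() does not move the stream position, so the sketch calls seek(0) first.

```python
import io

# Made-up stand-ins for the text pdfminer would render for each page.
pages = ["page one text\n", "page two text\n"]

# Without clearing: each getvalue() returns everything written so far.
buf = io.StringIO()
accumulated = []
for page in pages:
    buf.write(page)
    accumulated.append(buf.getvalue())

# With clearing: rewind and truncate after each page,
# so each getvalue() returns only the current page.
buf = io.StringIO()
per_page = []
for page in pages:
    buf.write(page)
    per_page.append(buf.getvalue())
    buf.seek(0)       # truncate() leaves the position unchanged
    buf.truncate(0)   # drop the text of the page just read

print(accumulated)
print(per_page)
```

Creating a fresh buffer for every page has the same effect as clearing a shared one.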

OMG man! the truncate line saved my life! I was about to give up and found your piece of code after a lot of research. its 2am and i finally gonna get some sleep. thanks — rodrigorf, Aug 24 '19 at 04:58

Rohan Amrute · Accepted Answer · 2016-01-07T07:24:43.080

I know it is not considered good to answer your own question, but I think I may have figured out an answer.

I don't think it is the best way to do it, but it works for me.

I used a combination of pyPdf and pdfminer.

The code is as below:

import re
import pyPdf
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

path = "filename.pdf"
pdf = pyPdf.PdfFileReader(open(path, "rb"))
fp = file(path, 'rb')
num_of_pages = pdf.getNumPages()
extract = ""
for i in range(num_of_pages):
  inside = [i]
  pagenos=set(inside)
  rsrcmgr = PDFResourceManager()
  retstr = StringIO()
  codec = 'utf-8'
  laparams = LAParams()
  device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
  interpreter = PDFPageInterpreter(rsrcmgr, device)
  password = ""
  maxpages = 0
  caching = True
  text = ""
  for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)
    text = retstr.getvalue()
    text = text.decode("ascii","replace")
    if re.search(r"to be held (at|on)",text.lower()):
        print text
        extract = extract + text + "\n" 
        continue

There may be a better way to do it, but for now I have found this to work pretty well.
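One detail worth knowing about the code above: text.decode("ascii","replace") swaps every non-ASCII byte for the Unicode replacement character, so accented letters and similar characters are lost. A minimal sketch of that behaviour, assuming Python 3 (where the byte string has to be built explicitly) and a made-up sample sentence:

```python
import re

# Made-up sample; the bytes play the role of the UTF-8 output pdfminer
# writes into the buffer under Python 2.
raw = "café, to be held at 5".encode("utf-8")

# What the answer's code does: each non-ASCII byte becomes U+FFFD.
lossy = raw.decode("ascii", "replace")

# Decoding with the codec the text was written in keeps the characters.
exact = raw.decode("utf-8")

print(repr(lossy))
print(repr(exact))

# The search phrase is plain ASCII, so it matches either way.
found = re.search(r"to be held (at|on)", lossy.lower()) is not None
```

If the extracted text needs to be kept intact and not just searched, decoding as UTF-8 is probably the safer choice.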

Nothing wrong with answering your own question. It can be useful to others after all. — Brecht Machiels, Feb 06 '16 at 12:19
@PiyushS.Wanare use PyPDF2 instead of pyPDF. For your reference: `pip install PyPDF2` then `from PyPDF2 import PdfFileReader` — Ishank Saxena, Apr 17 '21 at 19:05

score 0 · Answer 3 · answered Nov 24 '20 at 08:22

You can refer to the following link to extract text from a PDF page by page.

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages("test.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())

PDFMiner Page by Page text Extraction
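To get back to the original goal of selecting whole pages by a phrase, the per-element loop above can be collapsed into one string per page and then filtered. This is only a sketch, assuming Python 3 and pdfminer.six; `iter_page_texts` and `find_pages` are hypothetical helper names, not part of pdfminer.

```python
import re


def iter_page_texts(pdf_path):
    """Yield the text of each page of pdf_path as a single string."""
    # Imported here so find_pages works even without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages(pdf_path):
        yield "".join(
            element.get_text()
            for element in page_layout
            if isinstance(element, LTTextContainer)
        )


def find_pages(page_texts, pattern):
    """Yield (page_number, text) for every page whose text matches pattern.

    Page numbers start at 1; matching ignores case.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for number, text in enumerate(page_texts, start=1):
        if regex.search(text):
            yield number, text


# Quick check with made-up page texts instead of a real PDF.
sample = ["intro", "Meeting to be held on Monday", "TO BE HELD AT noon"]
hits = list(find_pages(sample, r"to be held (at|on)"))
print([number for number, _ in hits])
```

With a real file, the two pieces would be combined as find_pages(iter_page_texts("test.pdf"), r"to be held (at|on)").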
