2

I searched for my question and did not get my answer in the two available questions

  1. Extract text per page with Python pdfMiner?

  2. PDFMiner - Iterating through pages and converting them to text

Basically I want to iterate over each page because I want to select only that page which has a certain text.

I have used pyPdf. It works for almost i can say 90% of the pdfs but sometimes it does not extract the information from a page.

I have used the below code:

import pyPdf
extract = ""        
pdf = pyPdf.PdfFileReader(open('filename.pdf', "rb"))
num_of_pages = pdf.getNumPages()
for p in range(num_of_pages):
  ex = pdf.getPage(6)
  ex = ex.extractText()
  if re.search(r"to be held (at|on)",ex.lower()):
    print 'yes'
    print  ex ,"\n"
    extract = extract + ex + "\n" 
    continue

The above code works but sometimes some pages don't get extracted.

I also tried using pdfminer, but i could not find how to iterate the pdf in it page by page. pdfminer returns the entire text of the pdf.

I used the below code:

def convert_pdf_to_txt(path):
  rsrcmgr = PDFResourceManager()
  retstr = StringIO()
  codec = 'utf-8'
  laparams = LAParams()
  device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
  fp = file(path, 'rb')
  interpreter = PDFPageInterpreter(rsrcmgr, device)
  password = ""
  maxpages = 0
  caching = True
  pagenos=set()

 for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)

    text = retstr.getvalue()

   fp.close()
   device.close()
   retstr.close()
   return text

In the above code the text from the pdf comes from the for loop

for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)

    text = retstr.getvalue()

In this how can I iterated on one page at a time.

The documentation on pdfminer is not understandable. Also there are many versions of the same.

So are there any other packages available for my question or can pdfminer be used for it?

Community
  • 1
  • 1
Rohan Amrute
  • 764
  • 1
  • 9
  • 23

3 Answers3

6

Because retstr will retain each page, you might consider altering your code by calling retstr.truncate(0) which clears the string each time, otherwise you're printing the entirety of what's already been read each time:

import pyPdf
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

path = "filename.pdf"
pdf = pyPdf.PdfFileReader(open(path, "rb"))
fp = file(path, 'rb')
num_of_pages = pdf.getNumPages()
extract = ""
for i in range(num_of_pages):
  inside = [i]
  pagenos=set(inside)
  rsrcmgr = PDFResourceManager()
  retstr = StringIO()
  codec = 'utf-8'
  laparams = LAParams()
  device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
  interpreter = PDFPageInterpreter(rsrcmgr, device)
  password = ""
  maxpages = 0
  caching = True
  text = ""
  for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)
    text = retstr.getvalue()
    retstr.truncate(0)
    text = text.decode("ascii","replace")
    if re.search(r"to be held (at|on)",text.lower()):
        print text
        extract = extract + text + "\n" 
        continue
  • OMG man! the truncate line saved my life! I was about to give up and found your piece of code after a lot of research. its 2am and i finally gonna get some sleep. thanks – rodrigorf Aug 24 '19 at 04:58
3

I know it is not good to answer your own question but i think i may have figured out an answer for this question.

I think it is not the best way to do it, but still it helps me.

I used a combination of pypdf and pdfminer

The code is as below:

import pyPdf
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

path = "filename.pdf"
pdf = pyPdf.PdfFileReader(open(path, "rb"))
fp = file(path, 'rb')
num_of_pages = pdf.getNumPages()
extract = ""
for i in range(num_of_pages):
  inside = [i]
  pagenos=set(inside)
  rsrcmgr = PDFResourceManager()
  retstr = StringIO()
  codec = 'utf-8'
  laparams = LAParams()
  device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
  interpreter = PDFPageInterpreter(rsrcmgr, device)
  password = ""
  maxpages = 0
  caching = True
  text = ""
  for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
    interpreter.process_page(page)
    text = retstr.getvalue()
    text = text.decode("ascii","replace")
    if re.search(r"to be held (at|on)",text.lower()):
        print text
        extract = extract + text + "\n" 
        continue

There may be a better way to do it, but currently i found out this to be pretty good.

Rohan Amrute
  • 764
  • 1
  • 9
  • 23
0

You can refer the following link to extract page by page text from PDF.

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages("test.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())

PDFMiner Page by Page text Extraction