Well, yes, you can: PDFMiner has an API.
The basic example in the docs says:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
# Open a PDF file.
fp = open('mypdf.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
password = ''  # empty string if the PDF is not password-protected
document = PDFDocument(parser, password)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    # do stuff with the page here
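and in the loop, once device is a PDFPageAggregator (from pdfminer.converter, created with an LAParams object) rather than the plain PDFDevice, you should go with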
# receive the LTPage object for the page.
layout = device.get_result()
and then use the LTTextBox objects in that layout. You have to analyse it yourself. There's no full example in the docs, but you can check the pdf2txt.py source, which will help you find the missing pieces (although it does much more, since it parses options and applies them).
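To make the pieces fit together, here is a minimal sketch of the whole thing, not a definitive implementation. It assumes a PDFPageAggregator device with default LAParams, and only looks at top-level LTTextBox elements on each page. The function name page_text_boxes and the file name in the usage lines are my own, made up for illustration.

```python
def page_text_boxes(path, password=''):
    """Yield (page number, text) for every top-level LTTextBox in the PDF.

    Sketch only: the function name is hypothetical, and nested layout
    containers (e.g. figures) are not descended into.
    """
    # Imported here so that defining the function does not require pdfminer.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    with open(path, 'rb') as fp:
        parser = PDFParser(fp)
        document = PDFDocument(parser, password)
        rsrcmgr = PDFResourceManager()
        # PDFPageAggregator runs layout analysis and keeps the LTPage result.
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for pageno, page in enumerate(PDFPage.create_pages(document), start=1):
            interpreter.process_page(page)
            layout = device.get_result()  # the LTPage object for this page
            for element in layout:
                if isinstance(element, LTTextBox):
                    yield pageno, element.get_text()
```

Usage would be something like `for pageno, text in page_text_boxes('mypdf.pdf'): print(pageno, text)`. It is a generator, so the file stays open only while you are iterating.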
And once you have the code that extracts the text, you can do whatever you want with it before saving a file, including searching for certain parts of the text.
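For the searching part, a rough sketch with the standard re module could look like this. It assumes you already have the extracted text as (page number, text) pairs; the helper name find_matches, the sample data, and the pattern are all invented for the example.

```python
import re


def find_matches(page_texts, pattern):
    """Return (page number, matched text) for every regex hit.

    `page_texts` is any iterable of (page number, text) pairs.
    """
    regex = re.compile(pattern)
    hits = []
    for pageno, text in page_texts:
        for match in regex.finditer(text):
            hits.append((pageno, match.group(0)))
    return hits


# Made-up sample standing in for text extracted from a PDF.
sample = [
    (1, "Invoice No. 1234\n"),
    (2, "Total due: 56.00\n"),
    (2, "Invoice No. 98\n"),
]
hits = find_matches(sample, r"Invoice No\. \d+")
print(hits)  # [(1, 'Invoice No. 1234'), (2, 'Invoice No. 98')]
```

From there you can decide what to write to the output file, e.g. only the pages that had a hit.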
PS: It looks like this was, in a way, asked before: "How do I use pdfminer as a library", which should be helpful, too.