I've been experimenting with PDFMiner and can now extract text from a PDF and write it to an HTML or text file:

pdf2txt.py -o outputfile.txt -t txt inputfile.pdf

I then wrote a simple script to extract all lines containing a certain string:

# Read the text file produced by pdf2txt.py and print the matching lines.
with open('outputfile.txt', 'r') as searchfile:
    for line in searchfile:
        if 'HELLO' in line:
            print(line)

Now I can take all the lines containing the word HELLO and add them to my database, if that is what I want.

My question is:

Is this the only way, or can PDFMiner filter the content conditionally before writing it out to the text file, the HTML file, or even straight into the database?

James Kolber

1 Answer

Well, yes, you can: PDFMiner has an API.

The basic example says

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice

# Open a PDF file.
fp = open('mypdf.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization
# (an empty string if the PDF is not encrypted).
password = ''
document = PDFDocument(parser, password)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    # do stuff with the page here

and inside the loop, after process_page(), you should fetch the layout with

    # receive the LTPage object for the page.
    layout = device.get_result()

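and then work with the LTTextBox objects it contains; you have to walk the layout yourself. Note that get_result() is a method of PDFPageAggregator (from pdfminer.converter), not of the plain PDFDevice, so the device in the example has to be created as PDFPageAggregator(rsrcmgr, laparams=LAParams()), with LAParams imported from pdfminer.layout. There's no full example in the docs, but you can check the pdf2txt.py source, which will help you find the missing pieces (although it does much more, since it parses options and applies them).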
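To make the layout step more concrete, here is a minimal sketch of how the pieces might fit together. It assumes a pdfminer version that ships PDFPageAggregator in pdfminer.converter and LAParams/LTTextBox in pdfminer.layout; the function name iter_textbox_text is made up for illustration, and the is_extractable check from the basic example is left out for brevity.

```python
def iter_textbox_text(pdf_path, password=''):
    """Yield the text of every top-level LTTextBox, page by page.

    Hypothetical helper, not part of PDFMiner itself.
    """
    # Imports are kept inside the function so the sketch is self-contained.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    with open(pdf_path, 'rb') as fp:
        document = PDFDocument(PDFParser(fp), password)
        rsrcmgr = PDFResourceManager()
        # The aggregator device performs layout analysis and keeps the result.
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            # LTPage object for the page that was just processed.
            layout = device.get_result()
            for element in layout:
                if isinstance(element, LTTextBox):
                    yield element.get_text()
```

Because this is a generator, nothing is read until you iterate over it, e.g. with for text in iter_textbox_text('mypdf.pdf'). Text nested inside figures is not visited by this sketch; only the top-level text boxes of each page are.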

Once you have code that extracts the text, you can do whatever you want with it before saving anything to a file, including searching for certain parts of the text.
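As a rough illustration of filtering before saving, the sketch below takes any iterable of extracted text chunks and keeps only the lines that contain a keyword. The function name lines_containing and the sample strings are made up for this example; how the chunks are obtained from the PDF is up to your extraction code.

```python
def lines_containing(chunks, keyword):
    """Yield stripped lines from the text chunks that contain keyword.

    chunks is any iterable of strings, e.g. the text of each text box.
    Hypothetical helper, not part of PDFMiner.
    """
    for chunk in chunks:
        for line in chunk.splitlines():
            if keyword in line:
                yield line.strip()


# Stand-in data; in practice the chunks would come from the PDF.
sample_chunks = ["foo\nHELLO world\n", "bar HELLO\nbaz"]
matches = list(lines_containing(sample_chunks, 'HELLO'))
print(matches)
```

Only the matching lines ever reach the point where you would write them to a file or insert them into the database, so no intermediate text file is needed.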

PS: it looks like this was, in a way, asked before: "How do I use pdfminer as a library", which should be helpful, too.

YakovL
  • Would you consider this a better solution than mine above? I could imagine this would take just as much server power to handle. – James Kolber Aug 07 '16 at 18:38
  • @JamesKolber well, it is "strictly better", meaning that you can make your server do less (including not writing to the disk/DB until the data is processed), but I'm not sure "how much better" it is: depending on context, this may be a minor improvement not worth spending time on, or a major one. It depends on the expected "compression rate" and volume of information, and also on the hardware, etc. By the way, if this answers your question, don't forget to "accept" it (the tick below the upvote/downvote buttons) – YakovL Aug 07 '16 at 22:34
  • @JamesKolber have I answered your question? If so, please accept the answer; otherwise, point out what's missing. – YakovL Aug 26 '16 at 10:20