I have 400 or more PDF files that together form a single text. It's like a book split up page by page. I need to be able to search the whole text for keywords programmatically.

So my first question is: is it better to search page by page, or to join all the PDFs into one big file first and then perform the search?

The second one is: what is the best way to do it? Is there already a good program or library for this?

By the way, I'm using PHP and Python only.

Sarchophagi

1 Answer

Use PyPdf, as described here.

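import glob

import pyPdf

def getPDFContent(path):
    content = u""
    # Open the file in binary mode; the with-block closes it afterwards
    with open(path, "rb") as f:
        # Load PDF into pyPdf
        pdf = pyPdf.PdfFileReader(f)
        # Iterate pages
        for i in range(pdf.getNumPages()):
            # Extract text from page and add to content
            content += pdf.getPage(i).extractText() + u"\n"
    # Replace non-breaking spaces, then collapse whitespace
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

# Placeholders: adjust the file pattern and the keyword to your data
filelist = sorted(glob.glob("*.pdf"))
keyword = u"example"

for f in filelist:
    print(keyword in getPDFContent(f))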

It is faster and much simpler to search them one by one: you skip the merge step entirely, and you can simply loop over all the files and run the code above on each one.
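As a rough sketch of that per-file loop, the snippet below collects the names of the files that contain the keyword instead of printing a bare True/False. The names `find_matching_files` and `extract_text` are hypothetical; `extract_text` stands in for any function that takes a path and returns that file's text, such as `getPDFContent` above. Making the match case-insensitive is an assumption about what is wanted.

```python
def find_matching_files(paths, keyword, extract_text):
    """Return the paths whose extracted text contains keyword.

    extract_text is any callable mapping a path to a string; it is a
    hypothetical stand-in for a PDF text extractor such as getPDFContent.
    """
    needle = keyword.lower()
    matches = []
    for path in paths:
        # Case-insensitive substring test, one file at a time
        if needle in extract_text(path).lower():
            matches.append(path)
    return matches


# Demo with fake page texts so the sketch runs without any PDF library
fake_pages = {
    "page001.pdf": "The quick brown fox",
    "page002.pdf": "jumps over the lazy dog",
    "page003.pdf": "A Fox appears again",
}
hits = find_matching_files(sorted(fake_pages), "fox", fake_pages.get)
print(hits)
```

With real files, the call would look something like `find_matching_files(filelist, keyword, getPDFContent)`.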

BrtH
  • Nice! Thanks, BrtH! But the keyword part, I assume, is only illustrative, right? Keyword searching is much more complex and costly than that, especially over 400 pages. Am I right? I need to get the full paragraphs containing the entered keywords. I would really appreciate help and some code or a library for that. – Sarchophagi Aug 01 '14 at 23:14
  • Searching for the keywords isn't hard; it works exactly as shown in the example code. You put all the text of a page into one string and check whether the keyword is in that string: `keyword in getPDFContent(f)`. The trickier part is reporting the paragraph. You could try building a list of paragraph strings instead of one big string per page and search through that. I'm not going to write the code for that, but these hints should get you going. And I wouldn't worry one bit about performance. Searching through 400 pages should take a few minutes at most. – BrtH Aug 01 '14 at 23:46
  • Well, thank you very much. I wonder if the same code will run as is with PyPDF2... I'm just going to try it! Thank you again. – Sarchophagi Aug 03 '14 at 03:46
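The paragraph idea from the comments could be sketched roughly as follows. This assumes the extracted text still separates paragraphs with blank lines, which PDF text extraction does not guarantee, and which the whitespace-collapsing step in `getPDFContent` would destroy, so that step would have to be skipped. `split_paragraphs` and `paragraphs_with_keyword` are hypothetical helper names, not part of any library.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs on blank lines (an assumed convention)."""
    # One or more blank (possibly whitespace-only) lines end a paragraph
    parts = re.split(r"\n\s*\n", text)
    # Collapse internal whitespace and drop empty pieces
    return [" ".join(p.split()) for p in parts if p.strip()]


def paragraphs_with_keyword(text, keyword):
    """Return every paragraph containing keyword, ignoring case."""
    needle = keyword.lower()
    return [p for p in split_paragraphs(text) if needle in p.lower()]


sample = (
    "First paragraph about cats.\nStill the first.\n\n"
    "Second one about dogs.\n\n\n"
    "Third mentions Cats again."
)
found = paragraphs_with_keyword(sample, "cats")
for paragraph in found:
    print(paragraph)
```

Running this per page and keeping the file name alongside each hit would tell you both which page and which paragraph matched.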