How would you write an `is_pdf(path_to_file)` function in Python?

Question

I have a Django project that creates PDFs using Java as a background task. Sometimes the process can take a while, so the client uses polling like this:

The first request starts the build process and returns None.
Each subsequent request checks to see if the PDF has been built.
- If it has been, it returns the PDF.
- If it hasn't, it returns None again and the client schedules another request to check again in n seconds.

The problem I have is that I don't know how to check if the PDF is finished building. The Java process creates the file in stages. If I just check if the PDF exists, then the PDF that gets returned is often invalid, because it is still being built. So, what I need is an is_pdf(path_to_file) function that returns True if the file is a valid PDF and False otherwise.

I'd like to do this without a library if possible, but will use a library if necessary.

I'm on Linux.

Here is a solution that works using pdfminer, but it seems like overkill to me.

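from pdfminer.high_level import extract_text

def is_pdf(path_to_file):
    """Return True if path_to_file is a readable PDF"""
    try:
        # maxpages=1 limits extraction to the first page
        extract_text(path_to_file, maxpages=1)
        return True
    except Exception:
        # Any parsing or I/O error means the file is not (yet) a valid PDF
        return False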

I'm hoping for a solution that doesn't involve installing a large library just to check if a file is a valid PDF.

There is another thread for validating a pdf file with python. This answer should suffice for you I think : https://stackoverflow.com/a/32654661/6430256 — AntiqTech, Oct 08 '20 at 22:01
Thanks, but I've reviewed that and it does not have the answer. `PyPDF2` is no longer maintained. There may be a solution using `ReportLab`, but I'm not sure how to do it. The solution using `Popen()` looked promising, but I couldn't make that work. — Webucator, Oct 08 '20 at 22:05
I see, Popen solution is for linux environment. I'm checking reportlab module but I haven't seen anything useful to validate a pdf so far. — AntiqTech, Oct 08 '20 at 22:54
I've found this https://pypi.org/project/pdfminer.six/ seems to be still maintained as of 2020 January. I wrote some code looking at the examples on the other thread, I will post it below. See if it is of any help to you. — AntiqTech, Oct 08 '20 at 23:26
"The problem I have is that I don't know how to check if the PDF is finished building" Could you instead check whether the build process is still running? Alternately, could you modify the Java program to produce some kind of signal of the build status that your program could then check? — Karl Knechtel, Oct 08 '20 at 23:31
This approach seems wrong on a basic level. The fact that a file is *correctly formatted* for a given file format doesn't mean that it's *actually complete*. — user2357112, Oct 08 '20 at 23:34
Even if this happens to work for PDF (I don't know enough about the format to tell either way), you'd be hardcoding a dependency on obscure details of one particular file format, which may not hold when you need to support a different format, or newer versions of the same format. — user2357112, Oct 08 '20 at 23:38
@karl-knechtel, that's a good idea, but unfortunately, the Java software is a blackbox. It's a jar file that doesn't give me any status info. — Webucator, Oct 08 '20 at 23:47
@user2357112-supports-monica, my use case is strictly PDF-based, and as I know the process by which the file is built, I can be pretty sure, the PDF won't be valid until that process is complete. — Webucator, Oct 08 '20 at 23:48
You may still be able to figure out from the operating system what the PID is for the Java process, and monitor it. Of course, if it's an always-running service, you may be out of luck. — Karl Knechtel, Oct 08 '20 at 23:55
I updated my answer, added another example for PDFParser and PDFDocument. Provided that open() function doesn't throw an exception, PDFDocument or PDFParser might throw one. If no exception is thrown, PDFDocument.info attribute might be useful. — AntiqTech, Oct 09 '20 at 00:00
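
Karl Knechtel's suggestion in the comments above, checking whether the build process is still running, could be sketched roughly as follows. This assumes the Django side launches the jar itself with `subprocess.Popen` and can hold on to the handle between polls, which the question does not confirm. `start_build` and `build_finished` are hypothetical names, and the demo uses a short-lived Python child process as a stand-in for the Java job.

```python
import subprocess
import sys


def start_build(command):
    """Launch the PDF build without waiting for it (hypothetical helper)."""
    # In the question's setup the command would start the Java jar;
    # the exact arguments are not given there.
    return subprocess.Popen(command)


def build_finished(process):
    """Return True once the process has exited with exit code 0."""
    # poll() returns None while the child is still running,
    # otherwise it returns the exit code.
    return_code = process.poll()
    return return_code is not None and return_code == 0


if __name__ == "__main__":
    # Stand-in for the Java job: a child Python process that sleeps briefly.
    job = start_build([sys.executable, "-c", "import time; time.sleep(0.5)"])
    print("finished right after start:", build_finished(job))
    job.wait()
    print("finished after wait():", build_finished(job))
```

The appeal of this approach is that it does not depend on the PDF format at all. The catch is that the process handle has to survive between requests, which this sketch leaves out.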

AntiqTech · Answer 1 · 2020-10-09T00:12:52.630

0

I found pdfminer.six (pypi.org/project/pdfminer.six) and put together a simple example; see if it is useful to you. Here a.pdf is an empty file. I don't know how it will behave when reading a PDF file that is still being written by another program.

from pdfminer.high_level import extract_text

try:
    # extract_text raises if the file cannot be parsed as a PDF
    text = extract_text("D:\\a.pdf")
    print(text)
except Exception:
    print("invalid PDF file")

-- update --

Alternatively, there is an example of PDFDocument in the pdfminer GitHub repository, at line 53 of https://github.com/pdfminer/pdfminer.six/blob/develop/tools/pdfstats.py.

I wrote a similar example:

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser

try:
    # Open in binary mode; the with block closes the file afterwards
    with open("D:\\a.pdf", 'rb') as pdf_file:
        parser = PDFParser(pdf_file)
        password = ''
        # Building the document reads the xref tables and trailer
        document = PDFDocument(parser, password)
        print(document.info)
        print(document.xrefs)
except Exception:
    print("invalid PDF file")

In my example a.pdf is empty, so the exception is raised during parsing (open() itself succeeds on an empty file). In your case, I'm guessing the file will open fine but PDFParser or PDFDocument may throw an exception. If no exception is thrown, the PDFDocument.info attribute might be useful.

-- update 2 --

I've realized that the document object has an xrefs attribute. There is an explanation in the PDFParser class: "It also reads XRefs at the end of every PDF file." Checking the value of document.xrefs might be useful.
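
Since the cross-reference pointer sits at the end of the file, a rough check that needs no library could look only at the first bytes and the tail. The sketch below is a heuristic, not a validator and not part of pdfminer. It assumes the Java writer emits the file sequentially, so the trailer appears last, which the question does not confirm. `looks_like_finished_pdf` and `tail_size` are hypothetical names.

```python
import os


def looks_like_finished_pdf(path_to_file, tail_size=1024):
    """Heuristic: '%PDF-' header at the start, 'startxref' and '%%EOF' near the end."""
    try:
        with open(path_to_file, "rb") as pdf_file:
            # A PDF normally begins with a '%PDF-' version header.
            if pdf_file.read(5) != b"%PDF-":
                return False
            # Read only the last tail_size bytes (or the whole file if smaller).
            file_size = pdf_file.seek(0, os.SEEK_END)
            pdf_file.seek(max(0, file_size - tail_size))
            tail = pdf_file.read()
    except OSError:
        # Missing or unreadable file: treat it as not ready.
        return False
    # The trailer gives the xref offset after 'startxref' and ends with '%%EOF'.
    return b"startxref" in tail and b"%%EOF" in tail
```

It reads a few bytes from the start and about a kilobyte from the end, so the cost does not grow with the size of the PDF. A file that passes can still be malformed, so a parser such as PDFDocument remains the stricter test.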

edited Oct 09 '20 at 00:12

answered Oct 08 '20 at 23:30

AntiqTech


This is very similar to the solution I added to my question after your initial comment. It's the best solution I've come up with so far, but it seems like overkill. If the PDF is large, which mine are, it has to extract all the text just to see if it's a valid PDF. – Webucator Oct 08 '20 at 23:39
Someone voted this down, presumably because it's so similar to the code I added to the question, but to be fair to @AntiqTech, I added that solution after their original comment. – Webucator Oct 08 '20 at 23:41
Ah, I didn't realize you had added that to your question. While I was checking pdfminer, I found this file on its GitHub: https://github.com/pdfminer/pdfminer.six/blob/develop/tools/pdfstats.py There is a PDFDocument() example on line 53. The PDFDocument object has an info attribute. Maybe you can try it to extract info. If it throws an exception, you can treat the file as invalid. I'll add the example code to my answer. – AntiqTech Oct 08 '20 at 23:49
This works and may be a good alternative. I expect it's more efficient than extracting all the text from the PDF. I'm going to hold out for a bit to see if someone can provide an answer that doesn't involve installing and importing such a big library. – Webucator Oct 09 '20 at 00:09
I've realized that the document has another attribute: xrefs. It seems that PDFParser reads the xrefs at the end of a PDF file. This means your PDF files should not have xrefs yet while they are still being processed, even if they can be opened. Checking document.xrefs might be useful. – AntiqTech Oct 09 '20 at 00:15
Checking for `xrefs` is not necessary, as `PDFDocument()` fails if the PDF is not valid. FYI, I've updated the code in my question to include `maxpages=1` so it doesn't extract the text for the whole PDF when the PDF is valid. – Webucator Oct 09 '20 at 09:33

score -1 · Answer 2 · answered Oct 09 '20 at 01:24

I suspect you could just write a script that emails you or a team distribution list and simply lists all the files in the directory. However, if you're only asking how to natively search a directory without installing modules, I would import os and re.

import os

# ***** Search File *****
# List the names of all entries in the directory
files = os.listdir(r"C:\Users\PATH")
print(files)
