Using Perl, what is the best way to determine whether a file is a PDF?

Apparently, not all PDFs start with %PDF. See the comments on this answer: https://stackoverflow.com/a/941962/327528

CJ7
  • How good does the detection need to be? Do you just need to detect common PDF files to whitelist them, or should it detect every file that could possibly be opened as a PDF in order to blacklist them? The latter is much harder, since legal PDF files can contain data before the magic %PDF string and thus trick you into thinking that the file is an image, etc., and not a PDF. – Steffen Ullrich Apr 01 '16 at 04:32
  • *Apparently, not all PDFs start with %PDF* - all *valid* PDFs (according to the specification) do start with "%PDF-1". Some PDF viewers accept invalid PDFs too, though, and so leave a different impression. – mkl Apr 01 '16 at 05:45

2 Answers

Detecting a PDF is not hard, but there are some corner cases to be aware of.

  1. All conforming PDFs contain a one-line header identifying the PDF specification to which the file conforms. Usually it's %PDF-1.N where N is a digit between 0 and 7.
    • The third edition of the PDF Reference has an implementation note that Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of a file. (I've seen cases where a job control prefix was added to the start of a PDF file, so '%PDF-1.' wasn't the first seven bytes of the file.)
    • The subsequent implementation note from the third edition (PDF 1.4) states that Acrobat viewers will also accept a header of the form %!PS-Adobe-N.n PDF-M.m, but note that this isn't part of the ISO 32000-1:2008 (PDF 1.7) specification.
    • If the file doesn't begin immediately with %PDF-1.N, be careful: I've seen a case where a zip file containing a PDF was mistakenly identified as a PDF because that part of the embedded file wasn't compressed. A check for the PDF file trailer is therefore a good idea.
  2. The end of a PDF will contain a line with '%%EOF'.
    • The third edition of the PDF Reference has an implementation note that Acrobat viewers require only that the %%EOF marker appear within the last 1024 bytes of a file.
    • Two lines above the %%EOF should be the 'startxref' token, and the line in between should be a number giving the byte offset from the start of the file to the last cross-reference table.

In sum, read the first and last 1 KB of the file into byte buffers and check that the relevant identifying byte strings are approximately where they are supposed to be. If they are, you have a reasonable expectation that you have a PDF file on your hands.
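The check described above could be sketched in Perl roughly as follows. This is a minimal heuristic, not a validator: the sub name `looks_like_pdf` and the exact regexes are my own assumptions, it only looks for the plain %PDF-N.n header form (not the %!PS-Adobe variant), and it assumes the startxref / offset / %%EOF lines all fall within the last 1024 bytes.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from any CPAN module): returns 1 if the file has
# a PDF header in its first 1 KB and a startxref/%%EOF trailer in its last
# 1 KB, and 0 otherwise.
sub looks_like_pdf {
    my ($path) = @_;
    my $window = 1024;    # search window from the Acrobat implementation notes

    open my $fh, '<:raw', $path or return 0;
    my $size = -s $fh;
    return 0 unless $size;    # empty or unstat-able file

    # First 1 KB: the header may be preceded by a few non-PDF bytes.
    my $got = read($fh, my $head, $window);
    return 0 unless defined $got;

    # Last 1 KB: should contain startxref, a byte offset, then %%EOF.
    my $tail_start = $size > $window ? $size - $window : 0;
    seek($fh, $tail_start, 0) or return 0;
    $got = read($fh, my $tail, $window);
    return 0 unless defined $got;
    close $fh;

    my $has_header  = $head =~ /%PDF-\d\.\d/;
    my $has_trailer = $tail =~ /startxref\s+\d+\s+%%EOF/;

    return ($has_header && $has_trailer) ? 1 : 0;
}

# Usage: perl check_pdf.pl file1 file2 ...
print "$_: ", (looks_like_pdf($_) ? "probably a PDF" : "not a PDF"), "\n"
    for @ARGV;
```

Requiring both the header and the trailer is what guards against the zip-file false positive mentioned in the list: an uncompressed PDF embedded in an archive can show %PDF- near the start, but the archive itself will not normally end with startxref and %%EOF.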

Patrick Gallot

The module PDF::Parse has a method called IsaPDF, which:

Returns true, if the file could be parsed and is a PDF-file.

Joel