36

Is there any way of checking if a byte[] is a PDF without opening it?

I have some code that displays a list of byte[] as PDF thumbnails. Previously I knew every byte[] was a PDF, because we filtered the servlet to return only those. Now the requirement has changed and I need to bring all file types back. Is there any way of checking what the byte[] is, or more specifically determining that it isn't a PDF?

Randy Levy
rik
  • 1
    Maybe this can be of some assistance: http://stackoverflow.com/questions/2731917/how-to-detect-if-a-file-is-pdf-or-tiff – JWL_ May 31 '11 at 11:41
  • 2
    -1: Open a hex editor and see the header of a PDF. Not hard. Answer: `%PDF` is the 1st 4 bytes. – leppie May 31 '11 at 11:41
  • @leppie: some formats have no such specification (CSV, for example). So until you find the "official" specification, it's a very bad idea to just "open a hex editor". The JPEG format, for example, is not so easy :) – chopikadze Jan 03 '12 at 06:25
  • @chopikadze: Who was talking about other file formats except you? And yes JPEG is easy, `FF D8 DD E0` – leppie Jan 03 '12 at 06:42
  • 1
    @leppie: JPEG is FF D8 *FF*, and instead of E0 you can sometimes get E1 (from photo cameras). In general, I meant that formats are sometimes not as easy as they seem at first glance. Nothing more. – chopikadze Jan 03 '12 at 07:36
  • @chopikadze: Oops, that `DD` was a typo :) – leppie Jan 03 '12 at 09:10

6 Answers

62

Check the first 4 bytes of the array.

If those are 0x25 0x50 0x44 0x46 then it's most probably a PDF file.
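A minimal C# sketch of that check, assuming the data is already in memory as a `byte[]`. The names `PdfSniffer` and `StartsWithPdfMagic` are hypothetical, not part of any library. It only tells you the array begins with the ASCII codes for `%PDF`; it does not validate the rest of the file.

```csharp
public static class PdfSniffer
{
    // ASCII codes for "%PDF"
    private static readonly byte[] PdfMagic = { 0x25, 0x50, 0x44, 0x46 };

    // Returns true when the array begins with the four-byte PDF signature.
    // Null or too-short input is reported as "not a PDF" instead of throwing.
    public static bool StartsWithPdfMagic(byte[] bytes)
    {
        if (bytes == null || bytes.Length < PdfMagic.Length) return false;
        for (var i = 0; i < PdfMagic.Length; i++)
        {
            if (bytes[i] != PdfMagic[i]) return false;
        }
        return true;
    }
}
```

For the thumbnail list in the question, this could be used as a filter before rendering, for example `files.Where(PdfSniffer.StartsWithPdfMagic)` with `System.Linq`.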

  • 7
    I used this answer for quite some years, and now I'm staring at a PDF that starts with 0xEF 0xBB 0xBF. Any idea? – MichaelD Feb 09 '15 at 20:36
  • 8
    It appears these bytes are prepended to a UTF-8 formatted PDF. This means you cannot blindly check for 0x25 0x50 ... – MichaelD Feb 09 '15 at 20:55
  • Older PDF files could have the `%PDF` magic anywhere in the first 1,024 bytes, so this technique won't always work on all PDF files. – hippietrail Nov 23 '19 at 16:53
  • 1
    @MichaelD. That is the UTF-8 [BOM \(Byte Order Mark\)](https://en.wikipedia.org/wiki/Byte_order_mark). It can appear at the beginning of just about any file format which is in UTF-8 Unicode text, whether or not the specs say so. A bit annoying really. – hippietrail Nov 23 '19 at 16:55
  • *"It can appear at the beginning of just about any file format which is in UTF-8 Unicode text, whether or not the specs say so."* - "It can" as in "In the wild one can find such documents" but not as in "This prefix is valid". If the specific specification requires utf-8 text to be used in a special way, e.g. without bom, then a bom is invalid there. – mkl Nov 23 '19 at 19:10
  • 1
    *"Older PDF files could have the %PDF magic anywhere in the first 1,024 bytes"* - again yes, as in "there are documents in the wild like that which mostly are older", but nonetheless those pdfs never were "valid" pdfs, merely processors were lax enough to ignore that error. – mkl Nov 23 '19 at 19:15
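The BOM case raised in these comments could be handled by skipping the three bytes 0xEF 0xBB 0xBF before comparing. The following is a sketch under that assumption, with hypothetical names (`BomTolerantPdfSniffer`, `StartsWithPdfMagic`). As mkl points out, a file with such a prefix is tolerated by viewers rather than valid, so whether to accept it is up to the application.

```csharp
public static class BomTolerantPdfSniffer
{
    // Returns true when "%PDF" sits at offset 0, or at offset 3
    // directly after a UTF-8 byte order mark (EF BB BF).
    public static bool StartsWithPdfMagic(byte[] bytes)
    {
        if (bytes == null) return false;

        var hasBom = bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
        var offset = hasBom ? 3 : 0;

        // Need four more bytes after the optional BOM.
        if (bytes.Length - offset < 4) return false;

        return bytes[offset] == 0x25      // '%'
            && bytes[offset + 1] == 0x50  // 'P'
            && bytes[offset + 2] == 0x44  // 'D'
            && bytes[offset + 3] == 0x46; // 'F'
    }
}
```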
17

The first four bytes should be 0x25 0x50 0x44 0x46 (in hex; in ASCII that is %PDF). You can find "magic numbers" for other formats here

chopikadze
11

As far as I know, all PDFs start with %PDF, so you could check the first bytes against this string.

DanielB
7

While the marked answer and the other answers are correct, they will not be successful 100% of the time. The problem is the PDF spec says the %PDF-1.x only needs to be in the first 1024 bytes and not the first 4. Some programs will add information before %PDF and still be valid.

I would recommend seeing the answer for the following Stack Overflow question: How to detect if a file is PDF or TIFF?

Community
  • 2
    *The problem is the PDF spec says the %PDF-1.x only needs to be in the first 1024 bytes and not the first 4* - This is wrong, the specification (ISO 32000-1) clearly says "**The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7**". Even the Adobe PDF references similarly say "The first line of a PDF file is a header identifying the version of the PDF specification to which the file conforms" and offer the same variants as the specification. Merely... – mkl Mar 11 '16 at 11:32
  • 1
    ... Merely the ***implementation notes*** of the Adobe PDF references say that "**Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.**" Thus, "Some programs will add information before %PDF and still be valid." is wrong, the created PDFs are ***not valid***, they merely are accepted and displayed by a number of viewers in spite of being broken; they also are rejected by numerous other PDF processors. – mkl Mar 11 '16 at 11:34
  • The values for %PDF-1.x can appear further in than the first few characters and still be valid, contrary to what you mention. I have several valid PDF files in which %PDF-1.x occurs outside of the first 8 characters, which is why I was searching for a good answer to this problem. Unfortunately, all but one post say to match the first few characters against %PDF-1.x. Having a few valid files that fail that approach led me to point out that checking only the first few characters does not always work, as I said in the post, and to recommend the other method. – Consulting Mechanic Mar 11 '16 at 17:41
  • 5
    By which criteria do you call them valid? They clearly violate the specification (which is the ISO norm, not some Adobe reference). Selected products like Adobe acrobat and reader may accept those files but that doesn't make them valid. – mkl Mar 11 '16 at 18:05
  • @mkl: The older versions of the spec detailed the part about `%PDF` being anywhere in the first 1024 bytes. I have read that the spec was changed when it went from being a proprietary Adobe spec to an open spec. In any case, it's up to each implementer whether they want to recognize and/or support files which conformed to prior versions of the spec or only the current version. – hippietrail Nov 23 '19 at 17:01
  • 1
    No, the older pdf references also required a pdf file to start with `%PDF`. Merely in the *implementation notes* it states that *Acrobat viewers* require only that it appears in the first 1024 bytes of the file. Thus, the reader was laxer than the specification. – mkl Nov 23 '19 at 17:32
1

If anyone wants some C# code based on looking for "%PDF" in the first 1024 bytes, here's some:

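    public bool IsAPdf(byte[] bytes) {
        // Explicit null check: `bytes?.Length < 4` is false for null and would fall through.
        if (bytes == null || bytes.Length < 4) return false;
        // Scan only the first 1024 bytes; stop 3 short so bytes[i+3] stays in range.
        var stopBefore = Math.Min(bytes.Length, 1024) - 3;
        for (var i = 0; i < stopBefore; i++)
            if (bytes[i] == '%'
                && bytes[i+1] == 'P'
                && bytes[i+2] == 'D'
                && bytes[i+3] == 'F') return true;
        return false;
    }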
mrrrk
-1

I've been having this problem. We use some Magic library from GitHub that determines content as PDF very well. However, we've been receiving some files that

  1. do open in PDF readers
  2. do have different start bytes (5) before %PDF-
  3. do end with these 8 bytes: 0A 0D 0A 30 0D 0A 0D 0A

So, I've added logic to check for these starting bytes 5-9, and 8 bytes in the end, when a file with PDF extension is not matched otherwise.

T.S.
  • In current software, putting anything before the %PDF or after the %%EOF can be considered a bug (unless the PDFs are not meant for distribution but merely for some special printer queue, for example). – mkl Oct 15 '20 at 06:25
  • Do you mean "pdf software"? I don't know where clients get these files. But they do. Is there an official reading on this? Because if I can prove that having these bytes is illegal, we might just push it back against clients – T.S. Oct 15 '20 at 13:21
  • The PDF specification clearly requires %PDF in the first line and %%EOF in the last line of the PDF. See ISO 32000 parts 1 and 2 – mkl Oct 15 '20 at 16:09
  • @mkl Thank you. But is it fair to say that when a PDF reader can open a file, and my program can determine that the file does in fact have `%PDF` close to the beginning and `%%EOF` close to the end, it is a PDF? – T.S. Oct 15 '20 at 16:19
  • Actually PDF viewers are *very* lax. They ignore/repair many errors without telling the user. But they usually only display files that conceptually are PDFs (unlike word processors, which often also accept plain text or HTML in addition to actual word processor formats). – mkl Oct 15 '20 at 17:16
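The lenient heuristic discussed in this thread (`%PDF` close to the beginning and `%%EOF` close to the end) might be sketched in C# as follows. The names `LenientPdfSniffer` and `LooksLikePdf` are hypothetical, and the 1024-byte window at each end is an assumption borrowed from the viewer behaviour mentioned above, not a requirement of ISO 32000. A match means "a lax viewer would probably try to open this", not "this is a valid PDF".

```csharp
using System;

public static class LenientPdfSniffer
{
    // Assumed search window at each end of the file; not mandated by the specification.
    private const int Window = 1024;

    // True when "%PDF" occurs in the first Window bytes
    // and "%%EOF" occurs in the last Window bytes.
    public static bool LooksLikePdf(byte[] bytes)
    {
        if (bytes == null) return false;
        var headEnd = Math.Min(bytes.Length, Window);
        var tailStart = Math.Max(0, bytes.Length - Window);
        return IndexOf(bytes, "%PDF", 0, headEnd) >= 0
            && IndexOf(bytes, "%%EOF", tailStart, bytes.Length) >= 0;
    }

    // Finds an ASCII marker that lies entirely inside bytes[start, end); returns -1 if absent.
    private static int IndexOf(byte[] bytes, string marker, int start, int end)
    {
        for (var i = start; i + marker.Length <= end; i++)
        {
            var match = true;
            for (var j = 0; j < marker.Length && match; j++)
                match = bytes[i + j] == (byte)marker[j];
            if (match) return i;
        }
        return -1;
    }
}
```

A strict caller could run the exact header check first and fall back to this only for files that carry a PDF extension, which is roughly the approach described in the answer.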