How can I determine if a file is a PDF file?

Question

I am using PdfBox in Java to extract text from PDF files. Some of the input files provided are not valid and PDFTextStripper halts on these files. Is there a clean way to check if the provided file is indeed a valid PDF?

My Kingdom?? any link or description? — Technoshaft, Sep 28 '18 at 15:51

score 29 · Answer 1 · answered Feb 09 '10 at 11:10

Here is what I use in my NUnit tests, which must validate multiple versions of PDF generated by Crystal Reports:

public static void CheckIsPDF(byte[] data)
    {
        Assert.IsNotNull(data);
        Assert.Greater(data.Length, 7); // bytes 5..7 (version) and the last 7 bytes are read below

        // header 
        Assert.AreEqual(data[0],0x25); // %
        Assert.AreEqual(data[1],0x50); // P
        Assert.AreEqual(data[2],0x44); // D
        Assert.AreEqual(data[3],0x46); // F
        Assert.AreEqual(data[4],0x2D); // -

        if(data[5]==0x31 && data[6]==0x2E && data[7]==0x33) // version is 1.3 ?
        {                  
            // file terminator
            Assert.AreEqual(data[data.Length-7],0x25); // %
            Assert.AreEqual(data[data.Length-6],0x25); // %
            Assert.AreEqual(data[data.Length-5],0x45); // E
            Assert.AreEqual(data[data.Length-4],0x4F); // O
            Assert.AreEqual(data[data.Length-3],0x46); // F
            Assert.AreEqual(data[data.Length-2],0x20); // SPACE
            Assert.AreEqual(data[data.Length-1],0x0A); // EOL
            return;
        }

        if(data[5]==0x31 && data[6]==0x2E && data[7]==0x34) // version is 1.4 ?
        {
            // file terminator
            Assert.AreEqual(data[data.Length-6],0x25); // %
            Assert.AreEqual(data[data.Length-5],0x25); // %
            Assert.AreEqual(data[data.Length-4],0x45); // E
            Assert.AreEqual(data[data.Length-3],0x4F); // O
            Assert.AreEqual(data[data.Length-2],0x46); // F
            Assert.AreEqual(data[data.Length-1],0x0A); // EOL
            return;
        }

        Assert.Fail("Unsupported file format");
    }

Thanks, this just helped me figure out what was going wrong with the PDF I was generating -- an EOL problem only showed in Adobe Reader, not Foxit/GoogleApps/Sumatra. — Michael Greene, Jun 08 '10 at 02:07
Is this in Java? Also it'll not detect encrypted PDFs. Since the OP wants to extract info you need that too. — cherouvim, Feb 18 '11 at 13:57
Thanks! I really appreciate that this answer is library agnostic. It saved me a bunch of time =) — Spina, Aug 10 '13 at 15:36
In version 1.3, the space after EOF does not always appear before the EOL. — Mr. Polywhirl, Mar 07 '19 at 14:29
By now (almost a decade after the original answer was posted) there are many more PDF versions, so be careful if you intend to just copy and paste the above code! — 1_bug, Apr 17 '19 at 08:35
@1_bug you were foreshadowing! I had a problem with the 1.6 format; for now I am just checking the "25 50 44 46 2D" group! — Danielson Alves Júnior, Jan 16 '20 at 18:46
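
The last comment's fallback, checking only the "25 50 44 46 2D" group, could be sketched in Java roughly as follows. The class and method names are hypothetical, and the check only says that the data starts like a PDF, not that the file is valid.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any answer in this thread.
final class PdfHeaderCheck {
    // "%PDF-" as raw bytes: 25 50 44 46 2D
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if data starts with the five bytes "%PDF-", whatever the version. */
    static boolean hasPdfHeader(byte[] data) {
        if (data == null || data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because no version or terminator bytes are inspected, a 1.6 or 2.0 file passes the same way a 1.3 file does.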

score 12 · Accepted Answer · answered Jun 06 '09 at 13:12

You can find out the MIME type of a file (or byte array), so you don't have to rely on the extension. I do it with Aperture's MimeExtractor (http://aperture.sourceforge.net/); a few days ago I also came across a library just for that (http://sourceforge.net/projects/mime-util).

I use Aperture to extract text from a variety of files, not only PDF, but I have to tweak things for PDFs, for example (Aperture uses PDFBox, but I added another library as a fallback for when PDFBox fails).

Persimmonium

Oh, I forgot to mention there is now an apache project for text extraction, http://lucene.apache.org/tika/, in case you prefer it to aperture – Persimmonium Jun 08 '09 at 09:51
read the question properly: the question was NOT about using PDFBox, but on a way to 'check if the provided file is indeed a valid PDF' – Persimmonium Feb 18 '11 at 13:51
I see "using PdfBox by Apache" in the question's title. If the problem is solvable using PDFBox isn't it better than by introducing extra dependencies? – cherouvim Feb 18 '11 at 13:56

score 11 · Answer 3 · answered Feb 18 '11 at 13:47

Since you use PDFBox you can simply do:

PDDocument.load(file);

It'll fail with an Exception if the PDF is corrupted etc.

If it succeeds you can also check if the PDF is encrypted using .isEncrypted()

cherouvim

From what I've seen, that's not true. I can use PDDocument.load( stream ) to load a corrupted PDF. I only get an error when attempting to save the PDF after modifying its permissions. – MonkeyWrench May 21 '13 at 20:23
Using Exceptions for application flow is bad practice. – Ben Turner Aug 02 '13 at 08:28
@BenTurner: You are correct and I am with you on that. The API doesn't give us a way to check for file validity though. – cherouvim Aug 02 '13 at 09:01
This does not always throw an exception. http://stackoverflow.com/questions/20004290/how-to-set-a-load-timeout-why-does-not-pdfbox-throw-exception – Aleksei Nikolaevich Nov 15 '13 at 20:16
What about PDDocument.load(file).getNumberOfPages() ? This is what I do and I have not yet experienced a non-valid PDF-file where PDFBox could count the number of pages. – Vering Aug 25 '16 at 08:24
I have a corrupted PDF. It will be identified as corrupted by iText but not by PDDocument.load(file) of PDFBox!! – Mohsen Abasi Apr 22 '17 at 05:45

score 8 · Answer 4 · answered Feb 19 '16 at 23:47

Here is an adapted Java version of NinjaCross's code.

/**
 * Test if the data in the given byte array represents a PDF file.
 */
public static boolean is_pdf(byte[] data) {
    if (data != null && data.length > 7 && // bytes 5..7 and the last 7 bytes are read below
            data[0] == 0x25 && // %
            data[1] == 0x50 && // P
            data[2] == 0x44 && // D
            data[3] == 0x46 && // F
            data[4] == 0x2D) { // -

        // version 1.3 file terminator
        if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x33 &&
                data[data.length - 7] == 0x25 && // %
                data[data.length - 6] == 0x25 && // %
                data[data.length - 5] == 0x45 && // E
                data[data.length - 4] == 0x4F && // O
                data[data.length - 3] == 0x46 && // F
                data[data.length - 2] == 0x20 && // SPACE
                data[data.length - 1] == 0x0A) { // EOL
            return true;
        }

        // version 1.4 file terminator
        if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x34 &&
                data[data.length - 6] == 0x25 && // %
                data[data.length - 5] == 0x25 && // %
                data[data.length - 4] == 0x45 && // E
                data[data.length - 3] == 0x4F && // O
                data[data.length - 2] == 0x46 && // F
                data[data.length - 1] == 0x0A) { // EOL
            return true;
        }
    }
    return false;
}

And some simple unit tests:

@Test
public void test_valid_pdf_1_3_data_is_pdf() {
    assertTrue(is_pdf("%PDF-1.3 CONTENT %%EOF \n".getBytes()));
}

@Test
public void test_valid_pdf_1_4_data_is_pdf() {
    assertTrue(is_pdf("%PDF-1.4 CONTENT %%EOF\n".getBytes()));
}

@Test
public void test_invalid_data_is_not_pdf() {
    assertFalse(is_pdf("Hello World".getBytes()));
}

If you come up with any failing unit tests, please let me know.

arjun kumar · Answer 5 · 2017-05-17T17:01:11.040

I was using some of the suggestions I found here and on other sites/posts for determining whether a pdf was valid or not. I purposely corrupted a pdf file, and unfortunately, many of the solutions did not detect that the file was corrupted.

Eventually, after tinkering around with different methods in the API, I tried this:

PDDocument.load(file).getPage(0).getContents().toString();

This did not throw an exception, but it did output this:

 WARN  [COSParser:1154] The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 171, length: 1145844, expected end position: 1146015

Personally, I wanted an exception to be thrown if the file was corrupted so I could handle it myself, but it appeared that the API I was using already handled such problems in its own way.

To get around this, I decided to try parsing the files using the class that gave the warning (COSParser). I found that there was a subclass, called PDFParser, which inherited a method called "setLenient", which was the key (https://pdfbox.apache.org/docs/2.0.4/javadocs/org/apache/pdfbox/pdfparser/COSParser.html).

I then implemented the following:

        RandomAccessFile accessFile = new RandomAccessFile(file, "r");
        PDFParser parser = new PDFParser(accessFile); 
        parser.setLenient(false);
        parser.parse();

This threw an Exception for my corrupted file, as I wanted. Hope this helps someone out!

score 5 · Answer 6 · edited Apr 29 '15 at 05:29

You have to try this....

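public boolean isPDF(File file) throws FileNotFoundException {
    // Scan the given file line by line for the "%PDF-" marker;
    // try-with-resources closes the Scanner (and the underlying reader).
    try (Scanner input = new Scanner(new FileReader(file))) {
        while (input.hasNextLine()) {
            final String checkline = input.nextLine();
            if (checkline.contains("%PDF-")) {
                // a match!
                return true;
            }
        }
    }
    return false;
}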

answered Feb 23 '14 at 17:21 by Sheel · edited Apr 29 '15 at 05:29 by Niroshan

This answer troubles me... Are there PDFs that do not begin with "%PDF-" but just contain it? Why the trouble of reading the whole file? What if I check a 2 GB zip file? – boumbh Jul 15 '15 at 04:58
for larger files, files with size of 10+MB and wrong extensions (for example mp3File.pdf), it will take a lot of time (like 5 or more seconds) – shanraisshan Aug 04 '16 at 06:51
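
To address the cost raised in these comments, a check can stop after the first five bytes instead of scanning every line. The following is a minimal sketch, assuming Java 9 or later for `InputStream.readNBytes`; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of any answer in this thread.
final class PdfFileSniffer {
    private static final byte[] MAGIC = {0x25, 0x50, 0x44, 0x46, 0x2D}; // %PDF-

    /** Reads at most five bytes from the file and compares them with "%PDF-". */
    static boolean startsWithPdfHeader(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[MAGIC.length];
            // readNBytes blocks until the buffer is full or the stream ends.
            int read = in.readNBytes(head, 0, head.length);
            return read == MAGIC.length && Arrays.equals(head, MAGIC);
        }
    }
}
```

The running time no longer depends on the file size, so a 2 GB zip file with a .pdf extension is rejected after one small read.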

score 5 · Answer 7 · answered Jun 02 '09 at 21:15

PDF files begin with "%PDF" (open one in TextPad or similar and take a look).

Any reason you can't just read the start of the file with a FileReader and check for this?

cagcowboy

I have tried this, and it appears that PDF Files can use a variety of encodings and the text read sometimes does not match %PDF for valid and readable PDF files. – Jun 02 '09 at 21:19
Not all files that begin with %PDF are valid PDF files. – Kyle W. Cartmell Jun 02 '09 at 22:03
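
One way to sidestep the encoding problem mentioned in the first comment is to compare raw bytes instead of decoded text, and to tolerate a few bytes (such as a byte-order mark) before the marker. The sketch below does that; the 1024-byte search window and all names are assumptions made for illustration, not something stated in this thread.

```java
// Hypothetical helper, not part of any answer in this thread.
final class PdfHeaderScan {
    private static final byte[] MAGIC = {0x25, 0x50, 0x44, 0x46, 0x2D}; // %PDF-
    private static final int WINDOW = 1024; // assumption: how far into the data to look

    /** Returns the offset of "%PDF-" within the first WINDOW bytes, or -1 if absent. */
    static int findHeader(byte[] data) {
        if (data == null) {
            return -1;
        }
        // Last start offset at which the whole marker still fits in the window.
        int limit = Math.min(data.length, WINDOW) - MAGIC.length;
        for (int start = 0; start <= limit; start++) {
            boolean match = true;
            for (int i = 0; i < MAGIC.length; i++) {
                if (data[start + i] != MAGIC[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return start;
            }
        }
        return -1;
    }
}
```

As the second comment points out, finding the marker still does not prove the file is a valid PDF; it only rules out files that cannot be one.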

score 4 · Answer 8 · answered Apr 13 '16 at 20:13

Maybe I am too late to answer, but you should have a look at Tika. It uses the PDFBox parser internally to parse PDFs.

You just need to add tika-app-latest*.jar to your classpath.

public String parseToStringExample() throws IOException, SAXException, TikaException {
    Tika tika = new Tika();
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test.pdf")) {
        return tika.parseToString(stream); // This should return the PDF's text
    }
}

It would be a much cleaner solution. You can refer here for more details on Tika usage: https://tika.apache.org/1.12/api/

score 3 · Answer 9 · edited May 31 '17 at 08:50

The answer by Roger Keays is wrong, since not all PDF files are version 1.3, and not all of them are terminated by an EOL. The version below works for all non-corrupted PDF files:

public static boolean is_pdf(byte[] data) {
    if (data != null && data.length > 6 // the last 7 bytes are read below
            && data[0] == 0x25 && // %
            data[1] == 0x50 && // P
            data[2] == 0x44 && // D
            data[3] == 0x46 && // F
            data[4] == 0x2D) { // -

        // version 1.3 file terminator
        if (//data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x33 &&
                data[data.length - 7] == 0x25 && // %
                data[data.length - 6] == 0x25 && // %
                data[data.length - 5] == 0x45 && // E
                data[data.length - 4] == 0x4F && // O
                data[data.length - 3] == 0x46 && // F
                data[data.length - 2] == 0x20 // SPACE
                //&& data[data.length - 1] == 0x0A// EOL
                ) {
            return true;
        }

        // version 1.4 file terminator
        if (//data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x34 &&
                data[data.length - 6] == 0x25 && // %
                data[data.length - 5] == 0x25 && // %
                data[data.length - 4] == 0x45 && // E
                data[data.length - 3] == 0x4F && // O
                data[data.length - 2] == 0x46 // F
                //&& data[data.length - 1] == 0x0A // EOL
                ) {
            return true;
        }
    }
    return false;
}

The `%%EOF` must be the only content of the last line of the PDF. Thus, files with a space after the `%%EOF` strictly speaking are invalid. There only may be a line delimiter after it, i.e. a single CR, a single LF, or a CR LF pair. — mkl, May 29 '17 at 13:13
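
The rule described in this comment can be sketched as a byte-level check: after `%%EOF` there may be nothing, a single CR, a single LF, or a CR LF pair, and anything else (such as a trailing space) is rejected. The class and method names are hypothetical, and the sketch does not verify that `%%EOF` starts its own line.

```java
// Hypothetical helper, not part of any answer in this thread.
final class PdfTrailerCheck {
    private static final byte[] EOF_MARKER = {0x25, 0x25, 0x45, 0x4F, 0x46}; // %%EOF

    /** True if data ends with %%EOF followed by nothing, CR, LF, or CR LF. */
    static boolean endsWithEofMarker(byte[] data) {
        if (data == null) {
            return false;
        }
        int end = data.length;
        // Strip at most one line delimiter: LF, CR, or CR LF.
        if (end > 0 && data[end - 1] == 0x0A) {
            end--;
            if (end > 0 && data[end - 1] == 0x0D) {
                end--;
            }
        } else if (end > 0 && data[end - 1] == 0x0D) {
            end--;
        }
        if (end < EOF_MARKER.length) {
            return false;
        }
        // The five bytes just before the stripped delimiter must be %%EOF.
        for (int i = 0; i < EOF_MARKER.length; i++) {
            if (data[end - EOF_MARKER.length + i] != EOF_MARKER[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Under this stricter reading, the "%%EOF" + SPACE + EOL ending accepted by the earlier answers is reported as not conforming, so whether to use it depends on how tolerant the check should be.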

score 3 · Answer 10 · answered May 08 '20 at 11:26

Relying on magic numbers does not really appeal to me. I ended up using a preflight library from Apache for this:

compile group: 'org.apache.pdfbox', name: 'preflight', version: '2.0.19'

private boolean isPdf(InputStream fileInputStream) {
    try {
        PreflightParser preflightParser = new PreflightParser(new ByteArrayDataSource(fileInputStream));
        preflightParser.parse();
        return true;
    } catch (Exception e) {
        return false;
    }
}

PreflightParser has constructors for files and other data sources.

Mr. Polywhirl · Answer 11 · 2019-03-07T16:14:47.123

Here is a method that checks for the presence of %%EOF with optional checks for white-space characters. You can pass in either a File or a byte[] object. There is less restriction for white-space characters in some PDF versions.

public boolean isPdf(byte[] data) {
    if (data == null || data.length < 5) return false;
    // %PDF-
    if (data[0] == 0x25 && data[1] == 0x50 && data[2] == 0x44 && data[3] == 0x46 && data[4] == 0x2D) {
        int offset = Math.max(0, data.length - 8), count = 0; // check last 8 bytes for %%EOF with optional white-space
        boolean hasSpace = false, hasCr = false, hasLf = false;
        while (offset < data.length) {
            // else-if so that a single '%' byte cannot advance the counter twice
            if (count == 0 && data[offset] == 0x25) count++; // %
            else if (count == 1 && data[offset] == 0x25) count++; // %
            else if (count == 2 && data[offset] == 0x45) count++; // E
            else if (count == 3 && data[offset] == 0x4F) count++; // O
            else if (count == 4 && data[offset] == 0x46) count++; // F
            // Optional flags for meta info
            if (count == 5 && data[offset] == 0x20) hasSpace = true; // SPACE
            if (count == 5 && data[offset] == 0x0D) hasCr    = true; // CR
            if (count == 5 && data[offset] == 0x0A) hasLf    = true; // LF / EOL
            offset++;
        }

        if (count == 5) {
            String version = data.length > 13 ? String.format("%s%s%s", (char) data[5], (char) data[6], (char) data[7]) : "?";
            System.out.printf("Version : %s | Space : %b | CR : %b | LF : %b%n", version, hasSpace, hasCr, hasLf);
            return true;
        }
    }

    return false;
}

public boolean isPdf(File file) throws IOException {
    return isPdf(file, false);
}

// With version: 16 bytes, without version: 13 bytes.
public boolean isPdf(File file, boolean includeVersion) throws IOException {
    if (file == null) return false;
    int offsetStart = includeVersion ? 8 : 5, offsetEnd = 8;
    byte[] bytes = new byte[offsetStart + offsetEnd];
    InputStream is = new FileInputStream(file);
    try {
        is.read(bytes, 0, offsetStart); // %PDF-
        is.skip(file.length() - bytes.length); // Skip bytes
        is.read(bytes, offsetStart, offsetEnd); // %%EOF,SP?,CR?,LF?
    } finally {
        is.close();
    }
    return isPdf(bytes);
}

ISO 32000-2 has been published for quite a while now. So... *" I also tailored this to check between PDF versions 1.3 and 1.7"* - you should also allow 2.0. — mkl, Mar 07 '19 at 15:49
@mkl I removed the check for version. There may be an issue displaying the version if the format changes from `x.y`. A safer check would be to look between the percentage signs e.g. `%xx.yyy%`. — Mr. Polywhirl, Mar 07 '19 at 16:16
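
If the version is wanted without hard-coding 1.3 through 1.7, it can be parsed from the header rather than compared byte by byte. This is a rough sketch that assumes the header has the form `%PDF-<digits>.<digits>` at the very start of the data; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any answer in this thread.
final class PdfVersionReader {
    // Matches a header such as "%PDF-1.7" or "%PDF-2.0".
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d+)\\.(\\d+)");

    /** Returns the version text (for example "1.7"), or null if no header is found. */
    static String readVersion(byte[] data) {
        if (data == null) {
            return null;
        }
        int n = Math.min(data.length, 16);
        // ISO-8859-1 maps every byte to exactly one char, so decoding cannot fail.
        String head = new String(data, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        // lookingAt() anchors the match at the start of the input.
        return m.lookingAt() ? m.group(1) + "." + m.group(2) : null;
    }
}
```

A caller can then decide which versions to accept, including 2.0, instead of the check silently failing on anything newer than the code anticipated.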

score 1 · Answer 12 · answered Aug 16 '15 at 15:07

There is a very convenient and simple library for testing PDF content: https://github.com/codeborne/pdf-test

API is very simple:

import java.io.File;
import org.junit.Test;
import com.codeborne.pdftest.PDF;
import static com.codeborne.pdftest.PDF.*;
import static org.junit.Assert.assertThat;

public class PDFContainsTextTest {
  @Test
  public void canAssertThatPdfContainsText() {
    PDF pdf = new PDF(new File("src/test/resources/50quickideas.pdf"));
    assertThat(pdf, containsText("50 Quick Ideas to Improve your User Stories"));
  }
}

Why the downvote? This does answer the question. Maybe this solution is not as robust as the other answers but then the others should be upvoted more, no? — ssimm, Oct 02 '18 at 07:25

score 0 · Answer 13 · answered Jan 12 '18 at 07:41

In general, a PDF of any version should end with %%EOF, so we can check as shown below.

public static boolean is_pdf(byte[] data) {
    if (data != null && data.length > 4 &&
            data[0] == 0x25 && // %
            data[1] == 0x50 && // P
            data[2] == 0x44 && // D
            data[3] == 0x46 && // F
            data[4] == 0x2D) { // -

        // Decode only after the null check, and take offsets from the
        // String's length, which can differ from the byte array's length.
        String s = new String(data);
        String d = s.substring(Math.max(0, s.length() - 7), s.length() - 1);
        if (d.contains("%%EOF")) {
            return true;
        }
    }
    return false;
}

This code is smart but bugged: if you use the length of the byte array to calculate offsets in substring, you could run into an out-of-bounds exception, because the length of the String is not the same. So you have to use: `String d = s.substring(s.length() - 7, s.length() - 1);` — Izerlotti, Jul 26 '23 at 13:43
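
The mismatch described in this comment comes from multi-byte decoding: binary PDF content decoded with a charset such as UTF-8 can yield fewer chars than there are bytes. One way around it, sketched here with hypothetical names, is to decode only the tail of the array as ISO-8859-1, which produces exactly one char per byte.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any answer in this thread.
final class PdfTailCheck {
    /** True if "%%EOF" occurs within the last 7 bytes of data. */
    static boolean tailContainsEof(byte[] data) {
        if (data == null || data.length < 5) {
            return false;
        }
        int from = Math.max(0, data.length - 7);
        // Decode only the tail; with ISO-8859-1, byte offsets and
        // String offsets stay in step, so no index can go out of bounds.
        String tail = new String(data, from, data.length - from, StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }
}
```

Decoding seven bytes instead of the whole array also avoids building a String the size of the file.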

score 0 · Answer 14 · answered May 26 '23 at 09:43

We can directly use the method below: pass it the bytes of the file and it returns true (valid PDF) or false.

public boolean isPdf(byte[] data) {
    if (data == null || data.length < 5) return false;
    // %PDF-
    if (data[0] == 0x25 && data[1] == 0x50 && data[2] == 0x44 && data[3] == 0x46 && data[4] == 0x2D) {
        int offset = Math.max(0, data.length - 8), count = 0; // check last 8 bytes for %%EOF with optional white-space
        boolean hasSpace = false, hasCr = false, hasLf = false;
        while (offset < data.length) {
            // else-if so that a single '%' byte cannot advance the counter twice
            if (count == 0 && data[offset] == 0x25) count++; // %
            else if (count == 1 && data[offset] == 0x25) count++; // %
            else if (count == 2 && data[offset] == 0x45) count++; // E
            else if (count == 3 && data[offset] == 0x4F) count++; // O
            else if (count == 4 && data[offset] == 0x46) count++; // F
            // Optional flags for meta info
            if (count == 5 && data[offset] == 0x20) hasSpace = true; 
            if (count == 5 && data[offset] == 0x0D) hasCr    = true; 
            if (count == 5 && data[offset] == 0x0A) hasLf    = true; 
            offset++;
        }
        if (count == 5) {
            String version = data.length > 13 ? String.format("%s%s%s", (char) data[5], (char) data[6], (char) data[7]) : "?";
            System.out.printf("Version : %s | Space : %b | CR : %b | LF : %b%n", version, hasSpace, hasCr, hasLf);
            return true;
        }
    }
    return false;
}
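
Since the method only looks at the first eight and the last eight bytes, the whole file does not have to be loaded to call it. A minimal sketch of collecting just those bytes with `RandomAccessFile` follows; the class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper, not part of any answer in this thread.
final class PdfHeadTailReader {
    /** Returns the first headLen and the last tailLen bytes of the file, concatenated. */
    static byte[] readHeadAndTail(File file, int headLen, int tailLen) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long size = raf.length();
            if (size < (long) headLen + tailLen) {
                // Small file: head and tail would overlap, so return everything.
                byte[] all = new byte[(int) size];
                raf.readFully(all);
                return all;
            }
            byte[] out = new byte[headLen + tailLen];
            raf.readFully(out, 0, headLen);       // header bytes
            raf.seek(size - tailLen);             // jump straight to the tail
            raf.readFully(out, headLen, tailLen); // terminator bytes
            return out;
        }
    }
}
```

Passing the result of `readHeadAndTail(file, 8, 8)` to a byte-array check like the one above gives it a 16-byte array that still starts with the header and ends with the terminator.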
