7

The primary job of the website in question is to accept files from users and save them. Everything was fine until two months ago, when I was told to enforce a constraint to accept only PDF files.

Before that, users were in the habit of submitting various formats, from text and RTF to proper PDF.

I applied the constraint by checking the file extension. Simple, right? However, when the admin checked those files, a good 60% of them were corrupt.

I spent many sleepless nights trying to determine the cause of the corruption, then it suddenly occurred to me that the users might be submitting files that were already corrupt.

I took the previous records and determined the favourite file format of some of the users from whom we were getting corrupt files.

I changed the extension back to their favourite extension and boom, the file opened.

What I learned is that, despite our telling users in bold how to convert their files to PDF, some (many) were just changing the extension and submitting. Since the website rewards users based on the number of files submitted, the administration people are grumbling at me. Is there any way I can check whether a file is a PDF without relying on the extension?

I am using the FileUpload control in C# 3.5, ASP.NET.

Jakob Bowyer
Ratna
  • Look at the POST mimetype. – Jakob Bowyer Apr 15 '13 at 11:34
  • How? I have set it to application/binary. – Ratna Apr 15 '13 at 11:36
  • 1
    There's a special character sequence at the beginning of every PDF, just check that. – Ambar Apr 15 '13 at 11:37
  • 4
    Check whether the file starts with **%PDF-** as the PDF specification requires: *The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7.* (Cf. [ISO-32000-1:2008](http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf) section 7.5.2.) – mkl Apr 15 '13 at 11:38
  • @mkl Can you provide some code to do that? – Ratna Apr 15 '13 at 11:39
  • Reading the first few bytes of a file should not be too difficult. I'm not actively programming .Net languages, though. – mkl Apr 15 '13 at 11:41
  • You can read the file using a `StreamReader` object in C#. – Ketan Modi Apr 15 '13 at 11:45
  • [Here is a link which may help you read the file](http://support.microsoft.com/kb/323246) – शेखर Apr 15 '13 at 11:47

2 Answers

22

As all PDF files start with the ASCII string "%PDF-", simply test the first few bytes of the file to ensure that they start with this string.

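```csharp
using System.IO;
using System.Linq;
using System.Text;

bool IsPdf(string path)
{
    // Every PDF must begin with these five ASCII bytes.
    var pdfString = "%PDF-";
    var pdfBytes = Encoding.ASCII.GetBytes(pdfString);
    var len = pdfBytes.Length;
    var buf = new byte[len];
    var remaining = len;
    var pos = 0;
    using (var f = File.OpenRead(path))
    {
        // Stream.Read may return fewer bytes than requested, so keep
        // reading until the buffer is full or the file ends.
        while (remaining > 0)
        {
            var amtRead = f.Read(buf, pos, remaining);
            if (amtRead == 0) return false; // file is shorter than the header
            remaining -= amtRead;
            pos += amtRead;
        }
    }
    // Compare the bytes read against the expected header.
    return pdfBytes.SequenceEqual(buf);
}
```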
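
Since the question is about an ASP.NET upload, it may be handy to run the same check on a stream rather than on a path, so the file can be rejected before it is saved. The following is only a sketch of that idea: the class name `PdfSniffer` and the method name `StartsWithPdfHeader` are made up for illustration, and the logic is the same read loop as `IsPdf` above.

```csharp
using System.IO;
using System.Linq;
using System.Text;

public static class PdfSniffer
{
    // The 5-byte header "%PDF-" that a PDF file is expected to start with.
    private static readonly byte[] PdfHeader = Encoding.ASCII.GetBytes("%PDF-");

    // Returns true if the next bytes of the stream are "%PDF-".
    // Reads from the stream's current position; the caller owns the stream.
    public static bool StartsWithPdfHeader(Stream stream)
    {
        var buf = new byte[PdfHeader.Length];
        var pos = 0;
        while (pos < buf.Length)
        {
            // Read may return fewer bytes than requested, so loop until full.
            var amtRead = stream.Read(buf, pos, buf.Length - pos);
            if (amtRead == 0) return false; // stream ended before 5 bytes
            pos += amtRead;
        }
        return buf.SequenceEqual(PdfHeader);
    }
}
```

With a `FileUpload` control (here assumed to be named `FileUpload1`), this could be called as `PdfSniffer.StartsWithPdfHeader(FileUpload1.PostedFile.InputStream)`. The check consumes the first bytes of the stream, so if the same stream is read again afterwards, its `Position` probably needs to be reset to 0 first.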
spender
  • Thanks man, it was easy; with a little modification to your code it worked. Thanks again. – Ratna Apr 15 '13 at 12:00
  • Two comments on this. First of all, while the current PDF specification is rather strict about this, the older ones were not so strict. Adobe Acrobat used to (not sure about current version) accept any file that has the %PDF- string in the first 1024 bytes of the file (and accept what preceded it as rubbish). Secondly, under this assumption a simple text file starting with the text "%PDF-" would be accepted as a valid PDF file. I hope your file submitters aren't very smart :) – David van Driessche Apr 15 '13 at 22:10
  • 1
    This is a convoluted solution for something as simple as reading and comparing 5 bytes. – Mustafa Ozturk Jan 16 '19 at 19:54
  • 1
    I've just rejected an edit to this question that ignored the return value of read in an attempt to "simplify" this code. Return value of Stream.Read should never be ignored – spender Jan 16 '19 at 20:35
  • @MustafaOzturk Feel free to contribute an answer if you feel there is a more efficient means to achieve this. I'll happily vote it up. – spender Jan 25 '19 at 12:56
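
If the more lenient behaviour described in the comments is wanted (accepting a file when `%PDF-` appears anywhere in the first 1024 bytes), a check along the following lines might work. This is a sketch under assumptions: the names `LenientPdfSniffer` and `HasPdfHeaderInFirstKilobyte` are invented here, and it assumes the whole five-byte marker must lie inside the first 1024 bytes. Like the strict check, it would still accept a plain text file that merely contains `%PDF-`.

```csharp
using System.IO;
using System.Text;

public static class LenientPdfSniffer
{
    private static readonly byte[] PdfHeader = Encoding.ASCII.GetBytes("%PDF-");

    // Size of the leading window that is searched for the marker.
    private const int WindowSize = 1024;

    // Returns true if "%PDF-" occurs entirely within the first 1024 bytes.
    public static bool HasPdfHeaderInFirstKilobyte(Stream stream)
    {
        var buf = new byte[WindowSize];
        var filled = 0;
        while (filled < buf.Length)
        {
            // Read may return fewer bytes than requested; 0 means end of stream.
            var amtRead = stream.Read(buf, filled, buf.Length - filled);
            if (amtRead == 0) break;
            filled += amtRead;
        }

        // Naive byte-by-byte search; fine for a window this small.
        for (var start = 0; start + PdfHeader.Length <= filled; start++)
        {
            var match = true;
            for (var i = 0; i < PdfHeader.Length; i++)
            {
                if (buf[start + i] != PdfHeader[i]) { match = false; break; }
            }
            if (match) return true;
        }
        return false;
    }
}
```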
7

I've found this site very useful in helping to determine if a file matches its extension. It's a huge list of file signatures that you can use with spender's code.
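
As a rough illustration of combining a signature list with that code, the sketch below looks up the leading bytes of a stream in a small table. The four signatures are a hand-picked set of well-known ones, not copied from the linked list, and the names `FileSignatures` and `Identify` are hypothetical. Knowing that a rejected "PDF" is really RTF could also help explain the problem to the submitter.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class FileSignatures
{
    // A few well-known leading byte sequences; extend as needed.
    private static readonly Dictionary<string, byte[]> Known = new Dictionary<string, byte[]>
    {
        { "pdf", new byte[] { 0x25, 0x50, 0x44, 0x46, 0x2D } },                   // "%PDF-"
        { "rtf", new byte[] { 0x7B, 0x5C, 0x72, 0x74, 0x66 } },                   // "{\rtf"
        { "png", new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A } }, // PNG signature
        { "zip", new byte[] { 0x50, 0x4B, 0x03, 0x04 } }                          // "PK" 03 04 (also .docx)
    };

    // Returns the key of the first matching signature, or null if none match.
    public static string Identify(Stream stream)
    {
        var buf = new byte[Known.Values.Max(sig => sig.Length)];
        var filled = 0;
        while (filled < buf.Length)
        {
            // Read may return fewer bytes than requested; 0 means end of stream.
            var amtRead = stream.Read(buf, filled, buf.Length - filled);
            if (amtRead == 0) break;
            filled += amtRead;
        }

        foreach (var entry in Known)
        {
            var sig = entry.Value;
            if (filled >= sig.Length && buf.Take(sig.Length).SequenceEqual(sig))
                return entry.Key;
        }
        return null;
    }
}
```

Plain text files have no signature at all, so a `null` result only means "none of the known types", not "corrupt".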

khelmar