How to determine PDF file type using iTextSharp

Question

Is there a way to determine the type of a PDF file: whether an existing PDF file is a scanned image or whether it has been created from a data file using iTextSharp and C#?

What are your criteria? How do you differentiate between PDFs from a scanner and your other type of document? Is it the number of characters printed? Is it the amount of page area covered by images? Is it the name of the program which created the PDF? iTextSharp can help you determine such values, but you have to come up with the criteria beforehand. — mkl, Nov 17 '12 at 00:20
"How do you differentiate between PDFs from a scanner..." - you can't even select the text — ESB, Nov 17 '12 at 16:58
Hmm, that's not necessarily the case. There are scanning solutions which do some additional OCR and then enrich the scanned PDFs with invisible but selectable text. On the other hand, it is easy to *create* a PDF *from a data file using iTextSharp and C#* without it having any selectable text. So, can I interpret your question to mean that you actually want to differentiate between PDFs with selectable text and those without? — mkl, Nov 17 '12 at 18:09
@ESB PdfTextExtractor.GetTextFromPage() may help you find out whether it contains any text or not. — Vinay, Nov 20 '12 at 12:20
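
The check suggested in the last comment could look roughly like the sketch below. It assumes iTextSharp 5.x; the class name `PdfTextProbe` and the method name `HasExtractableText` are hypothetical names chosen for illustration, while `PdfReader`, `NumberOfPages` and `PdfTextExtractor.GetTextFromPage` are the library's own.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of iTextSharp.
public static class PdfTextProbe
{
    // Returns true as soon as any page yields non-whitespace text.
    public static bool HasExtractableText(PdfReader reader)
    {
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            string text = PdfTextExtractor.GetTextFromPage(reader, page);
            if (!string.IsNullOrWhiteSpace(text))
            {
                return true;
            }
        }
        return false;
    }

    // Convenience overload that opens the file and always closes the reader.
    public static bool HasExtractableText(string path)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            return HasExtractableText(reader);
        }
        finally
        {
            reader.Close();
        }
    }
}
```

As noted in the comments above, a `true` result does not prove the file is not a scan, because OCR software can add invisible text to scanned pages, and a `false` result does not prove it is one.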

score 0 · Answer 1 · edited May 23 '17 at 12:22

Maybe you can add some metadata to the PDFs you create with iTextSharp.

Read/Modify PDF Metadata using iTextSharp

answered Nov 16 '12 at 23:32

AbdElRaheim

I don't create them - I get tons of them in my folder and need to determine that without opening each pdf – ESB Nov 16 '12 at 23:49

score -1 · Answer 2 · answered Nov 16 '12 at 23:28

Check the PDF Producer field under Document Properties/Advanced.

dot.net5000
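
The Producer field that this answer points to in the viewer can also be read programmatically, which may suit a folder full of files better than opening each one. The following is a minimal sketch, assuming iTextSharp 5.x, where `PdfReader.Info` is a `Dictionary<string, string>`; the class name `PdfProducerProbe` and the method name `GetProducer` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
public static class PdfProducerProbe
{
    // Returns the Producer entry of the document information dictionary,
    // or null if the PDF does not have one.
    public static string GetProducer(PdfReader reader)
    {
        string producer;
        return reader.Info.TryGetValue("Producer", out producer) ? producer : null;
    }

    // Convenience overload that opens the file and always closes the reader.
    public static string GetProducer(string path)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            return GetProducer(reader);
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Calling `PdfProducerProbe.GetProducer(file)` for each path returned by `Directory.GetFiles(folder, "*.pdf")` would cover a whole folder. The Producer string is ordinary metadata that any tool can set or omit, so it is a hint about the origin of a file rather than proof.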

coloboxp · Answer 3 · 2013-08-21T12:01:00.567

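I originally wrote this method to replace the PDF Producer, after locating the right member in the watch window for the PdfWriter object. The value is not accessible by default, so the code reaches it through reflection. As adapted here, it reads the Producer string and checks which program created the PDF: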

    // Requires: using System; using System.Linq; using System.Reflection; using iTextSharp.text.pdf;
    private static void ReplacePdfCreator(PdfWriter writer)
    {
        /*
         Warning:
         This is not an option offered as is, so I had to work around it
         by using reflection to reach the value manually.

         Alejandro
        */
        // Locate the non-public PdfDocument property of the writer.
        Type writerType = writer.GetType();
        PropertyInfo writerProperty =
            writerType.GetProperties(BindingFlags.NonPublic | BindingFlags.Static | BindingFlags.Instance)
                      .FirstOrDefault(p => p.PropertyType == typeof(PdfDocument));

        if (writerProperty != null)
        {
            PdfDocument pd = (PdfDocument)writerProperty.GetValue(writer, null);
            // Locate the non-public "info" field that holds the document information.
            Type pdType = pd.GetType();
            FieldInfo infoField =
                pdType.GetFields(BindingFlags.NonPublic | BindingFlags.Static | BindingFlags.Instance)
                      .FirstOrDefault(f => f.Name == "info");

            if (infoField != null)
            {
                PdfDocument.PdfInfo pdfInfo = (PdfDocument.PdfInfo)infoField.GetValue(pd);
                // GetAsString returns a PdfString (null if the entry is missing), not a .NET string.
                PdfString producer = pdfInfo != null ? pdfInfo.GetAsString(new PdfName("Producer")) : null;

                if (producer != null)
                {
                    string creator = producer.ToUnicodeString().ToLowerInvariant();

                    if (creator.Contains("itextsharp"))
                    {
                        // created with iTextSharp
                    }
                    else if (creator.Contains("adobe"))
                    {
                        // created with an Adobe product (Distiller, Photoshop, whatever)
                    }
                    else if (creator.Contains("pdfpro"))
                    {
                        // created with PDF Pro
                    }
                    else
                    {
                        // add your own comparisons here, for example a scanner
                        // manufacturer's software such as HP's
                    }
                }
            }
        }
    }

So where is the answer to the question? Can you explain that as well? — NREZ, Aug 16 '13 at 11:49
Sorry, I pasted it in the wrong thread. But explain what as well? In any case, you can use this code with a small adaptation to determine how the PDF was created. I have updated the code above. — coloboxp, Aug 21 '13 at 11:54
