Is there a way to determine the type of a PDF file, i.e. whether an existing PDF is a scanned image or was created from a data file, using iTextSharp and C#?
6
-
What are your criteria? How do you differentiate between PDFs from a scanner and your other type of document? Is it the number of characters printed? Is it the amount of page area covered by images? Is it the name of the program which created the PDF? iTextSharp can help you determine such values, but you have to come up with the criteria beforehand. – mkl Nov 17 '12 at 00:20
-
"How do you differentiate between PDFs from a scanner..." - you can't even select the text – ESB Nov 17 '12 at 16:58
-
Hmm, that's not necessarily the case. There are scanning solutions which do some additional OCR and then enrich the scanned PDFs with invisible but selectable text. On the other hand, it is easy to *create* a PDF *from a data file using iTextSharp and C#* without it having any selectable text. So, can I interpret your question to mean that you actually want to differentiate between PDFs with selectable text and those without? – mkl Nov 17 '12 at 18:09
-
@ESB PdfTextExtractor.GetTextFromPage() may help you find out whether a page contains any text. – Vinay Nov 20 '12 at 12:20
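The text-extraction check suggested in the comment above could look roughly like the following. This is only a sketch, assuming iTextSharp 5.x; the class name `PdfTextCheck` and the method name `HasExtractableText` are made up for illustration. As mkl points out, the result only tells you whether a file has extractable text: an OCR-enriched scan will report text, and a generated PDF without text will not.

```csharp
using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of iTextSharp.
public static class PdfTextCheck
{
    // Returns true as soon as any page yields non-whitespace text.
    public static bool HasExtractableText(string path)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            // iTextSharp page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                string text = PdfTextExtractor.GetTextFromPage(reader, page);
                if (!string.IsNullOrWhiteSpace(text))
                {
                    return true;
                }
            }
            return false;
        }
        finally
        {
            reader.Close();
        }
    }

    // Usage: pass a folder path; every *.pdf in it is checked.
    public static void Main(string[] args)
    {
        foreach (string file in Directory.GetFiles(args[0], "*.pdf"))
        {
            string verdict = HasExtractableText(file)
                ? "has extractable text"
                : "no extractable text (possibly an image-only scan)";
            Console.WriteLine("{0}: {1}", file, verdict);
        }
    }
}
```

Run over a folder, this avoids opening each PDF by hand, but the verdict is a heuristic and not proof of how the file was produced.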
3 Answers
0
Maybe you can add some metadata to the PDFs you create with iTextSharp.

AbdElRaheim
-
I don't create them - I get tons of them in my folder and need to determine that without opening each pdf – ESB Nov 16 '12 at 23:49
-1
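I originally wrote this method to replace the PDF Producer, after finding the right place in the watch window of the PdfWriter object. The "Producer" entry is not accessible by default, so the method reaches it through reflection. Adapted as below, it reads the Producer to determine which tool created the PDF: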
using System;
using System.Linq;
using System.Reflection;
using iTextSharp.text.pdf;

private static void ReplacePdfCreator(PdfWriter writer)
{
    /*
     * Warning
     *
     * iTextSharp does not offer this as an option, so I had to work around it
     * by using reflection to reach the info dictionary manually.
     *
     * Alejandro
     */
    Type writerType = writer.GetType();

    // Find the non-public property that holds the writer's PdfDocument.
    PropertyInfo writerProperty =
        writerType.GetProperties(BindingFlags.NonPublic | BindingFlags.Static | BindingFlags.Instance)
            .FirstOrDefault(p => p.PropertyType == typeof(PdfDocument));
    if (writerProperty != null)
    {
        PdfDocument pd = (PdfDocument)writerProperty.GetValue(writer);
        Type pdType = pd.GetType();

        // Find the non-public "info" field that holds the document information dictionary.
        FieldInfo infoProperty =
            pdType.GetFields(BindingFlags.NonPublic | BindingFlags.Static | BindingFlags.Instance)
                .FirstOrDefault(p => p.Name == "info");
        if (infoProperty != null)
        {
            PdfDocument.PdfInfo pdfInfo = (PdfDocument.PdfInfo)infoProperty.GetValue(pd);
            if (pdfInfo != null)
            {
                // GetAsString returns a PdfString, or null if the entry is missing.
                PdfString producer = pdfInfo.GetAsString(new PdfName("Producer"));
                string creator = producer == null
                    ? string.Empty
                    : producer.ToUnicodeString().ToLowerInvariant();
                if (creator.Contains("itextsharp"))
                {
                    // created with iTextSharp
                }
                else if (creator.Contains("adobe"))
                {
                    // created with an Adobe product (Distiller, Photoshop, whatever)
                }
                else if (creator.Contains("pdfpro"))
                {
                    // created with PDF Pro
                }
                // Add your own comparisons here, for example for a scanner
                // manufacturer's software such as HP's.
            }
        }
    }
}

coloboxp
-
So where is the answer to the question..? Can you explain that as well..? – NREZ Aug 16 '13 at 11:49
-
Sorry, I pasted it in the wrong thread. But explain what as well? In any case, with a small adaptation you can use this code to determine how the PDF was created; I have updated the code above. – coloboxp Aug 21 '13 at 11:54
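For PDFs that already exist on disk, as in the question, there is no PdfWriter to inspect. A possible variation is to read the same "Producer" entry through PdfReader instead. This is a sketch under assumptions, not the answerer's code: it assumes iTextSharp 5.x, where `PdfReader.Info` is a `Dictionary<string, string>`, and the names `ProducerCheck` and `GetProducer` are hypothetical. The Producer string is whatever the creating software chose to write, so treat it as a hint only.

```csharp
using System;
using System.IO;
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
public static class ProducerCheck
{
    // Returns the lower-cased Producer entry of the PDF's info dictionary,
    // or an empty string if the file has no such entry.
    public static string GetProducer(string path)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            string producer;
            if (reader.Info.TryGetValue("Producer", out producer) && producer != null)
            {
                return producer.ToLowerInvariant();
            }
            return string.Empty;
        }
        finally
        {
            reader.Close();
        }
    }

    // Usage: pass a folder path; the Producer of every *.pdf in it is printed.
    public static void Main(string[] args)
    {
        foreach (string file in Directory.GetFiles(args[0], "*.pdf"))
        {
            string producer = GetProducer(file);
            string verdict = producer.Contains("itextsharp")
                ? "produced by iTextSharp"
                : "other or unknown producer";
            Console.WriteLine("{0}: {1} ({2})", file, verdict, producer);
        }
    }
}
```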