I am working on PDF-to-Excel conversion using docparser, but docparser is unable to process scanned PDFs properly. So I need to separate the scanned PDFs from normal PDFs and send only the normal ones through docparser (i.e. its API call). Is there some way to identify the PDF type (scanned or normal) programmatically so that I can proceed? Please help if you know how to tackle this problem.

Amul Mittal
  • As far as I know, on the PDF level there is no difference between a "normal" and a "scanned" PDF. So you would have to do it heuristically. For example, if all pages include an image that is 90%+ of paper size, it's a pretty good bet that it's a scanned PDF. – xs0 Dec 11 '17 at 10:42
  • Some scanned pdf have a scanner brand tag in their meta-data so you can identify them by it. But if the scanner doesn't add or modify the meta-data of the created pdf I guess it would be really hard to identify it. – Ahmad Sanie Dec 11 '17 at 10:49
  • Thanks for your answers, guys. But when I used Tabula (a PDF-to-Excel tool) and uploaded a scanned PDF, it showed me a pop-up saying "the uploaded file is scanned image, it may not give correct results....". So I think there surely exists some way to identify scanned Pdfs... – Amul Mittal Dec 11 '17 at 11:51
  • *"So I think there surely exists some way to identify scanned Pdfs"* - as @xs0 indicated you have to use heuristics. E.g. pages with only image content but no text... – mkl Dec 11 '17 at 22:24
  • We just gave you two ways to do that :-) Also, if you had OCR available, you could look at its scores as another signal.. – xs0 Dec 11 '17 at 22:53
  • First of all I was wondering why this question has been downvoted? The question is good and up to standards and please mention the reason for downvoting. Secondly, **@AmulMittal** please have a look at [**this**](https://stackoverflow.com/questions/24184308/detect-if-a-pdf-is-created-from-a-scanned-document-using-ocr-pdfbox) answer. Also I recommend you to have a look at [**this**](https://docs.aspose.com/display/pdfnet/Find+whether+PDF+file+contains+images+or+text+only) –  Dec 12 '17 at 03:21

1 Answer

Finally, I found a solution to my question, though probably not a standard one. Thanks to the people who commented and provided help.

Using the PDFBox library, we can iterate over the pages of a PDF and check whether each XObject in a page's resources is an instance of an image object (PDImageXObject). Every match is counted as an image. If the number of images equals the number of pages in the PDF, we treat it as a scanned PDF.

Here is the code:

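import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

// Written against the PDFBox 2.x API (PDDocument.load).
public static String testPdf(String filename) throws IOException
{
    int imageCount = 0;
    int pageCount;
    // try-with-resources closes the document even if an exception is thrown
    try (PDDocument doc = PDDocument.load(new File(filename)))
    {
        pageCount = doc.getNumberOfPages();
        for (PDPage page : doc.getPages())
        {
            PDResources resources = page.getResources();
            if (resources == null)
            {
                continue; // page has no resources, so no images
            }
            for (COSName xObjectName : resources.getXObjectNames())
            {
                PDXObject xObject = resources.getXObject(xObjectName);
                if (xObject instanceof PDImageXObject)
                {
                    imageCount++;
                }
            }
        }
    }
    // Heuristic: as many images as pages suggests a scanned document
    if (imageCount == pageCount)
    {
        return "Scanned pdf";
    }
    return "Searchable pdf";
}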
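
The comments also suggest two per-page signals: a page with no extractable text, and an image covering 90%+ of the page. The following is a minimal sketch of how those could be combined into one decision. It is not part of PDFBox; `ScanHeuristic`, `PageStats`, `MIN_COVERAGE`, and both fields are hypothetical names, and the sketch assumes the text length and image coverage of each page have already been extracted with a PDF library.

```java
import java.util.List;

// Hypothetical helper, not part of PDFBox: classifies a document from
// per-page statistics that are assumed to have been extracted already.
public class ScanHeuristic {

    /** Per-page numbers; both fields are assumptions of this sketch. */
    public static final class PageStats {
        final int textLength;        // characters of extractable text on the page
        final double imageCoverage;  // largest image area / page area, 0.0 to 1.0

        public PageStats(int textLength, double imageCoverage) {
            this.textLength = textLength;
            this.imageCoverage = imageCoverage;
        }
    }

    // Assumed threshold, taken from the "90%+ of paper size" suggestion.
    static final double MIN_COVERAGE = 0.9;

    /** A page looks scanned if it has no text and one image fills most of it. */
    static boolean looksScanned(PageStats page) {
        return page.textLength == 0 && page.imageCoverage >= MIN_COVERAGE;
    }

    /** Returns "Scanned pdf" only if there are pages and every page looks scanned. */
    public static String classify(List<PageStats> pages) {
        if (pages.isEmpty()) {
            return "Searchable pdf"; // arbitrary choice for an empty document
        }
        for (PageStats page : pages) {
            if (!looksScanned(page)) {
                return "Searchable pdf";
            }
        }
        return "Scanned pdf";
    }
}
```

Requiring both signals on every page is a deliberately strict choice: a PDF with a single text page, or a scan that already carries an OCR text layer, is reported as searchable and would still be sent to docparser.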
Amul Mittal