
I am currently analyzing a set of PDF files. I want to know how many of them fall into these 3 categories:

  • Digitally created PDF: The text is there (copyable) and it is guaranteed to be correct, as the PDF was created directly from a source application, e.g. Word.
  • Image-only PDF: A scanned document.
  • Searchable PDF: A scanned document that was run through an OCR engine. The OCR engine put text "below" the image so that you can search / copy the content. As OCR is pretty good, this text is correct most of the time, but it is not guaranteed to be.

It is easy to identify Image-only PDFs in my domain, as every PDF contains text: if I cannot extract any text, it is image-only. But how do I know whether it is "just" a searchable PDF or a digitally created PDF?
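
A minimal sketch of that image-only check (assuming PyMuPDF as the extractor; any text-extraction library would do):

    import fitz  # pip install PyMuPDF

    def is_image_only(path):
        """True if no page of the PDF yields any extractable text."""
        with fitz.open(path) as doc:
            return all(not page.get_text().strip() for page in doc)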

By the way, it is not as simple as just looking at the producer as I have seen scanned documents where the Producer field said "Microsoft Word".

Note: As a human, it is easy. I just zoom in on the text. If I see pixels, it's "just" searchable.

Here are 3 example PDF files to test solutions:

What I tried/thought about

  • Using the creator/producer: I see "Microsoft Word" in scanned documents. Also this would be tedious.
  • Embedded fonts: You can extract embedded fonts. The idea was that a scanned document would not have embedded fonts but just use the default. The idea was wrong, as one can see with the example.
Martin Thoma
    Does this answer your question? [How to check if PDF is scanned image or contains text](https://stackoverflow.com/questions/55704218/how-to-check-if-pdf-is-scanned-image-or-contains-text) – Nathan Aug 19 '20 at 21:29
  • If there is an image as big as the page yet it has text? – dawg Aug 19 '20 at 21:30
  • @Nathan No, it doesn't. While the question seems to be the same, the answers focus on the text extraction part. I'm not interested in text extraction. I want to know if the document was OCR-ed or not. – Martin Thoma Aug 20 '20 at 08:38
  • @Nathan [This answer](https://stackoverflow.com/a/61149317/562769) tries to answer my question (I think), but is a bash script instead of Python code – Martin Thoma Aug 20 '20 at 08:39
  • 1
    @MartinThoma That answer renders the PDF two times: Once with the text preserved, once with the text stripped. It then does a pairwise image comparison between the output pages. – ypnos Aug 20 '20 at 09:13
  • @Nathan At the bottom, this answer says "this solution is not able to distinguish between full-textual PDFs and scanned PDFs that also have text within them" – which is exactly my question. – Martin Thoma Aug 20 '20 at 09:22
  • @ypnos Oh, nice! That sounds like a solution that could work with mupdf: [Render Page](https://pymupdf.readthedocs.io/en/latest/faq.html#how-to-make-images-from-document-pages) - I just need to figure out how to remove the text. The image comparison should be rather easy – Martin Thoma Aug 20 '20 at 09:29
  • Instead of removing text, you could also mark it. – ypnos Aug 20 '20 at 09:49
  • I think this is an oversimplification: "The text is there (copyable) and it is guaranteed to be correct as it was created directly e.g. from Word". *All* PDF files are "digitally created" (`:)`) but I see why you make the distinction. However, it is a common misconception that you can always copy all text correctly from all PDFs – even when it's output from reliable software. – Jongware Aug 20 '20 at 15:58
  • @usr2564301 Yes, it is an oversimplification. But I think the point is pretty clear :-) – Martin Thoma Aug 20 '20 at 16:16

3 Answers


With PyMuPDF you can easily remove all text, as required for @ypnos' suggestion.
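
For example, a sketch of that render-and-compare idea (assumptions: PyMuPDF's `fill=False` leaves the page's appearance untouched and `fitz.PDF_REDACT_IMAGE_NONE` preserves images; redactions modify the document, so run this on a throw-away copy):

    import fitz  # pip install PyMuPDF

    def text_is_invisible(page):
        """Render the page, strip all text via a full-page redaction,
        render again and compare pixels. Identical output means the text
        never contributed to the rendering, i.e. it is hidden OCR text."""
        before = page.get_pixmap().samples
        page.add_redact_annot(page.rect, fill=False)  # mark the whole page
        page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)  # drop text, keep images
        after = page.get_pixmap().samples
        return before == after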

As an alternative, with PyMuPDF you can also check whether text is hidden in a PDF. In PDF's relevant "mini-language" this is triggered by the operator 3 Tr ("text render mode", see e.g. page 402 of https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf). So if all text is under the influence of this operator, none of it will be rendered, allowing the conclusion "this is an OCR'ed page".
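
A crude sketch of that check (assumption: scanning the decompressed content stream returned by `page.read_contents()` with a regex is enough; text inside Form XObjects is not covered, and an unrelated `3 Tr` token would fool it):

    import re
    import fitz  # pip install PyMuPDF

    def all_text_invisible(page):
        """True if every text render mode set on the page is 3 ("invisible")."""
        contents = page.read_contents()  # concatenated, decompressed content streams
        modes = re.findall(rb"(\d)\s+Tr\b", contents)
        return bool(modes) and set(modes) == {b"3"}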

Jorj McKie

I modified this answer from How to check if PDF is scanned image or contains text.

In this solution you don't have to render the PDF, so I would guess it is faster. Basically, the answer I modified used the percentage of the page area covered by text to determine whether it is a text document or a scanned document (image).

I added similar reasoning for images, calculating the total area covered by images to get the percentage of the page they cover. If the page is mostly covered by images, you can assume it is a scanned document. You can move the threshold around to fit your document collection.

I also added logic to check page by page. This is because, at least in the document collection I have, some documents have a digitally created first page while the rest is scanned.

Modified code:

import fitz  # pip install PyMuPDF


def page_type(page):
    page_area = abs(page.rect)  # total page area

    # Sum the area of all image blocks on the page.
    img_area = 0.0
    for block in page.get_text("rawdict")["blocks"]:
        if block["type"] == 1:  # type 1 blocks are images
            bbox = block["bbox"]
            img_area += (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])  # width * height
    img_perc = img_area / page_area
    print(f"Image area proportion: {img_perc}")

    # Sum the area of all text blocks on the page.
    text_area = 0.0
    for block in page.get_text("blocks"):
        if block[6] != 0:  # skip image blocks, which "blocks" also reports
            continue
        r = fitz.Rect(block[:4])  # rectangle where the block's text appears
        text_area += abs(r)
    text_perc = text_area / page_area
    print(f"Text area proportion: {text_perc}")

    if text_perc < 0.01:  # no text: scanned
        return "Scanned"
    elif img_perc > 0.8:  # has text, but mostly covered by images: OCR'ed
        return "Searchable text"
    else:
        return "Digitally created"


doc = fitz.open(pdffilepath)  # pdffilepath: path to the PDF to classify

for page in doc:  # iterate through pages, as types can differ per page
    print(page_type(page))
  • Because you're summing image area over blocks, you need to divide by the number of text blocks to get the correct average image area. Also `page.getTextBlocks()` is deprecated and needs to be replaced. It's unclear by what. Perhaps `page.get_text('words')` or `page.get_text_blocks()`? – not2qubit Jul 20 '22 at 14:01
  • @not2qubit I don't think that is right. We want the total area covered by the text blocks in relation to the area of the page. If you divide by the number of blocks you get the average area of the blocks, which is not what we want. – Manuel Ruiz Jul 22 '22 at 14:55
  • Yes, I think I messed that comment up; it needs to be divided by the number of pages. I kept getting 250% results for multi-page documents, which is obviously wrong. – not2qubit Jul 24 '22 at 10:20
  • @not2qubit It actually works page by page, so there is no need to divide by the number of pages either. However, I had also noticed that issue with percentages higher than 100%, but couldn't figure out why. In spite of this, the code does identify the type of page correctly (searchable, scanned, digitally created), at least for my data. – Manuel Ruiz Jul 25 '22 at 14:15

You can do it with a bash script.

    #!/bin/bash

    echo "shellscript $0"
    ls --color --group-directories-first
    read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
    if [ "$ans" != "y" ]
    then
        exit
    fi

    mkdir -p scanned
    mkdir -p text
    mkdir -p "s-and-t"

    for file in *.pdf
    do
        # Heuristic: look for image and text markers in the raw PDF bytes.
        if grep -aq '/Image/' "$file"
        then
            image=true
        else
            image=false
        fi
        if grep -aq '/Text' "$file"
        then
            text=true
        else
            text=false
        fi

        if $image && $text
        then
            mv "$file" "s-and-t"
        elif $image
        then
            mv "$file" "scanned"
        elif $text
        then
            mv "$file" "text"
        else
            echo "$file undecided"
        fi
    done

ZKS