0

Aim to split them into 2 folders only. Don't want to extract text or whatever.

  • Does this answer your question? [How to check if PDF is scanned image or contains text](https://stackoverflow.com/questions/55704218/how-to-check-if-pdf-is-scanned-image-or-contains-text) – Savvas Nicolaou Apr 08 '22 at 04:23
  • Thanks @SavvasNicolaou, I found this snippet (https://stackoverflow.com/a/59421043/12307615) might work for a half pipeline. It prints out the pdf types. But how to store the PDFs into respective folder automatically? Imagine after running the code, all PDF files already split into 2 folders. – Teoh Sin Yee Apr 08 '22 at 04:26
  • To be honest I'm not sure. I haven't used python in a while...but you could try to use a loop and move each file based on searchability and filesize using import os. Unless it is something more complicated? – Savvas Nicolaou Apr 08 '22 at 04:34
  • 1
    Thanks @SavvasNicolaou. Have solved it recently. 1st, I loop through all files and check the PDF types of each of them. (Scanned-image, Non-scanned-image) Then use shutil to move the files to their respective folders. – Teoh Sin Yee Apr 12 '22 at 06:04

3 Answers3

0

You can use pdftotext library.

For this purpose you can use the following code:

import glob
import shutil
import pdftotext
from pathlib import Path

for pdf_file in glob.glob("pdf_folder/*.pdf"):

    # PDF file name
    pdf_name = Path(pdf_file).stem

    # Load PDF file
    with open(pdf_file, "rb") as f:
        pdf = pdftotext.PDF(f)

    # Check the first page of the PDF file and then move
    if pdf[0].strip() == '':
        # print('Image_based')
        shutil.move(pdf_file, f"image-based-folder/{pdf_name}.pdf")
    else:
        # print('Text_based')
        shutil.move(pdf_file, f"text-based-folder/{pdf_name}.pdf")
0

disclaimer I am the author of borb, the library used in this answer

import typing

from pathlib import Path

from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import SimpleTextExtraction

def contains_text(path_to_pdf: Path) -> bool:
    doc: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    # attempt to read the path
    with open(path_to_pdf, "rb") as fh:
        doc = PDF.loads(fh, [l])
    # check whether the document is a valid PDF
    if doc is None:
        return False
    # check the length of the text
    return len(l.get_text()[0]) > 0

Now you can simply used that method to determine whether a PDF (specified by its Path) contains text.

Joris Schellekens
  • 8,483
  • 2
  • 23
  • 54
0

There should be no truly definitive way to test.

  • A scanned PDF can include body text as OCR or comments like annotation.
  • Likewise a full bodied Novel can have an Image scanned page as a Cover.

so a truth table of has PDF construct /Font /Image can be [YY YN NY NN] thus 4 ways to split.

The scanned only should be the NY block of files and thus should be those found by

search file if NOT has /Font AND has /Image = scanned ELSE not scanned

However there are those printed as images e.g. NOT scanned that will match and those where /Image or /Font are not standard textual entries e.g. encoded

So for fastest means to split into 4 groups I used in Windows

md font & md image & md font+image & md neither
@for %f in (*.pdf) do @type "%~f" |find /n /i "Font" >nul&&move "%~f" font
@for %f in (font\*.pdf) do type "%~f" |find /i "Image" >nul&&move "%~f" "font+image"
@for %f in (*.pdf) do @type "%~f" |find /n /i "Image" >nul&&move "%~f" image
@for %f in (*.pdf) do @move "%~f" neither

Test Result (Total 45)

  • neither = 2
  • font = 7
  • font+image = 35
  • image = 1

but that is good enough to just retest the 1 visually for any fail (it was a printout of "text" using print as image only).

K J
  • 8,045
  • 3
  • 14
  • 36