Aim to split them into 2 folders only. Don't want to extract text or whatever.
-
Does this answer your question? [How to check if PDF is scanned image or contains text](https://stackoverflow.com/questions/55704218/how-to-check-if-pdf-is-scanned-image-or-contains-text) – Savvas Nicolaou Apr 08 '22 at 04:23
-
Thanks @SavvasNicolaou, I found this snippet (https://stackoverflow.com/a/59421043/12307615) might work for a half pipeline. It prints out the pdf types. But how to store the PDFs into respective folder automatically? Imagine after running the code, all PDF files already split into 2 folders. – Teoh Sin Yee Apr 08 '22 at 04:26
-
To be honest I'm not sure. I haven't used python in a while...but you could try to use a loop and move each file based on searchability and filesize using import os. Unless it is something more complicated? – Savvas Nicolaou Apr 08 '22 at 04:34
-
1Thanks @SavvasNicolaou. Have solved it recently. 1st, I loop through all files and check the PDF types of each of them. (Scanned-image, Non-scanned-image) Then use shutil to move the files to their respective folders. – Teoh Sin Yee Apr 12 '22 at 06:04
3 Answers
You can use pdftotext
library.
For this purpose you can use the following code:
import glob
import shutil
import pdftotext
from pathlib import Path
for pdf_file in glob.glob("pdf_folder/*.pdf"):
# PDF file name
pdf_name = Path(pdf_file).stem
# Load PDF file
with open(pdf_file, "rb") as f:
pdf = pdftotext.PDF(f)
# Check the first page of the PDF file and then move
if pdf[0].strip() == '':
# print('Image_based')
shutil.move(pdf_file, f"image-based-folder/{pdf_name}.pdf")
else:
# print('Text_based')
shutil.move(pdf_file, f"text-based-folder/{pdf_name}.pdf")

- 501
- 4
- 11
disclaimer I am the author of borb
, the library used in this answer
import typing
from pathlib import Path
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import SimpleTextExtraction
def contains_text(path_to_pdf: Path) -> bool:
doc: typing.Optional[Document] = None
l: SimpleTextExtraction = SimpleTextExtraction()
# attempt to read the path
with open(path_to_pdf, "rb") as fh:
doc = PDF.loads(fh, [l])
# check whether the document is a valid PDF
if doc is None:
return False
# check the length of the text
return len(l.get_text()[0]) > 0
Now you can simply used that method to determine whether a PDF (specified by its Path
) contains text.

- 8,483
- 2
- 23
- 54
There should be no truly definitive way to test.
- A scanned PDF can include body text as OCR or comments like annotation.
- Likewise a full bodied Novel can have an Image scanned page as a Cover.
so a truth table of has PDF construct /Font /Image can be [YY YN NY NN] thus 4 ways to split.
The scanned only should be the NY block of files and thus should be those found by
search file if NOT has /Font AND has /Image = scanned ELSE not scanned
However there are those printed as images e.g. NOT scanned that will match and those where /Image or /Font are not standard textual entries e.g. encoded
So for fastest means to split into 4 groups I used in Windows
md font & md image & md font+image & md neither
@for %f in (*.pdf) do @type "%~f" |find /n /i "Font" >nul&&move "%~f" font
@for %f in (font\*.pdf) do type "%~f" |find /i "Image" >nul&&move "%~f" "font+image"
@for %f in (*.pdf) do @type "%~f" |find /n /i "Image" >nul&&move "%~f" image
@for %f in (*.pdf) do @move "%~f" neither
Test Result (Total 45)
- neither = 2
- font = 7
- font+image = 35
- image = 1
but that is good enough to just retest the 1 visually for any fail (it was a printout of "text" using print as image only).

- 8,045
- 3
- 14
- 36