I have a lot of PDF files that are basically scanned documents, so every page is one scanned image. I want to perform OCR and extract the text from those files. I have tried pytesseract, but it does not perform OCR directly on PDF files. As a workaround, I want to extract the images from the PDF files, save them in a directory, and then run pytesseract on those images. Is there any way in Python to extract the scanned images from PDF files? Or is there any way to perform OCR directly on PDF files?

Haroon S.
1 Answer
This question has already been addressed in previous Stack Overflow posts:
- Converting PDF to images automatically
- Converting a PDF to a series of images with Python
Here is a script that may be helpful: https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html
Another method, using PythonMagick: https://www.daniweb.com/programming/software-development/threads/427722/convert-pdf-to-image-with-pythonmagick
Please check previous posts before asking a question.
EDIT:
Including a working script for future reference. The program works with Python 3.6 on Windows:
# coding=utf-8
# Extract JPGs from PDFs. Quick and dirty.
# Only finds images that are embedded in the PDF as raw JPEG streams.

# Read the whole PDF into memory as bytes.
with open("Link/To/PDF/File.pdf", "rb") as file:
    pdf = file.read()

startmark = b"\xff\xd8"  # JPEG start-of-image marker
startfix = 0
endmark = b"\xff\xd9"  # JPEG end-of-image marker
endfix = 2  # include the two end-marker bytes in the slice
i = 0
njpg = 0

while True:
    # Find the next PDF stream object.
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    # A JPEG stream has the start marker right after the "stream" keyword.
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    # The JPEG end marker sits just before the "endstream" keyword.
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend
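
To finish the workflow from the question, the extracted files can then be handed to pytesseract. What follows is only a minimal sketch and was not part of the tested script: it assumes the script above has written jpg0.jpg, jpg1.jpg, ... into the current directory, and that pytesseract, Pillow and the Tesseract binary are installed. The names `page_number` and `ocr_extracted_pages` and the `ocr` parameter are hypothetical, introduced here for illustration.

```python
import glob
import re


def page_number(path):
    """Return N from a name like 'jpg12.jpg', or -1 if the name does not match."""
    match = re.search(r"(\d+)\.jpg$", path)
    return int(match.group(1)) if match else -1


def ocr_extracted_pages(pattern="jpg*.jpg", ocr=None):
    """Run OCR over the extracted JPEGs in page order and return the joined text.

    `ocr` is a hypothetical hook: any callable that maps an image path to text.
    When it is omitted, pytesseract is used.
    """
    if ocr is None:
        # Third-party imports are deferred until OCR is actually requested.
        # Assumption: pytesseract and Pillow are installed, and the Tesseract
        # executable is on the PATH.
        import pytesseract
        from PIL import Image

        def ocr(path):
            with Image.open(path) as image:
                return pytesseract.image_to_string(image)

    # Sort numerically so that jpg10.jpg comes after jpg2.jpg.
    paths = sorted(glob.glob(pattern), key=page_number)
    return "\n".join(ocr(path) for path in paths)
```

Running `print(ocr_extracted_pages())` from the directory that holds the extracted images should print the recognized text page by page. The numeric sort matters because a plain alphabetical sort would put jpg10.jpg before jpg2.jpg.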

pragmaticprog
- I couldn't find any method that works with Python 3.6. I am using Anaconda on Windows. – Haroon S. May 26 '18 at 20:17
- I just ran the code from the comments section of the example script I linked to (https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html). I was able to get it working on my Windows machine running Python 3.6. Let me know if you are still having issues. – pragmaticprog May 26 '18 at 20:53
- Thank you for your effort. Yes, this one is working fine. Upvoting. – Haroon S. May 26 '18 at 21:42
- Sweet! Glad I could help! – pragmaticprog May 26 '18 at 22:24
- Works on Kubuntu 22.04 running Python 3.10.7. Thanks. – Jack Griffin Feb 23 '23 at 15:53