I am trying to convert a PDF to an image so that I can then run Tesseract on it. The conversion works when I run it from the command line:
magick convert a.pdf b.png
But it doesn't work when I try to do the same thing in Python:
from wand.image import Image

with Image(filename='a.pdf') as img:
    img.save(filename='sample.png')
The error I get is:
unable to read image data D:/Users/UserName/AppData/Local/Temp/magick-4908Cq41DDA5FxlX1 @ error/pnm.c/ReadPNMImage/1346
I have also installed Ghostscript, but the error is still there.
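As a sanity check on the Ghostscript install, something like the following could confirm whether the Python process can actually see a Ghostscript executable on PATH. This is only a sketch: the helper name find_ghostscript is made up for illustration, and the candidate names (gswin64c, gswin32c, gs) are the usual Ghostscript console executables, which is an assumption about how it was installed. ImageMagick hands PDFs to Ghostscript, so finding nothing here would be consistent with the ReadPNMImage error, though it would not prove that is the cause.

```python
import shutil

# Usual names of the Ghostscript console executable
# (an assumption; adjust to match the actual install).
GS_CANDIDATES = ("gswin64c", "gswin32c", "gs")


def find_ghostscript(candidates=GS_CANDIDATES):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    found = find_ghostscript()
    print(found if found else "No Ghostscript executable found on PATH")
```

If this prints a path in cmd but not from the environment where the Python script runs, the two are probably using different PATH settings.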
EDIT:
I took the code from the reply below and modified it to read all the pages. The original issue with Wand is still there; the code below uses pdf2image instead:
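import os
from pdf2image import convert_from_path

pdf_dir = "D:/Users/UserName/Desktop/scraping"
for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf"):
        # os.listdir returns bare file names, so join them with the directory
        pages = convert_from_path(os.path.join(pdf_dir, pdf_file), 300)
        pdf_name = pdf_file[:-4]
        # enumerate gives the page number directly, without searching the list
        for i, page in enumerate(pages):
            page.save("%s-page%d.jpg" % (pdf_name, i), "JPEG")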