I am trying to convert a PDF to an image so that I can process it further with Tesseract. The conversion works from the command line:

magick convert a.pdf b.png

But it doesn't work when I try to do the same in Python:

from wand.image import Image
with Image(filename='a.pdf') as img:
    img.save(filename='sample.png')

The error I get is:

unable to read image data D:/Users/UserName/AppData/Local/Temp/magick-4908Cq41DDA5FxlX1 @ error/pnm.c/ReadPNMImage/1346

I have also installed Ghostscript, but the error persists.

EDIT:

I took the code from the answer below and modified it to read all the pages. The original Wand issue is still there; the code below uses pdf2image instead:

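import os
from pdf2image import convert_from_path

pdf_dir = "D:/Users/UserName/Desktop/scraping"
for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf"):
        # Join with pdf_dir so the file is found regardless of the working directory.
        pages = convert_from_path(os.path.join(pdf_dir, pdf_file), 300)
        pdf_name = pdf_file[:-4]

        # enumerate avoids the repeated list search done by pages.index(page).
        for number, page in enumerate(pages):
            page.save("%s-page%d.jpg" % (pdf_name, number), "JPEG")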
eemamedo
  • It may be one of two things. The environment variables seen by Python may differ from those of your system, so you may need to add the full path to Ghostscript (which ImageMagick uses to process PDFs) to your PATH environment variable. Or the ImageMagick policy for PDF/PS/EPS may need to be relaxed. See https://stackoverflow.com/questions/52861946/imagemagick-not-authorized-to-convert-pdf-to-an-image/52863413#52863413. The latter is more likely the issue. – fmw42 Feb 19 '19 at 00:56
  • @fmw42 I will definitely check what you have posted, but just a note: some PDFs can be read without any issues. I checked 4 different PDFs and the Python approach worked on 2 of the 4. Edit: I checked, and I have included a path to GS. It's `C:\Program Files\gs\gs9.26\bin`. Is this the correct one, or do I need to add something else? – eemamedo Feb 19 '19 at 01:18
  • Paths should be wherever gs is installed. If it works on some PDF and not others, then it is not the gs path nor the policy. I would check to see that in the delegates.xml file for decode PS, you have it set to sDEVICE=pngalpha and not pnmraw. Also if your PDF files are CMYK and have an alpha channel, then you will need to convert to sRGB with alpha to process. Do that by `convert -density X -colorspace sRGB image.PDF ...`. Sorry I do not know the wand equivalents. Check to see if the PDFs that fail are CMYK with alpha. – fmw42 Feb 19 '19 at 01:38
  • For me this was missing ghostscript - so check that the code can in fact see it. Otherwise post a file for us to try out. – jtlz2 Jun 20 '19 at 07:48
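
The two suggestions in the comments above (check that Python can see Ghostscript, and convert to sRGB at a set density) can be sketched roughly as follows. This is a minimal sketch, not a confirmed fix for the error in the question. The executable names in `GHOSTSCRIPT_NAMES`, the function names, and the `out_pattern` naming scheme are assumptions made for illustration. The Wand part is meant to approximate `convert -density X -colorspace sRGB` and may not match it exactly.

```python
import shutil

# Executable names Ghostscript commonly installs under; this list is an
# assumption, adjust it for your installation.
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript(names=GHOSTSCRIPT_NAMES):
    """Return the full path of the first Ghostscript executable on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


def pdf_to_png_pages(pdf_path, out_pattern, resolution=300):
    """Render each page of pdf_path to PNG with Wand, converting to sRGB first.

    out_pattern is a hypothetical naming scheme such as 'sample-page%d.png'.
    """
    # Imported here so the PATH check above works even without Wand installed.
    from wand.image import Image

    # resolution plays the role of -density on the command line.
    with Image(filename=pdf_path, resolution=resolution) as pdf:
        for number, page in enumerate(pdf.sequence, start=1):
            with Image(image=page) as img:
                img.transform_colorspace("srgb")
                img.format = "png"
                img.save(filename=out_pattern % number)


if __name__ == "__main__":
    gs = find_ghostscript()
    print("Ghostscript seen by Python:", gs or "not found on PATH")
```

If `find_ghostscript()` returns `None` inside the same interpreter that runs Wand, the PATH seen by Python is the first thing to check. If it returns a path and only some PDFs fail, the delegate and colorspace suggestions are the more likely leads.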

1 Answer

Instead of using wand.image, you can use pdf2image. Install it like this:

pip install pdf2image

Here is code that loops through every page in the PDF and saves each one as a JPEG:

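import os
import tempfile
from pdf2image import convert_from_path

filename = 'target.pdf'
save_dir = 'dir'  # this folder must already exist

base_name = os.path.splitext(os.path.basename(filename))[0]

with tempfile.TemporaryDirectory() as path:
    # Omit first_page/last_page to convert every page of the PDF.
    images_from_path = convert_from_path(filename, output_folder=path)

    # Save inside the with block, while the temporary files still exist,
    # and give each page its own file name so pages do not overwrite each other.
    for number, page in enumerate(images_from_path, start=1):
        page.save(os.path.join(save_dir, '%s-page%d.jpg' % (base_name, number)), 'JPEG')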
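
For a whole folder of PDFs, the same approach could be wrapped in a function along these lines. This is only a sketch: `page_output_path`, `convert_folder`, and the `<name>-page<N>.jpg` naming scheme are hypothetical names chosen for illustration, and 300 DPI is simply the value used in the question.

```python
import os


def page_output_path(save_dir, pdf_path, page_number, ext="jpg"):
    """Build '<save_dir>/<pdf name>-page<N>.<ext>' for one page."""
    base_name = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(save_dir, "%s-page%d.%s" % (base_name, page_number, ext))


def convert_folder(pdf_dir, save_dir, dpi=300):
    """Convert every PDF in pdf_dir to one JPEG per page inside save_dir."""
    # Imported here so the naming helper can be used without pdf2image installed.
    from pdf2image import convert_from_path

    os.makedirs(save_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(pdf_dir)):
        if not name.lower().endswith(".pdf"):
            continue
        pdf_path = os.path.join(pdf_dir, name)  # full path, not just the file name
        pages = convert_from_path(pdf_path, dpi)
        for number, page in enumerate(pages, start=1):
            out_path = page_output_path(save_dir, pdf_path, number)
            page.save(out_path, "JPEG")
            written.append(out_path)
    return written
```

Passing the joined path to `convert_from_path` matters when the script is not run from inside the PDF folder, because `os.listdir` returns bare file names.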
xilpex
  • I changed it; this should work. Just change the `last_page` and `first_page` part. – xilpex Feb 18 '19 at 23:59
  • I was actually able to modify your code: `from pdf2image import convert_from_path import os pdf_dir = "D:/Users/UserName/Desktop/scraping" for pdf_file in os.listdir(pdf_dir): if pdf_file.endswith(".pdf"): pages = convert_from_path(pdf_file, 300) pdf_name = pdf_file[:-4] for page in pages: page.save("%s-page%d.jpg" % (pdf_name, pages.index(page)), "JPEG")` – eemamedo Feb 19 '19 at 00:17
  • Yes. Thank you for the tip. Still wonder why Wand throws a tantrum and doesn't want to work. But thank you for your help – eemamedo Feb 19 '19 at 00:21
  • No Problem. Good Luck! – xilpex Feb 19 '19 at 00:22