35

So the state I'm in released a bunch of data in PDF form, but to make matters worse, most (all?) of the PDFs appear to be letters typed in Office, printed/fax, and then scanned (our government at its best eh?). At first I thought I was crazy, but then I started seeing numerous pdfs that are 'tilted', like someone didn't get them on the scanner properly. So, I figured the next best thing to getting the actual text out of them, would be to turn each page into an image.

Obviously this needs to be automated, and I'd prefer to stick with Python if possible. If Ruby or Perl have some form of implementation that's just too awesome to pass up, I can go that route. I've tried pyPDF for text extraction, that obviously didn't do me much good. I've tried swftools, but the images I'm getting from that are just shy of completely unusable. It just seems like the fonts get ruined in the conversion. I also don't even really care about the image format on the way out, just as long as they're relatively lightweight, and readable.

f4nt
  • 2,661
  • 5
  • 31
  • 35
  • 5
    before you do that, contact the .gov entity that produces the files. You may very well be able to get easy access to that actual digital files. Having worked in .gov and ran into that same problem, it's usually due to antiquated legal requirements (paper signatures) and/or a lack of technical understanding (often, this stuff will bypass IT/web team where they would be able to catch it). You can also call them out on the accessibility issue as a giant JPG of a page is completely inaccessible to assistive technology. – DA. Jan 04 '10 at 20:43
  • Also, to be fair to .gov land, they often have to cater to an incredibly wide technological chasm. Alas, we still live in a time where the lowest common denominator is a paper form. – DA. Jan 04 '10 at 20:46
  • Voted to close: See http://stackoverflow.com/questions/331918/converting-a-pdf-to-a-series-of-images-with-python . – Brian Jan 04 '10 at 20:58
  • @Brian this question adds a unique variable: rotating the PDF/image. – DA. Jan 04 '10 at 21:25
  • 1
    There's another twist here: these PDFs are scans, the other question is about arbitrary PDFs. – Ned Batchelder Jan 04 '10 at 22:38
  • 1
    did you find a solution without using external libraries? I posted a novel question about this topic but I only got downvotes from stubborn and frustrated programmers.. – DaniPaniz Mar 25 '17 at 14:50
  • Possible duplicate of [Python: Extract a page from a pdf as a jpeg](https://stackoverflow.com/questions/46184239/python-extract-a-page-from-a-pdf-as-a-jpeg) – Basj May 22 '18 at 21:46

7 Answers7

15

If the PDFs are truly scanned images, then you shouldn't convert the PDF to an image, you should extract the image from the PDF. Most likely, all of the data in the PDF is essentially one giant image, wrapped in PDF verbosity to make it readable in Acrobat.

You should try the simple expedient of simply finding the image in the PDF, and copying the bytes out: Extracting JPGs from PDFs. The code there is dead simple, and there are probably dozens of reasons it won't work on your PDF files. But if it does, you'll have a quick and painless way to get the image data out of the PDF files.

Ned Batchelder
  • 364,293
  • 75
  • 561
  • 662
  • 1
    Does this only work for JPGs? Because when I tried it on my own pdf's it would immediately stop after failing to find the start. – Ivo Flipse Feb 24 '13 at 14:01
  • 1
    I spent 2 hours on truly scanned PDFs (by a Konica Bizhub copier), and tried nearly *every solution* of https://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python, but none of them worked: the copier has cropped every scanned page into dozains of small images (maybe TIFF but I don't know why, maybe for OCR purpose?). Thus `If the PDFs are truly scanned images, then you shouldn't convert the PDF to an image, you should extract the image from the PDF` didn't apply to my case. – Basj May 22 '18 at 21:46
  • this is very good but maybe exist a method for get only a specific page from pdf ..? thanks !! – Diego Avila Jun 04 '18 at 14:56
10

You could call e.g. pdftoppm from the command-line (or using Python's subprocess module) and then convert the resulting PPM files to the desired format using e.g. ImageMagick (again, using subprocess or some bindings if they exist).

Antoine P.
  • 4,181
  • 1
  • 24
  • 17
  • 1
    I would love to find a solution without using external libraries, because I need to make this conversion on a shared server that cannot install ImageMagick or Ghostscript, any ideA how? – DaniPaniz Mar 25 '17 at 14:51
  • @DaniPaniz After spending hours on this (I also first wanted to avoid external libraries) and trying lots of Python libraries, just use `pdftoppm`, it's by far the best tool like this! – Basj May 22 '18 at 21:47
  • how convert PPM files to png o jpg ? – Kiquenet Jan 02 '19 at 12:26
6

Ghostscript is ideal for converting PDF files to images. It is reliable and has many configurable options. Its also available under the GPL license or commercial license. You can call it from the command line or use its native API. For more information:

Community
  • 1
  • 1
Scott Willeke
  • 8,884
  • 1
  • 40
  • 52
1

Here's an alternative approach to turning a .pdf file into images: Use an image printer. I've successfully used the function below to "print" pdf's to jpeg images with ImagePrinter Pro. However, there are MANY image printers out there. Pick the one you like. Some of the code may need to be altered slightly based on the image printer you pick and the standard file saving format that image printer uses.

import win32api
import os

def pdf_to_jpg(pdfPath, pages):
    # print pdf using jpg printer
    # 'pages' is the number of pages in the pdf
    filepath = pdfPath.rsplit('/', 1)[0]
    filename = pdfPath.rsplit('/', 1)[1]

    #print pdf to jpg using jpg printer
    tempprinter = "ImagePrinter Pro"
    printer = '"%s"' % tempprinter
    win32api.ShellExecute(0, "printto", filename, printer,  ".",  0)

    # Add time delay to ensure pdf finishes printing to file first
    fileFound = False
    if pages > 1:
        jpgName = filename.split('.')[0] + '_' + str(pages - 1) + '.jpg'
    else:
        jpgName = filename.split('.')[0] + '.jpg'
    jpgPath = filepath + '/' + jpgName
    waitTime = 30
    for i in range(waitTime):
        if os.path.isfile(jpgPath):
            fileFound = True
            break
        else:
            time.sleep(1)

    # print Error if the file was never found
    if not fileFound:
        print "ERROR: " + jpgName + " wasn't found after " + str(waitTime)\
              + " seconds"

    return jpgPath

The resulting jpgPath variable tells you the path location of the last jpeg page of the pdf printed. If you need to get another page, you can easily add some logic to modify the path to get prior pages

Jed
  • 1,823
  • 4
  • 20
  • 52
1

in pdf_to_jpg(pdfPath)

      6     # 'pages' is the number of pages in the pdf
      7     filepath = pdfPath.rsplit('/', 1)[0]
----> 8     filename = pdfPath.rsplit('/', 1)[1]
      9 
     10     #print pdf to jpg using jpg printer

IndexError: list index out of range

Andrew_Lvov
  • 4,621
  • 2
  • 25
  • 31
1

With Wand there are now excellent imagemagick bindings for Python that make this a very easy task.

Here is the code necessary for converting a single PDF file into a sequence of PNG images:

from wand.image import Image

input_path = "name_of_file.pdf"
output_name = "name_of_outfile_{index}.png"
source = Image(filename=upload.original.path, resolution=300, width=2200)
images = source.sequence
for i in range(len(images)):
    Image(images[0]).save(filename=output_name.format(i))
shezi
  • 560
  • 3
  • 18
0

Below is a method to save a PNG image to disk:

def thumbnail(pdf_pathname):
    images = Image(filename=pdf_pathname)
    images.convert('png').save(filename="thumb.png")
Tom Russell
  • 1,015
  • 2
  • 10
  • 29