How to extract images from a scanned pdf

Question

I use Tesseract to extract text from scanned PDFs. Some of these files also contain images. Is there a way to get those images?

I prepare my scanned PDFs for Tesseract by converting them to TIFF files. But I can't find any command line tool to extract images from them, as pdfimages would do for "text" PDFs.

Any idea of a tool (or a combination of tools) that would help me do the job?

@MarkSetchell When I use pdfimages on a scanned PDF, it extracts complete pages, not just the images. I think it's a tool just for "text" PDFs, not scanned ones. — Plouf, Nov 06 '17 at 09:46
A scanned PDF usually contains one bitmap image per page, and on this bitmap image there is all the scanned content of that page. A separation of text-like from the rest usually does not happen. So when you extract the image resources from the PDF, you'll get bitmaps of the whole page contents. — mkl, Nov 06 '17 at 09:52
@MarkSetchell True. But with Tesseract, I get the text from that bitmap image once it is converted to TIFF. I'm looking for a tool to do the same with images. — Plouf, Nov 06 '17 at 09:54

score 3 · Answer 1 · answered Nov 07 '17 at 20:13

You won't be able to use Tesseract OCR for images, as that's not what it was designed to do. Best to use a tool to extract the images beforehand, and then get the text later using Tesseract.

You may get some use out of pdfimages, from Xpdf.

http://www.xpdfreader.com/pdfimages-man.html

You will need to download R, RStudio, XpdfReader, and pdftools to accomplish this. Make sure the folder containing the command line programs is listed in your PATH (under "Environment Variables" if using Windows) so that R can find them.

Then do something like this to convert the pages. The example calls pdftoppm, which renders whole pages; pdfimages takes its options in the same position, so see its documentation for the options it accepts. This just shows the syntax (specifically the part after paste0). Note the placement of the options: they have to come before the input file name:

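  # PDF to PPM: render pages 1 to 10 of every PDF in `dest` at 300 dpi
  # (`dest` is the folder that holds the PDFs)
  files <- tools::file_path_sans_ext(
    list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE))
  lapply(files, function(i) {
    # Options first, then the input PDF, then the root name for the output files
    shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 300 ", i, ".pdf", " ", i)))
  })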

You could also just use the CMD prompt and type the command below. The last argument is the root of the output file names, so it needs no extension:

pdftoppm -f 1 -l 10 -r 300 stuff.pdf stuff
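
If you would rather not install R just for the loop, the same batch can be sketched in Python. This is only an illustration under assumptions: the helper names pdftoppm_command and render_folder are made up here, and pdftoppm still has to be installed and on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, first_page=1, last_page=10, dpi=300):
    """Build the pdftoppm argument list for one PDF (hypothetical helper).

    Options go first, then the input PDF, then the output root name.
    """
    pdf_path = Path(pdf_path)
    output_root = pdf_path.with_suffix("")  # stuff.pdf -> stuff
    return [
        "pdftoppm",
        "-f", str(first_page),
        "-l", str(last_page),
        "-r", str(dpi),
        str(pdf_path),
        str(output_root),
    ]


def render_folder(folder):
    """Render every PDF in `folder`; needs pdftoppm on the PATH."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm was not found on the PATH")
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # Passing a list avoids shell quoting problems with spaces in paths
        subprocess.run(pdftoppm_command(pdf), check=True)


print(pdftoppm_command("stuff.pdf"))
```

Calling render_folder("scans") would then process every PDF in a folder named scans (a placeholder name). As with the R version, this renders whole pages; it does not separate pictures from text.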

Thanks @Mitchell but I suspect my question wasn't clear enough :) Let me try to clarify: of course I know you can't do it with Tesseract, that's why I asked for a potential command line tool that maybe doesn't exist. I tried your solution but, as mentioned before, it doesn't detect images in the PDF (or in the TIFF) but extracts the whole page as an image, which is not what I'm looking for. — Plouf, Nov 08 '17 at 10:03

score 3 · Answer 2 · answered Oct 11 '20 at 23:38

1. Extract the images using pdfimages

pdfimages mydoc.pdf images

2. Use the following extraction script:

./extractImages.py images*

Find your cut out images in a new images folder. Look at what was done in the tracing folder to make sure no images were missed.

Operation

The script processes every input image and looks for shapes inside it. If a shape is found and is larger than a configurable size, it will figure out the maximum bounding box, cut out that region and save it in a new folder named images. In addition, it creates a folder named tracing, where it saves a copy of each page with all the bounding boxes drawn on it.

If you want to find smaller images, decrease minimumWidth and minimumHeight; however, if you set them too low, the script will pick up every single character.

In my tests it works extremely well; it just finds a few too many images.
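
To see why the thresholds matter, here is a small stand-alone sketch of the size test only, without OpenCV. The function name keep_large_boxes and the box coordinates are invented for illustration; in the real script the boxes come from cv2.boundingRect.

```python
def keep_large_boxes(boxes, minimum_width=100, minimum_height=100):
    """Keep only (x, y, w, h) boxes that meet both minimum dimensions."""
    return [
        (x, y, w, h)
        for (x, y, w, h) in boxes
        if w >= minimum_width and h >= minimum_height
    ]


# Hypothetical bounding boxes from one scanned page:
# two character-sized shapes and one figure-sized shape.
boxes = [(120, 80, 22, 35), (150, 80, 25, 36), (200, 600, 900, 650)]

# Default thresholds: only the figure-sized box is kept
print(keep_large_boxes(boxes))

# Thresholds set too low: the character-sized boxes are kept as well
print(keep_large_boxes(boxes, 20, 20))
```

A sensible threshold depends on the scan resolution, since the same character covers more pixels at 600 dpi than at 150 dpi.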

extractImages.py

#!/usr/bin/env python3

import cv2
import numpy as np
import os
from pathlib import Path

def extractImagesFromFile(inputFilename, outputDirectory, tracing=False, tracingDirectory=""):
    
    # Settings:
    minimumWidth = 100
    minimumHeight = 100
    greenColor = (36, 255, 12)
    traceWidth = 2
    
    # Load image, grayscale, Otsu's threshold
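    image = cv2.imread(inputFilename)
    if image is None:
        # cv2.imread returns None when the file cannot be decoded
        print("Could not read image: {}".format(inputFilename))
        return
    original = image.copy()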
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

    # Find contours, obtain bounding box, extract and save ROI
    ROI_number = 1
    cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    for c in cnts:
        x, y, w, h = cv2.boundingRect(c)
        if w >= minimumWidth and h >= minimumHeight:
            cv2.rectangle(image, (x, y), (x + w, y + h), greenColor, traceWidth)
            ROI = original[y:y+h, x:x+w]
            outImage = os.path.join(outputDirectory, '{}_{}.png'.format(Path(inputFilename).stem, ROI_number))
            cv2.imwrite(outImage, ROI)
            ROI_number += 1
    if tracing:
        outImage = os.path.join(tracingDirectory, Path(inputFilename).stem + '_trace.png')
        cv2.imwrite(outImage, image)

def main(files):

    tracingEnabled = True
    outputDirectory = 'images'
    tracingDirectory = 'tracing'

    # Create the output directory if it does not exist
    outputPath = Path.cwd() / outputDirectory
    outputPath.mkdir(exist_ok=True)

    if tracingEnabled:
        tracingPath = Path.cwd() / tracingDirectory
        tracingPath.mkdir(exist_ok=True)

    for f in files:
        print("Processing {}".format(f))
        if Path(f).is_file():
            extractImagesFromFile(f, outputDirectory, tracingEnabled, tracingDirectory)
        else:
            print("Invalid file: {}".format(f))

if __name__ == "__main__":
    import argparse
    from glob import glob
    parser = argparse.ArgumentParser()  
    parser.add_argument("fileNames", nargs='*') 
    args = parser.parse_args()  
    fileNames = list()  
    for arg in args.fileNames:  
        fileNames += glob(arg)  
    main(fileNames)

Credit

The basic algorithm was provided by nathancy as an answer to this question:

Extract all bounding boxes using OpenCV Python

JosephA · Answer 3 · 2017-11-17T04:43:31.963

In many cases, when someone has a PDF and wants to 'get' the images out, rendering the page itself to an image is satisfactory. However, if you do want to extract the images themselves, you need to be careful which tool you use, and you should investigate its reputation and the quality of its output.

The first important thing to realize is that if a tool claims to "extract the TIFF out of the PDF" or "extract the JPG out of the PDF", it is misleading you, as a PDF doesn't contain JPEG or TIFF images per se. The confusion arises because the compression technology used by those two raster image formats is also employed in PDF for compressing image data, but that's not the same thing as a JPG file simply 'living' within a PDF.

There are many tools out there, but you will find that the quality varies widely. Some handle simple PDFs well but have size limitations, or crash or hang on complex PDFs. Some handle RGB data well but skip or mishandle other color spaces. Some won't give you granular control over the data and simply extract everything and recompress it as JPEG. To top it all off, the image data is often corrupt in some way, and the technology you're using has to handle those scenarios gracefully.

If you plan on deploying this as part of an enterprise solution, you need a tool capable of handling almost any PDF you can find out there in the wild.
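
To make that concrete: inside a PDF, an image is a stream object whose dictionary names its compression in a /Filter entry, for example /DCTDecode for JPEG-style compression or /CCITTFaxDecode for the fax compression that TIFF files can also use. The sketch below is a rough heuristic, not a PDF parser. The function name image_filters and the sample bytes are invented for illustration, and it will miss inline images and unusual dictionary layouts.

```python
import re

# What the standard PDF filter names stand for
FILTER_MEANING = {
    "DCTDecode": "JPEG (DCT) compression",
    "JPXDecode": "JPEG 2000 compression",
    "CCITTFaxDecode": "CCITT Group 3/4 fax compression",
    "JBIG2Decode": "JBIG2 compression",
    "FlateDecode": "zlib/deflate compression",
    "LZWDecode": "LZW compression",
}


def image_filters(pdf_bytes):
    """Return one list of filter names per image dictionary found (heuristic)."""
    results = []
    for chunk in pdf_bytes.split(b"endobj"):
        # Look only at the dictionary, which precedes the stream keyword
        head = chunk.split(b"stream", 1)[0]
        if not re.search(rb"/Subtype\s*/Image\b", head):
            continue
        # /Filter is either a single name or an array of names
        match = re.search(rb"/Filter\s*(?:\[([^\]]*)\]|/([A-Za-z0-9]+))", head)
        if match is None:
            names = []
        elif match.group(1) is not None:
            names = re.findall(rb"/([A-Za-z0-9]+)", match.group(1))
        else:
            names = [match.group(2)]
        results.append([name.decode("ascii") for name in names])
    return results


# Made-up fragment with two image objects, as a scanned PDF might contain
sample = (
    b"4 0 obj\n<< /Type /XObject /Subtype /Image /Width 2480 /Height 3508 "
    b"/ColorSpace /DeviceGray /BitsPerComponent 8 /Filter /DCTDecode "
    b"/Length 4 >>\nstream\n....\nendstream\nendobj\n"
    b"5 0 obj\n<< /Type /XObject /Subtype /Image /Width 2480 /Height 3508 "
    b"/BitsPerComponent 1 /Filter [ /CCITTFaxDecode ] /Length 4 >>\n"
    b"stream\n....\nendstream\nendobj\n"
)

for filters in image_filters(sample):
    for name in filters:
        print(name, "->", FILTER_MEANING.get(name, "unknown filter"))
```

On a real file you would pass in the bytes read from disk, e.g. open("mydoc.pdf", "rb").read(). The point is only that a tool has to wrap the decoded or raw stream data in a file format of its own choosing; there is no ready-made .jpg or .tif file sitting inside the PDF.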