I am programming in Python, but if some tool/library exists in another language that would help me considerably, I am open to suggestions.

I have a large collection of pdf pages that live in a database, and I am trying to automate the collection of those pages to build some image recognition models with them.

These "pdfs" are actually just PNG images wrapped in a PDF container (presumably so they can be opened by PDF readers like Adobe Acrobat). I need them in an image format to feed into the image recognition pipeline. I am assuming they are PNG images because when I save one from the browser (i.e., right click and "save image as"), the resulting file is a PNG file.

After reading this question from 2010, and checking out this blog post from 2007, I've concluded that there must be a way to extract the PNG byte array from the PDF directly instead of re-rendering the PDF into a new image. Oddly, though, I couldn't find the PNG file header with:

# Python 3.6
# `file` holds the raw bytes of one PDF

header = bytes([137, 80, 78, 71, 13, 10, 26, 10])
# the resulting header looks like this: b'\x89PNG\r\n\x1a\n'
file.find(header)  # returns -1, i.e. the signature is not in the file

Does that mean that the embedded image is not in fact a PNG image?
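
One way to check how the image is actually stored is to look at the /Filter entries in the PDF's stream dictionaries rather than searching for a PNG signature. The following is only a rough sketch: `list_stream_filters` and `FILTER_MEANING` are names made up for this illustration, and the scan only sees dictionaries that are not themselves inside compressed object streams, so an empty result does not prove anything.

```python
import re

# Hypothetical lookup table: what each standard PDF stream filter implies.
FILTER_MEANING = {
    b"DCTDecode": "JPEG data, stored as-is",
    b"JPXDecode": "JPEG 2000 data, stored as-is",
    b"FlateDecode": "zlib/deflate-compressed samples",
    b"LZWDecode": "LZW-compressed samples",
    b"CCITTFaxDecode": "CCITT fax-compressed bilevel samples",
    b"JBIG2Decode": "JBIG2-compressed bilevel samples",
}

def list_stream_filters(pdf_bytes):
    """Return the set of /Filter names visible in the raw bytes of a PDF."""
    found = set()
    # Handles "/Filter /FlateDecode" as well as "/Filter [/A /B]" arrays.
    for match in re.finditer(rb"/Filter\s*(\[[^\]]*\]|/\w+)", pdf_bytes):
        found.update(re.findall(rb"/(\w+)", match.group(1)))
    return found

# Tiny hand-written dictionary standing in for a real file's bytes.
sample = b"<< /Type /XObject /Subtype /Image /Filter /FlateDecode /Length 10 >>"
for name in sorted(list_stream_filters(sample)):
    print(name.decode(), "->", FILTER_MEANING.get(name, "other filter"))
```

If the only filter that shows up is FlateDecode, the image samples are deflate-compressed inside the PDF and there is no PNG file to find; DCTDecode would instead point to an embedded JPEG.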


If there is no easy way to extract the embedded image byte array, what tool might I use to automate the conversion of each PDF file to some image format (preferably JPEG, PNG, or TIFF)?


Edit: I know tools like ImageMagick exist for format conversions, but I'd rather extract the embedded image data directly, for the sake of learning more about these file formats.

H Froedge
  • If the PDF does indeed contain a raster image, you can extract it using pdfimages. See https://en.wikipedia.org/wiki/Pdfimages. – fmw42 Jul 28 '18 at 04:38
  • *"for the sake of learning more about these file formats."* - in that case simply start with the pdf specification ISO 32000. Adobe has shared a copy of part 1 on their web site which should suffice for the start. – mkl Jul 28 '18 at 15:22
  • PDF page content streams cannot contain PNG data. How do you know the PDF pages are just images? Are all the PDF files from the same source? If so, are they all stored using the same image compression? Also, does your image recognition model prefer certain input (e.g. greyscale TIFF)? – Ryan Jul 29 '18 at 06:37
  • PNG images are not stored as-is in PDF the way JPEG files are; they are re-encoded into a specific format using the same compression and filter algorithms as the PNG file format (in fact, the PDF spec refers to the PNG spec). This means that the *data streams* of some PNGs are directly embeddable into a PDF, but not all (e.g. most PNGs with transparency). The individual meta-info parts of a PNG also have to be converted to their PDF counterparts. – gettalong Jul 30 '18 at 15:00

1 Answer

pip install pdf2image
pip install pillow
pip install numpy
pip install opencv-python

Then (note that pdf2image also needs the Poppler utilities installed on the system),

import numpy as np
import cv2
from pdf2image import convert_from_path as read

# Render the PDF; read() returns one PIL image per page
pages = read('path to the pdf file')
# First page as an RGB numpy array, usable with OpenCV or PIL
img = np.asarray(pages[0])
# OpenCV expects BGR channel order, so convert before writing
cv2.imwrite('path to save the image with the file extension', cv2.cvtColor(img, cv2.COLOR_RGB2BGR))