Read PDF as a picture

Question

I have some PDFs, and I want to read them as pictures to get all the pixel information.

So first I tried to convert the PDF into JPEG:

from pdf2image import convert_from_path
img = convert_from_path('mypdf.pdf')

This works. Next I tried to get the pixel info, but I got an error:

import matplotlib.pyplot as plt
pixel_img = plt.imread(img[0])

TypeError: Object does not appear to be a 8-bit string path or a Python file-like object

I don't understand it, since plt.imread() seems to work when I use it to read an original .jpeg. img[0] is a PIL object, so shouldn't it be a "Python file-like object"?

I also tried the PIL package (since img[0] is a PIL object) and read it with a different method, but all I get is another error:

from PIL import Image
pixel_img = Image.open(img[0])

AttributeError: 'PpmImageFile' object has no attribute 'read'

This link is not exactly what I want, because it just saves the PDF as a JPG. I don't want to save it; I just want to read it and get the pixel info.

Thanks

You can convert it to file: https://stackoverflow.com/questions/46593477/convert-pil-image-object-to-file-object — Vaibhav Vishal, Aug 26 '19 at 12:21
Possible duplicate of [Python: Extract a page from a pdf as a jpeg](https://stackoverflow.com/questions/46184239/python-extract-a-page-from-a-pdf-as-a-jpeg) — Kostas Charitidis, Aug 26 '19 at 12:21
Not exactly. Both links tell me how to convert from PDF to JPG and how to get the pixel info from a JPG. My problem is that when I try to do both things together, I get the errors — GonzaloReig, Aug 26 '19 at 12:44

Tankred · Accepted Answer · 2019-08-26T12:51:58.563

4

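convert_from_path returns a list of PIL images, one per page. They are already decoded image objects, not paths or file objects, so you must not pass them to functions that expect a file (such as plt.imread or Image.open).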
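
A minimal sketch of why the two calls in the question fail, using a blank PIL image as a stand-in for one rendered page (an assumption, so that no PDF is needed). The names page, buffer and reloaded are only illustrative. It also shows the in-memory route mentioned in the comments, in case a file-like object is really required:

```python
import io
from PIL import Image

# Stand-in for one element of the list returned by convert_from_path
# (assumption: a blank white RGB page, 40 px wide and 30 px high).
page = Image.new("RGB", (40, 30), color=(255, 255, 255))

# A PIL image holds decoded pixel data; it has no read() method,
# which is what loaders that expect a file-like object look for.
is_file_like = hasattr(page, "read")

# If a file-like object is really needed, encode the image into memory
# instead of writing it to disk.
buffer = io.BytesIO()
page.save(buffer, format="PNG")
buffer.seek(0)
reloaded = Image.open(buffer)
```

For plain pixel access this round trip should not be necessary, since the PIL image can be used directly.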

The following converts the pages of a PDF to PIL images, converts the first page/image to a numpy array (for easy access to pixels) and gets the pixel at position y=10, x=15:

from pdf2image import convert_from_path
import numpy as np

images = convert_from_path('test.pdf')

# to numpy array
image = np.array(images[0])

# get pixel at position y=10, x=15
# where pix is an array of R, G, B.
# e.g. pix[0] is the red part of the pixel
pix = image[10,15]
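
A small check of the indexing order, on a hypothetical stand-in image rather than a real PDF page: numpy indexes as [row, column], i.e. [y, x], while PIL's own methods take (x, y). The page size and colour values are made up for illustration, and the sketch assumes the page is in RGB mode (if unsure, page.convert("RGB") makes that explicit):

```python
import numpy as np
from PIL import Image

# Stand-in for images[0] (assumption): 40 px wide, 30 px high, white page.
page = Image.new("RGB", (40, 30), color=(255, 255, 255))

# Mark one pixel using PIL's (x, y) convention: x=15, y=10.
page.putpixel((15, 10), (200, 100, 50))

# The array has shape (height, width, channels).
image = np.array(page)
height, width, channels = image.shape

# numpy uses [row, column], so the same pixel is image[y, x].
pix = image[10, 15]
red, green, blue = (int(v) for v in pix)
```

With a real PDF, page would simply be images[0] from convert_from_path.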

edited Aug 26 '19 at 12:51

answered Aug 26 '19 at 12:31

Tankred


But with that I just save the file as JPG. What I want is to read it and get the pixel info, not save it as .jpg – GonzaloReig Aug 26 '19 at 12:40
images[0] is a standard PIL image. Maybe this will help you get the pixels: https://stackoverflow.com/a/11064935/5665958. Another possibility is to convert it to a numpy array, which might or might not be easier to work with (`np_image = numpy.array(images[0])`). – Tankred Aug 26 '19 at 12:42
I updated the answer to show how to get a single pixel – Tankred Aug 26 '19 at 12:52
