I would like to read all images found in a PDF file with PyMuPDF as OpenCV images, staying as close to the source as possible (avoiding format conversions that would lose precision). Basically, I would like the result to be exactly the same as if I were doing a cv2.imread(filename), in terms of the output type, color space, etc.:
# Libraries
import os
import cv2
import fitz
import numpy as np

# Input file
filename = "myfile.pdf"

# Read all images in file as a list of opencv images
def read_images(filename):
    images = []
    _, extension = os.path.splitext(filename)
    # If it's a pdf, process each embedded image
    if extension == ".pdf":
        pdf = fitz.open(filename)
        for index in range(len(pdf)):
            page = pdf[index]
            # getImageList() is named get_images() in newer PyMuPDF releases
            for im in page.getImageList():
                xref = im[0]  # cross-reference number of the embedded image
                pix = fitz.Pixmap(pdf, xref)
                images.append(pix_to_opencv_image(pix))  # DO SOMETHING HERE
    # Otherwise just do an imread
    else:
        images.append(cv2.imread(filename))
    return images
Basically, I would like to know what the function pix_to_opencv_image should be:
# Equivalent of doing a "cv2.imread" on a pdf pixmap:
def pix_to_opencv_image(pix):
    # DO SOMETHING HERE
    pass
I found examples explaining how to convert PDF pixmaps to numpy arrays, but nothing that outputs an OpenCV image.
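For context, here is a minimal sketch of the kind of numpy conversion I mean. It assumes the pixmap exposes `samples`, `width`, `height`, `n` and `alpha`, that rows are tightly packed (stride equal to `width * n`), and that the pixmap is gray or RGB. The name `pixmap_to_bgr` is hypothetical, and the fake pixmap only stands in for a real `fitz.Pixmap` so the snippet runs without a PDF. Since cv2.imread returns a uint8 numpy array in BGR channel order by default, reversing the channel axis may be most of what is needed, but I am not sure this covers every case (CMYK, for instance).

```python
import numpy as np
from types import SimpleNamespace


def pixmap_to_bgr(pix):
    """Sketch: turn a pixmap-like object into a BGR uint8 array.

    Assumes 8-bit samples, tightly packed rows, and a gray or RGB
    pixmap (optionally with alpha). Anything else raises ValueError.
    """
    # View the raw bytes as (height, width, components per pixel).
    arr = np.frombuffer(pix.samples, dtype=np.uint8)
    arr = arr.reshape(pix.height, pix.width, pix.n)

    # cv2.imread drops the alpha channel by default, so do the same here.
    if pix.alpha:
        arr = arr[:, :, :-1]

    channels = arr.shape[2]
    if channels == 1:
        # cv2.imread returns 3 identical channels for gray images by default.
        return np.repeat(arr, 3, axis=2)
    if channels == 3:
        # Reverse RGB to BGR and make the result contiguous in memory.
        return np.ascontiguousarray(arr[:, :, ::-1])
    raise ValueError(f"unsupported number of color components: {channels}")


# Stand-in for a real pixmap: one red pixel, then one green pixel (RGB).
fake = SimpleNamespace(samples=bytes([255, 0, 0, 0, 255, 0]),
                       width=2, height=1, n=3, alpha=0)
bgr = pixmap_to_bgr(fake)
```

With a real pixmap, the call would simply be `pixmap_to_bgr(pix)`. What I cannot tell is whether this matches cv2.imread for all the image types a PDF can embed.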
How can I achieve this?