I'm trying to count the number of images in each of a set of PDFs using Python and write the results to a CSV file. Ideally, the CSV would have one column for the file name and one column per page holding the number of images on that page. A column for the file name plus a column for the total number of images in the document would also suffice.
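To make the target layout concrete, here is a minimal sketch of the CSV shape described above. The helper name `write_counts`, the column names (`file`, `total`, `page_1`, ...) and the idea of padding shorter documents with empty cells are assumptions for illustration, not something taken from the tutorials referenced in this post.

```python
import csv


def write_counts(counts_by_file, csvfile):
    """Write one row per PDF: file name, total, then one column per page.

    counts_by_file maps a file name to a list of per-page image counts.
    Rows for documents with fewer pages are padded with empty cells.
    """
    # The widest document decides how many page columns the header needs.
    max_pages = max((len(counts) for counts in counts_by_file.values()), default=0)
    writer = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    header = ["file", "total"] + [f"page_{n}" for n in range(1, max_pages + 1)]
    writer.writerow(header)
    for name, counts in counts_by_file.items():
        padding = [""] * (max_pages - len(counts))
        writer.writerow([name, sum(counts)] + list(counts) + padding)
```

With made-up counts such as `{"a.pdf": [2, 0, 3], "b.pdf": [1]}`, this would give a header of `file`, `total`, `page_1`, `page_2`, `page_3`, a row for `a.pdf` with a total of 5, and a row for `b.pdf` with two empty trailing cells.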
I have tried:

    import fitz  # PyMuPDF
    import io
    from PIL import Image
    import csv

    # pdfs is a list of PDF file paths, defined elsewhere
    with open(r'output.csv', 'x', newline='', encoding='utf-8') as csvfile:
        # Declaring the writer
        propertyWriter = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
        # Writing the headers
        propertyWriter.writerow(['file', 'results', 'error'])
        for file in pdfs:
            # open the file
            pdf_file = fitz.open(file)
            # printing number of images found in this page
            if image_list:
                results = len(image_list[0])
                error = ""
                # print(results)
                # results = str(f"+ Found a total of {len(image_list)} images in page {page_index}")
            else:
                error = f"! No images found on page {page_index}"
            propertyWriter.writerow([file, results, error])
Reference: https://www.geeksforgeeks.org/how-to-extract-images-from-pdf-in-python/

However, this reports 9 images in every PDF, which isn't the case.
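A possible explanation, offered as a sketch rather than a confirmed diagnosis: in PyMuPDF, the per-page image list contains one tuple per image, and by default each tuple has 9 fields (xref, smask, width, height, bpc, colorspace, alternate colorspace, name, filter). `len(image_list[0])` therefore measures the first tuple and yields 9 whatever the page contains, whereas `len(image_list)` is the number of images. Note also that, as posted, `image_list` is not assigned inside the loop, so an old value may be reused for every file. The snippet below assumes a PyMuPDF version that provides `page.get_images()` (older releases call it `getImageList()`); the helper names are hypothetical.

```python
def count_images_per_page(doc):
    """Return a list with one image count per page of an opened document.

    Assumes each page exposes get_images(), which in PyMuPDF returns a list
    holding one tuple per image; the count is the length of that list.
    """
    return [len(page.get_images()) for page in doc]


def count_images_in_file(path):
    """Open one PDF with PyMuPDF and return its per-page image counts."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependency

    # A fresh document is opened for every path, so no stale state is reused.
    with fitz.open(path) as doc:
        return count_images_per_page(doc)
```

Summing the returned list gives the total for the document, and the list itself supplies the per-page columns.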
I have then tried:

    import fitz
    import csv

    with open(r'output.csv', 'x', newline='', encoding='utf-8') as csvfile:
        # Declaring the writer
        propertyWriter = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
        # Writing the headers
        propertyWriter.writerow(['file', 'results'])
        for file in pdfs[0:5]:
            for i in range(len(doc)):
                for img in doc.getPageImageList(i):
                    xref = img[0]
                    pix = fitz.Pixmap(doc, xref)
                    results = str(pix)
                    propertyWriter.writerow([file, results])
Reference: "Extract images from PDF without resampling, in python?"

But again, this reports the same number of images for every PDF, which is not the case.
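One thing that might account for identical results here: as posted, `doc` is never assigned inside the `for file in pdfs[0:5]` loop, so every iteration may be walking the same previously opened document. The loop also writes one row per image rather than one count per file. The sketch below opens each file afresh and counts distinct image xrefs, on the assumption that an image reused on several pages should be counted once. It assumes a PyMuPDF version with `doc.get_page_images()` (the newer name for `getPageImageList()`) and `doc.page_count`; the helper names are hypothetical.

```python
def count_unique_images(doc):
    """Count distinct image xrefs across all pages of an opened document."""
    xrefs = set()
    for page_number in range(doc.page_count):
        for img in doc.get_page_images(page_number):
            xrefs.add(img[0])  # the first tuple item is the image's xref
    return len(xrefs)


def count_unique_images_in_files(paths):
    """Return a dict mapping each PDF path to its number of distinct images."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependency

    totals = {}
    for path in paths:
        # Open a fresh document for every file instead of reusing one.
        with fitz.open(path) as doc:
            totals[path] = count_unique_images(doc)
    return totals
```

If repeated use of the same image should count every time it appears, summing the per-page list lengths would be the alternative.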