I am currently working on a Python 3.x image extractor for PDF files and can't find a solution to a problem I keep running into. My goal is to extract all images from PDF files (vehicle reports), excluding the logos of the company that issues these reports. So far I have working code using fitz (PyMuPDF) that finds the images and saves them (I found this code on the internet). Unfortunately, the images are returned in the wrong order. To annotate the pictures with their headings, they have to be saved in the order in which they appear in the PDF.
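To make the required order concrete, here is a minimal sketch of what I mean by "the order in which they appear": sorted by page, then top to bottom, then left to right. The `ImagePlacement` record, the function name `reading_order`, and the sample values are hypothetical; my current code does not produce this position information, and I am assuming coordinates whose vertical value grows downward.

```python
from typing import NamedTuple


class ImagePlacement(NamedTuple):
    """Hypothetical record describing where one image sits in the document."""
    name: str    # image name, e.g. "Im3"
    page: int    # zero-based page number
    top: float   # distance of the upper edge from the top of the page (assumed)
    left: float  # distance of the left edge from the left page border (assumed)


def reading_order(placements):
    """Sort by page first, then top to bottom, then left to right."""
    return sorted(placements, key=lambda p: (p.page, p.top, p.left))


# Made-up sample data, deliberately shuffled.
sample = [
    ImagePlacement("Im7", page=1, top=400.0, left=50.0),
    ImagePlacement("Im2", page=0, top=120.0, left=300.0),
    ImagePlacement("Im5", page=0, top=120.0, left=50.0),
    ImagePlacement("Im1", page=1, top=80.0, left=50.0),
]

ordered_names = [p.name for p in reading_order(sample)]
print(ordered_names)
```

With this sample, the two images on page 0 come first (left one before the right one), followed by the two on page 1 from top to bottom.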
I already tried to fix this by using the object names defined in the xref string (the string defining an object in the PDF) in ascending order. In an earlier version I labeled the pictures with a counter used as dict keys (I know a dict is unordered, so I sorted the keys), but about 2-4 of approximately 30 images still came out in the wrong order. That approach also doesn't seem like a good solution to me, because I 'fake' the image number by assigning a counter.
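To show what "ascending order" of the names involves, here is a small sketch under the assumption that the names follow the pattern `Im<number>`. The helper `image_number` and the sample names are made up for illustration. Whether the names reflect the visual order in the PDF at all is something I have not verified.

```python
import re


def image_number(name):
    """Return the numeric suffix of a name like 'Im12' (assumed pattern)."""
    match = re.fullmatch(r"Im(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected image name: {name!r}")
    return int(match.group(1))


names = ["Im10", "Im2", "Im1", "Im21", "Im3"]  # made-up sample names

plain_sorted = sorted(names)                      # compares character by character
numeric_sorted = sorted(names, key=image_number)  # compares the numbers

print(plain_sorted)
print(numeric_sorted)
```

A plain string sort puts `Im10` before `Im2`, so the names have to be compared by their numeric part to get a truly ascending sequence.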
My current version (xref Name):
import fitz
import sys
import re
checkXO = r"/Type(?= */XObject)" # finds "/Type/XObject"
checkIM = r"/Subtype(?= */Image)" # finds "/Subtype/Image"
doc = fitz.open(fr"{pdfpath}")
lenXREF = doc._getXrefLength() # number of objects
pixmaps = {}
imgcount=0
count=0
imglist=[]
for i in range(1, lenXREF):  # scan through all objects
    text = doc._getXrefString(i)  # string defining the object
    isXObject = re.search(checkXO, text)  # tests for XObject
    isImage = re.search(checkIM, text)  # tests for Image
    if not isXObject or not isImage:  # not an image object if not both True
        continue
    count += 1
    pix = fitz.Pixmap(doc, i)  # make pixmap from image
    name = re.search(r'Name \W(Im\d+)', text)  # image name such as "Im12", if present
    if name is not None:
        imglist.append(name.group(1))
        pixmaps[name.group(1)] = pix
    else:  # no name found: fall back to the running counter
        imglist.append(count)
        pixmaps[count] = pix
imglist1=[]
for i in range(1, doc.pageCount):
    if len(doc.getPageImageList(i)) > 1:
        for entry in doc.getPageImageList(i):
            imglist1.append(entry[7])  # entry[7] is the image name
        break
for entry in imglist1:
    pixmaps[entry].writeImage(fr"{dirpath}\%s.jpg" % (imgcount), 'jpg')
    imgcount += 1
Feel free to also suggest a completely new way to work on this task. Thanks in advance for your help.