I'm trying to run optical character recognition (OCR) on scanned-image PDFs in Python. This requires extracting the images embedded in each PDF, and this is where I've run into issues. Some of the PDFs have what looks like a header image on every page, and it gets extracted once per page. Is there a way to avoid this? I'm primarily trying to reduce the number of images I have to feed into my OCR algorithm.
Currently I extract images with the following two methods, though I'm open to using a different one (probably not textract, though: I've spent multiple hours trying to install it without success).
Method 1: Poppler's pdfimages command-line tool, called from Python via subprocess

    import os
    import subprocess

    def image_exporter(pdf_path, output_dir):
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        # -png: write PNG files; -p: include the page number in each file name
        cmd = ['pdfimages', '-png', '-p', pdf_path,
               '{}/prefix'.format(output_dir)]
        subprocess.call(cmd)
        print('Images extracted')

Method 2: Fitz/PyMuPDF

    import fitz  # PyMuPDF

    def img_extract(pdf_path, output_dir):
        # First 15 characters of the file name (Windows-style path separators)
        name_start = pdf_path.split('\\')[-1][:15]
        doc = fitz.open(pdf_path)
        for i in range(len(doc)):
            for img in doc.getPageImageList(i):
                xref = img[0]  # cross-reference number of the image object
                pix = fitz.Pixmap(doc, xref)
                name = "p%s-%s.png" % (i, xref)
                name = name_start + ' ' + name
                name = output_dir + '\\' + name
                if pix.n < 5:  # this is GRAY or RGB
                    #pix.writePNG(name)
                    pix.writeImage(name)
                else:  # CMYK: convert to RGB first
                    pix1 = fitz.Pixmap(fitz.csRGB, pix)
                    #pix1.writePNG(name)
                    pix1.writeImage(name)
                    pix1 = None
                pix = None

Both of these are essentially copies of code I found elsewhere (the question "Extract images from PDF without resampling, in python?" being one of the sources).
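To illustrate the kind of filtering I'm after, here is a minimal sketch of one possible workaround: after extraction, hash each file's bytes and keep only the first file for each distinct hash. It assumes the repeated header is written out as byte-identical files on every page, which I haven't verified for either tool. The function name `unique_images` and its `extensions` parameter are made up for this example.

```python
import hashlib
import os


def unique_images(image_dir, extensions=(".png", ".jpg", ".jpeg")):
    """Return one path per distinct image file content in image_dir.

    Hypothetical helper: files are compared by the SHA-256 hash of their
    bytes, so only byte-identical duplicates are dropped.
    """
    first_seen = {}  # hash -> first path with that content (insertion-ordered)
    for name in sorted(os.listdir(image_dir)):
        if not name.lower().endswith(extensions):
            continue  # skip anything that is not an extracted image
        path = os.path.join(image_dir, name)
        with open(path, "rb") as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        # Keep only the first file seen for each distinct content hash
        first_seen.setdefault(digest, path)
    return list(first_seen.values())
```

The returned list could then be passed to the OCR step instead of every file in the output directory. This would not catch copies of the header that differ by even one byte.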
I should also mention that I have very little understanding of the structure of a PDF document itself, so I'm seeing other odd results: images extracted with inverted colors, very blurry images, and pairs of images from the same page where image 1 contains the text with random letters missing and image 2 contains only those missing letters. So perhaps an equally valid question is whether there is a way to combine all the images on one page into a single image that I can scan with my OCR code. I'm primarily trying to avoid having to scan through huge quantities of images.