
I'm attempting to run Optical Character Recognition on scanned-image PDFs in Python. This requires extracting the images embedded in each page, and that's where I've run into issues: some of the PDFs I'm extracting from have a header-like image on every page, and it gets extracted once per page. Is there a way to avoid this? I'm primarily trying to reduce the number of images I have to feed into my OCR algorithm.

Currently I do image extraction with the following two methods, though I'm open to using a different one. (I've spent several hours trying to install textract without success, so probably not that package.)

Method 1: Poppler's pdfimages tool, called from the command line via subprocess

import os
import subprocess

def image_exporter(pdf_path, output_dir):
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    # -png forces PNG output; -p includes the page number in each filename
    cmd = ['pdfimages', '-png', '-p', pdf_path,
           '{}/prefix'.format(output_dir)]
    subprocess.call(cmd)
    print('Images extracted')

Method 2: Fitz/PyMuPDF

import os
import fitz  # PyMuPDF

def img_extract(pdf_path, output_dir):
    name_start = os.path.basename(pdf_path)[:15]
    doc = fitz.open(pdf_path)
    for i in range(len(doc)):
        # newer PyMuPDF renames this to doc.get_page_images(i)
        for img in doc.getPageImageList(i):
            xref = img[0]   # cross-reference number of the image object
            pix = fitz.Pixmap(doc, xref)
            name = "p%s-%s.png" % (i, xref)
            name = name_start + ' ' + name
            name = os.path.join(output_dir, name)
            if pix.n < 5:       # GRAY or RGB: write directly
                pix.writeImage(name)    # pix.save(name) in newer PyMuPDF
            else:               # CMYK: convert to RGB first
                pix1 = fitz.Pixmap(fitz.csRGB, pix)
                pix1.writeImage(name)
                pix1 = None     # release the converted Pixmap
            pix = None

Both of these are essentially copies of code I found elsewhere (Extract images from PDF without resampling, in python? being one of the sources).

I should also mention that I have very little understanding of the structure of a PDF document itself, so other odd things are happening too: extracted images with inverted colors, very blurred images, a page whose text is split across two images (image1 of pg_x with random letters missing, and image2 of the same page containing only those missing letters but none of the letters in image1), etc. So perhaps an equally valid question is whether there is a way to combine all the images on one page into a single image that I can scan with my OCR code. I'm primarily trying to avoid having to scan through huge quantities of images.

Evan Mata
  • Maybe creating a set of pix objects before writing any of them could work (I'm unsure whether the images are actually byte-for-byte identical, because they're technically different headers from different pages). (Edited - originally a reply to a now-deleted comment.) – Evan Mata Apr 03 '19 at 17:07
  • Note that this approach does not work. – Evan Mata Apr 03 '19 at 17:53
