4

I have a docx file which contains 6-7 images. I need to automate the extraction of images from this file. Is there a win32com MS Word API for this? Or any library that can accurately extract all the images in it?

This is what I have tried. The problem is that, first, it does not give me all the images, and second, it gives me many false positives, such as blank images, extremely small images, lines, etc. It also relies on MS Word to do the work.

from pathlib import Path
from win32com.client import Dispatch

xls = Dispatch("Excel.Application")
doc = Dispatch("Word.Application")


def export_images(fp, prefix="img_", suffix="png"):
    """Export all images (InlineShapes) in the Word file.
    :param fp: path of word file.
    :param prefix: prefix of exported images.
    :param suffix: suffix of exported images.
    """

    fp = Path(fp)
    word = doc.Documents.Open(str(fp.resolve()))
    sh = xls.Workbooks.Add()
    for idx, s in enumerate(word.InlineShapes, 1):
        s.Range.CopyAsPicture()
        d = sh.ActiveSheet.ChartObjects().Add(0, 0, s.Width, s.Height)
        d.Chart.Paste()
        # Chart.Export expects a string path, so convert the Path object
        d.Chart.Export(str(fp.parent / ("%s_%s.%s" % (prefix, idx, suffix))))
    sh.Close(False)
    word.Close(False)


export_images(r"C:\Users\HPO2KOR\Desktop\Work\venv\us2017010202.docx")

You can download the docx file here https://drive.google.com/open?id=1xdw2MieI1n3ulXlkr_iJSKb3cbozdvWq

Himanshu Poddar

4 Answers

5

A docx file is a zip archive, so you can extract the images directly from it, filtering them by file size first:

import zipfile

archive = zipfile.ZipFile('file.docx')
for file in archive.filelist:
    if file.filename.startswith('word/media/') and file.file_size > 300000:
        archive.extract(file)

In your example, 5 images were found.
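
A slightly more defensive variant of the same idea, as a minimal sketch: it writes the files flat into an output folder instead of recreating word/media/, skips non-raster parts by extension, and takes the size threshold as a parameter. The function name `extract_docx_images`, the `IMAGE_SUFFIXES` set and the `min_bytes` parameter are assumptions made for illustration, not part of any library.

```python
import zipfile
from pathlib import Path

# Hypothetical extension whitelist; adjust it for the formats you care about.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tif", ".tiff"}


def extract_docx_images(docx_path, out_dir, min_bytes=0):
    """Copy images stored under word/media/ into out_dir; return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for info in archive.infolist():
            if not info.filename.startswith("word/media/"):
                continue
            name = Path(info.filename)
            if name.suffix.lower() not in IMAGE_SUFFIXES:
                continue  # skips e.g. .emf/.wmf vector parts
            if info.file_size < min_bytes:
                continue  # skips tiny files such as lines or spacers
            target = out_dir / name.name
            target.write_bytes(archive.read(info))
            written.append(target)
    return written
```

For the document in the question, something like `extract_docx_images("file.docx", "images", min_bytes=300000)` would mirror the threshold used above; the right value depends on the document.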

Alderven
  • How do I remove those extra images? – Himanshu Poddar Feb 13 '20 at 08:30
  • You can filter them by size. See my updated answer. – Alderven Feb 13 '20 at 08:52
  • For some reason, some of the images come out with the wrong orientation. For example, in Word I'll start with a blank document. I insert some JPEGs whose source files are in portrait orientation. They remain in portrait orientation in the Word document. I save the document, and then run the above code on the Word file. The images that are extracted are now in landscape orientation (rotated 90 degrees counterclockwise from the original source file and the Word document). – JoeMjr2 Oct 28 '20 at 20:54
0

In your enumeration loop, you should probably check that the shape type is a picture:

for idx, s in enumerate(word.inlineShapes, 1):
    if s.Type != 3: # wdInlineShapePicture
        continue
    # ...
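
To also drop the very small shapes (lines, spacers) mentioned in the question, the type check can be combined with a minimum size. The following is only a sketch: the helper name `picture_shapes` and the 20-point threshold are assumptions, and the stand-in objects at the end exist only so the filter can be tried without Word installed.

```python
from types import SimpleNamespace

# WdInlineShapeType values from the Word object model.
WD_INLINE_SHAPE_PICTURE = 3
WD_INLINE_SHAPE_LINKED_PICTURE = 4


def picture_shapes(inline_shapes, min_points=20):
    """Yield inline shapes that are pictures at least min_points wide and tall."""
    for shape in inline_shapes:
        if shape.Type not in (WD_INLINE_SHAPE_PICTURE, WD_INLINE_SHAPE_LINKED_PICTURE):
            continue
        if shape.Width < min_points or shape.Height < min_points:
            continue  # likely a line, bullet, or spacer
        yield shape


# Stand-in objects that mimic the attributes read above.
fake_shapes = [
    SimpleNamespace(Type=3, Width=300, Height=200),  # ordinary picture
    SimpleNamespace(Type=3, Width=400, Height=2),    # thin line
    SimpleNamespace(Type=1, Width=300, Height=200),  # embedded OLE object
]
kept = list(picture_shapes(fake_shapes))
```

In the original loop this would be used as `for idx, s in enumerate(picture_shapes(word.InlineShapes), 1):`.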
Torben Klein
0

Here is one more approach: the docx2txt library can extract all the images.

import docx2txt
text = docx2txt.process("docx_file", r"directory where you want to store the images")

Note that it also gives all the text found in the file, in the text variable.

Himanshu Poddar
0

Extract all the images in a docx file using Python

1. Using docx2txt

import docx2txt
# extract text only
text = docx2txt.process(r"filepath_of_docx")
# extract text and write the images to the temporary image directory
text = docx2txt.process(r"filepath_of_docx", r"Temporary_Image_Directory")

2. Using aspose

import aspose.words as aw
# load the Word document
doc = aw.Document(r"filepath")
# retrieve all shapes
shapes = doc.get_child_nodes(aw.NodeType.SHAPE, True)
imageIndex = 0
# loop through shapes
for shape in shapes :
    shape = shape.as_shape()
    if (shape.has_image) :
        # set image file's name
        imageFileName = f"Image.ExportImages.{imageIndex}_{aw.FileFormatUtil.image_type_to_extension(shape.image_data.image_type)}"
        # save image
        shape.image_data.save(imageFileName)
        imageIndex += 1
dataninsight
  • Do not post [identical answers](https://stackoverflow.com/a/70065629/13138364) to multiple questions. Please customize these answers to the specific question. – tdy Nov 22 '21 at 16:31