
I've searched the documentation for python-docx and other packages, as well as Stack Overflow, but could not find how to remove all images from docx files with Python.

My exact use case: I need to convert hundreds of Word documents to a "draft" format to be viewed by clients. Those drafts should be identical to the original documents, except that all the images must be deleted or redacted.

Sorry for not including an example of things I tried; all I have to show is hours of research that turned up nothing useful. I found this question on how to extract images from Word files, but that doesn't delete them from the actual document: Extract pictures from Word and Excel with Python

From there and other sources I've found out that docx files can be read as simple zip files. I don't know whether that means it's possible to "re-zip" a file without the images and keep the docx intact, but I thought this might be a path to a solution. (Edit: simply deleting the images works, but python-docx can no longer work with the resulting file because its references to the images are now missing.)

Any ideas?

Ofer Sadan

3 Answers


If your goal is to redact images, maybe this code I used for a similar use case could be useful:

import io
import os
import zipfile

from PIL import Image, ImageFilter

blur = ImageFilter.GaussianBlur(40)

# Map file extensions to the format names Pillow expects when saving.
FORMATS = {".png": "PNG", ".jpg": "JPEG", ".jpeg": "JPEG", ".gif": "GIF"}

def redact_images(filename):
    outfile = filename.replace(".docx", "_redacted.docx")
    with zipfile.ZipFile(filename) as inzip, zipfile.ZipFile(outfile, "w") as outzip:
        for info in inzip.infolist():
            content = inzip.read(info)
            ext = os.path.splitext(info.filename)[1].lower()
            if ext in FORMATS:
                # Blur the image and re-encode it in its original format.
                img = Image.open(io.BytesIO(content))
                img = img.convert().filter(blur)
                outb = io.BytesIO()
                img.save(outb, FORMATS[ext])
                content = outb.getvalue()
            # writestr() recomputes the size and CRC for the new content.
            outzip.writestr(info, content)

Here I used PIL to blur the images in the files, but any other suitable operation could be used instead of the blur filter. This worked quite nicely for my use case.
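
To make the "any other suitable operation" point concrete, here is a minimal sketch that separates the zip rewriting from the image operation. The names `rewrite_media` and `transform` are made up for this illustration, and the sketch assumes the images live under `word/media/`, which is where Word normally stores them:

```python
import zipfile


def rewrite_media(src, dst, transform):
    """Copy a .docx archive, passing each word/media/ entry through transform.

    src and dst may be paths or file-like objects. transform(name, data)
    is a caller-supplied function that returns the replacement bytes.
    """
    with zipfile.ZipFile(src) as inzip, zipfile.ZipFile(dst, "w") as outzip:
        for info in inzip.infolist():
            data = inzip.read(info)
            if info.filename.startswith("word/media/"):
                data = transform(info.filename, data)
            # Entry names are unchanged, so references to the images stay valid.
            outzip.writestr(info, data)
```

The `transform` callback could blur the image as above, or return a blank image with the same dimensions so that the layout is preserved. It should return data in the same image format as the original entry, since the file name and extension are kept.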

mata
  • This works quite nicely actually, and will be very useful for my use case. I was hoping for something native to `docx` files which will **remove** the images (from the xml?), but your solution is creative and works for me, so I'll select it for now (if no other "native" solution pops up) – Ofer Sadan Jun 21 '18 at 11:10
  • @ofer: could you post the adaptation of the code needed to delete the image? – Romain Jouin May 04 '19 at 20:37
  • Ingenious solution, but it disturbs the aspect ratio of the image if we try to replace it. Any thoughts on how to take that into account ? – Hissaan Ali Jul 15 '22 at 18:01
  • Not sure what would cause that, since that shouldn't really be affected by applying a filter, it should still have the original size and aspect ratio. I've not had that problem as far as I can remember. – mata Jul 18 '22 at 16:20

I don't think it's currently implemented in python-docx.

Pictures in the Word Object Model are defined as either floating shapes or inline shapes. The docx documentation states that it only supports inline shapes.
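
As a rough way to check which kind of picture a given file contains, the sketch below counts both kinds directly in the document XML. It assumes that inline pictures appear as `wp:inline` elements and floating ones as `wp:anchor` elements in `word/document.xml`; the function name `count_drawings` is made up for this example. It does not look at headers, footers, or legacy VML pictures:

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the wordprocessingDrawing elements (wp:inline, wp:anchor).
WP = "{http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}"


def count_drawings(docx):
    """Return (inline, floating) drawing counts for the main document part."""
    with zipfile.ZipFile(docx) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    inline = sum(1 for _ in root.iter(WP + "inline"))
    floating = sum(1 for _ in root.iter(WP + "anchor"))
    return inline, floating
```

If the second number is non-zero, the file contains floating pictures, which the docx documentation mentioned above says are not supported.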

The Word Object Model for Inline Shapes supports a Delete() method, which should be accessible. However, it is not listed in the examples of InlineShapes and there is also a similar method for paragraphs. For paragraphs, there is an open feature request to add this functionality - which dates back to 2014! If it's not added to paragraphs it won't be available for InlineShapes as they are implemented as discrete paragraphs.

You could do this with win32com if you have a machine with Word and Python installed. This would allow you to call the Word Object Model directly, giving you access to the Delete() method. In fact you could probably cheat - rather than scrolling through the document to get each image, you can call Find and Replace to clear the image. This SO question talks about win32com find and replace:

import os
import win32com.client

# All Word files in the current directory
docs = [f for f in os.listdir('.') if f.endswith(('.doc', '.docx'))]

FromTo = {"First Name": "John",
          "Last Name": "Smith"}  # You can insert as many as you want

word = win32com.client.DispatchEx("Word.Application")
word.Visible = True  # Comment this out after testing
word.DisplayAlerts = False
for doc in docs:
    word.Documents.Open(os.path.join(os.getcwd(), doc))
    for From, To in FromTo.items():
        word.Selection.Find.Text = From
        word.Selection.Find.Replacement.Text = To
        word.Selection.Find.Execute(Replace=2, Forward=True)  # Replace=2 means "replace all"
    name, ext = doc.rsplit('.', 1)
    word.ActiveDocument.SaveAs(os.path.join(os.getcwd(), '{}_2.{}'.format(name, ext)))

word.Quit()  # releases Word object from memory

In this case, since we want to match images, we would use the special code ^g as the Find text and an empty string as the replacement:

find = word.Selection.Find
find.Text = "^g"
find.Replacement.Text = ""
find.Execute(Replace=2, Forward=True)  # Replace=2 replaces every match; Replace=1 stops after the first
Alan

I don't know this library, but looking through the documentation I found this section about images. It mentions that it is currently not possible to insert images other than inline ones. If inline images are what you have in your documents, I assume you can also retrieve them by looking in the Document object and then remove them?

The Document is explained here.

Although not a duplicate, you might also want to look at this question's answer where user "scanny" explains how he finds images using the library.

JustLudo