python-docx: iterate through paragraphs, tables and images while keeping order

Question

This is my first time posting here. I want to write a script that takes a .docx file as input and copies selected paragraphs (including tables and images), in the same order, into another template document (not at the end). The problem is that when I iterate over the elements, my code does not detect the images, so I cannot tell where an image sits relative to the text and tables, nor which image it is. In short, doc1 contains: TEXT IMAGE TEXT TABLE TEXT

and what I end up with is: TEXT [IMAGE MISSING] TEXT TABLE TEXT

What I have so far:

I can iterate over the paragraphs and tables:

from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """
    Generate a reference to each paragraph and table child within *parent*,
    in document order. Each returned value is an instance of either Table or
    Paragraph. *parent* would most commonly be a reference to a main
    Document object, but also works for a _Cell object, which itself can
    contain paragraphs and tables.
    """
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    # Direct children of the body (or cell) appear in document order.
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

I can get an ordered list of the images of a document:

pictures = []
for pic in dwo.inline_shapes:
    if pic.type == WD_INLINE_SHAPE.PICTURE:
        pictures.append(pic)

I can insert a specific image at the end of a paragraph:

def insert_picture(index, paragraph):
    inline = pictures[index]._inline
    rId = inline.xpath('./a:graphic/a:graphicData/pic:pic/pic:blipFill/a:blip/@r:embed')[0]
    image_part = dwo.part.related_parts[rId]
    image_bytes = image_part.blob
    image_stream = BytesIO(image_bytes)
    paragraph.add_run().add_picture(image_stream, Inches(6.5))
    return

I use the function iter_block_items() like this:

start_copy = False
for block in iter_block_items(document):
    if isinstance(block, Paragraph):
        if block.text == "TEXT FROM WHERE WE STOP COPYING":
            break

    if start_copy:
        if isinstance(block, Paragraph):
            last_paragraph = insert_paragraph_after(last_paragraph,block.text)

        elif isinstance(block, Table):
            paragraphs_with_table.append(last_paragraph)
            tables_to_apppend.append(block._tbl)

    if isinstance(block, Paragraph):
        if block.text == "TEXT FROM WHERE WE START COPYING":
            start_copy = True

score 3 · Answer 1 · answered Oct 22 '19 at 18:13

A working implementation that does exactly this is available at the following link:

Extracting paras, tables and images in document order

Karthick Mohanraj

This is an amazing accomplishment... the expression "silk purse out of a sow's ear" comes to mind. Unfortunately the amount of processing required really slows things down if you have to do lots of documents. Simple xml processing and extraction is probably a more practical approach, e.g. https://stackoverflow.com/a/33775294/595305 – mike rodent Apr 16 '21 at 19:24

score 0 · Accepted Answer · answered Oct 17 '18 at 14:23

I found a way to do it. It turns out the images I wanted to keep in order were already inside the paragraphs as inline shapes. I used this link to extract the images, and then inserted them using a modified version of

def insert_picture(index, paragraph):

where instead of index I would use rId.

mike rodent · Answer 3 · 2022-01-30T13:50:54.860

There are (at least) two possibilities here: either use xml (or lxml) or use a ready-made alternative Python module.

The alternative Python module (i.e. not python-docx) is docx2python. You use it like this:

from docx2python import docx2python

docx_obj = docx2python(path)  # path: location of the .docx file
body = docx_obj.body

The structure in body does indeed then contain text and tables in the correct order, which python-docx's paragraphs and tables properties do not give you directly (pretty bad flaw).

This docx2python project seems to be alive, although the author says on the above-linked page that he "won't be coding much in 2022". It seems to work OK as far as I can tell. It is important to read the notes about how tables and non-table text will be created as a structure.

At the bottom of the page there is some stuff that is well worth reading about why his version 2 is better than version 1. I haven't checked that he has indeed implemented this, but if so this means that it will in fact be superior in some ways to the alternative "pure lxml" solution below (e.g. consecutive runs and links).

There is a second way of picking apart a Word document: a Word document is in fact a .zip file, and inside there are various components. This is one way to count the paragraphs, for example.

import zipfile
from lxml import etree

WORD_SCHEMA_STRING = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
n_paras = 0  # paragraph counter
with zipfile.ZipFile(file_path) as zip_file:  # file_path: path to the .docx file
    xml_content_bytes = zip_file.read('word/document.xml')
doc_content_xml_tree_source = etree.fromstring(xml_content_bytes)
# iter(tag=etree.Element) visits element nodes only (no comments or processing instructions)
for node in doc_content_xml_tree_source.iter(tag=etree.Element):
    if node.tag == WORD_SCHEMA_STRING + 'p':
        n_paras += 1

You basically have to do a bit of exploring to see how "document.xml" is put together... and be aware that there are various other significant documents in that zip file. But using the above technique you have all the xml nodes exposed, giving you freedom to do anything you need to.

I'm not sure whether you need the external package lxml any more (i.e. rather than xml). I think I read somewhere that the speed of the latter is much improved. But I use lxml as I think it is probably still significantly faster than the standard library xml package.
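
To tie this back to the original question, here is a rough sketch of the same zip-and-XML idea that walks the body in document order and reports text, images and tables as it meets them. It uses the standard library xml.etree.ElementTree rather than lxml. The function name iter_body_in_order and the (kind, value) tuples are my own invention for illustration, and the sketch makes simplifying assumptions: empty paragraphs are skipped, table cells are flattened to plain text, and only direct rows and cells of a top-level table are read.

```python
import zipfile
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
A = '{http://schemas.openxmlformats.org/drawingml/2006/main}'
R = '{http://schemas.openxmlformats.org/officeDocument/2006/relationships}'
REL = '{http://schemas.openxmlformats.org/package/2006/relationships}'


def iter_body_in_order(docx_file):
    """Yield ('text', str), ('image', target) or ('table', rows) in document order.

    Hypothetical helper. *docx_file* is a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as zf:
        body = ET.fromstring(zf.read('word/document.xml')).find(W + 'body')
        rels = ET.fromstring(zf.read('word/_rels/document.xml.rels'))

    # Map relationship id -> target part, e.g. 'rId5' -> 'media/image1.png'.
    targets = {rel.get('Id'): rel.get('Target')
               for rel in rels.iter(REL + 'Relationship')}

    for child in body:
        if child.tag == W + 'p':
            # Walk the paragraph in order so text and pictures stay interleaved.
            text_parts = []
            for node in child.iter():
                if node.tag == W + 't' and node.text:
                    text_parts.append(node.text)
                elif node.tag == A + 'blip':
                    if text_parts:
                        yield ('text', ''.join(text_parts))
                        text_parts = []
                    yield ('image', targets.get(node.get(R + 'embed')))
            if text_parts:
                yield ('text', ''.join(text_parts))
        elif child.tag == W + 'tbl':
            # Direct rows and cells only; each cell is flattened to its text.
            rows = [[''.join(t.text or '' for t in cell.iter(W + 't'))
                     for cell in row.findall(W + 'tc')]
                    for row in child.findall(W + 'tr')]
            yield ('table', rows)
```

The image target is a path relative to the word/ folder inside the zip, so the picture bytes can be read from the same zip file under 'word/' plus that target if you need them.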
