I want to write a program that grabs my docx files, iterates through them and splits each file into multiple separate files based on headings. Inside each of the docx there are a couple of articles, each with a 'Heading 1' and text underneath it.

So if my original file1.docx has 4 articles, I want it to be split into 4 separate files each with its heading and text.

I have got as far as iterating through all of the files in the folder where I keep the .docx files, and I can read the headings and the text separately, but I can't figure out how to combine them and write each article (heading plus text) to its own file. I am using the python-docx library.

import glob
from docx import Document

headings = []
texts = []

def iter_headings(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Heading'):
            yield paragraph

def iter_text(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Normal'):
            yield paragraph

for name in glob.glob('/*.docx'):
    document = Document(name)
    for heading in iter_headings(document.paragraphs):
        headings.append(heading.text)
        for paragraph in iter_text(document.paragraphs):
            texts.append(paragraph.text)
    print(texts)

How do I extract the text and heading for each article?

This is the XML reading python-docx gives me. The red braces mark what I want to extract from each file.

https://user-images.githubusercontent.com/17858776/51575980-4dcd0200-1eac-11e9-95a8-f643f87b1f40.png

I am open to alternative suggestions for achieving this with different methods, or to hearing whether it would be easier with PDF files.

  • Questions seeking debugging help ("why isn't this code working?") must include the desired behavior, a specific problem or error and the shortest code necessary to reproduce it in the question itself. Questions without a clear problem statement are not useful to other readers. See: [mcve] – kjhughes Jan 28 '19 at 20:07
  • Please edit the question and include the code you've got so far – chickity china chinese chicken Jan 28 '19 at 20:22
  • Ok, I added the code, hope it's more clear what I meant before. Sorry about that, it's my first post. – RaspberryGUIce Jan 28 '19 at 20:36

2 Answers

I think the approach of using iterators is a sound one, but I'd be inclined to parcel them differently. At the top level you could have:

for paragraphs in iterate_document_sections(document):
    create_document_from_paragraphs(paragraphs)

Then iterate_document_sections() would look something like:

def iterate_document_sections(document):
    """Generate a sequence of paragraphs for each headed section in document.

    Each generated sequence has a heading paragraph in its first position,
    followed by zero or more body paragraphs.
    """
    # Assumes the document is not empty and starts with a heading paragraph.
    paragraphs = [document.paragraphs[0]]
    for paragraph in document.paragraphs[1:]:
        if is_heading(paragraph):
            # A new heading closes the section collected so far.
            yield paragraphs
            paragraphs = [paragraph]
            continue
        paragraphs.append(paragraph)
    # Emit the final section, which has no following heading to close it.
    yield paragraphs

Something like this combined with portions of your other code should give you something workable to start with. You'll need an implementation of is_heading() and create_document_from_paragraphs().
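As a rough starting point, the two missing helpers might look like the sketch below. It assumes articles start with the 'Heading 1' style, as described in the question. The `path` parameter is my own addition so the caller decides the output file name. Only each paragraph's text and style name are copied, so run-level formatting (bold, italics) is lost, and a style name that doesn't exist in python-docx's default template will raise a KeyError.

```python
def is_heading(paragraph, style_name="Heading 1"):
    """Return True if *paragraph* uses the heading style that starts an article."""
    return paragraph.style.name == style_name


def create_document_from_paragraphs(paragraphs, path):
    """Write *paragraphs* to a new .docx file at *path* (hypothetical helper).

    Copies only the text and the style name of each paragraph.
    """
    # Imported here so is_heading() can be used without python-docx installed.
    from docx import Document

    new_document = Document()
    for paragraph in paragraphs:
        # Assumption: the style name exists in the default template
        # ('Heading 1' and 'Normal' do).
        new_document.add_paragraph(paragraph.text, style=paragraph.style.name)
    new_document.save(path)
```

With these in place, the top-level loop could number the output files with `enumerate()`, for example `create_document_from_paragraphs(paragraphs, f"article_{i}.docx")`.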

Note that the term "section" here is used as in common publishing parlance to refer to a (section) heading and its subordinate paragraphs, and does not refer to a Word document section object (like document.sections).

scanny

In fact, the solution above works well only if the documents contain nothing but paragraphs; other elements (tables, for example) would be skipped.

Another option is to iterate over all child XML elements of the document body, not just the paragraphs. Once you have found the start and end elements of a "sub-document" (the heading paragraphs in your example), delete every XML element that does not belong to that part, in effect cutting away the rest of the document. This preserves all styles, text, tables, and other document elements and formatting. It's not an elegant solution, and it means keeping a temporary copy of the full source document in memory.

This is my code:

import tempfile
from typing import Generator, Tuple, Union

from docx import Document
from docx.document import Document as DocType
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.oxml.xmlchemy import BaseOxmlElement
from docx.text.paragraph import Paragraph


def iterparts(doc_path:str, skip_first=True, bias:int=0) -> Generator[Tuple[int,DocType],None,None]:
    """Iterate over sub-documents by splitting the source document into parts.

    Each part is made by copying the source document and cutting off the
    irrelevant data.

    Args:
        doc_path (str):                 path to source *docx* file
        skip_first (bool, optional):    skip first split point and wait for 
                                        second occurrence. Defaults to True.
        bias (int, optional):           split point bias. Defaults to 0.

    Yields:
        Generator[Tuple[int,DocType],None,None]:    first element of each tuple 
                                                    indicates the number of a 
                                                    sub-document, if number is 0 
                                                    then there are no sub-documents
    """
    doc = Document(doc_path)
    counter = 0
    while doc:
        split_elem_idx = -1
        doc_body = doc.element.body
        cutted = [doc, None]
        for idx, elem in enumerate(doc_body.iterchildren()):
            if is_split_point(elem):
                if split_elem_idx == -1 and skip_first:
                    split_elem_idx = idx
                else:
                    cutted = split(doc, idx+bias)  # bias=-1 moves the preceding element into the next part
                    counter += 1
                    break
        yield (counter, cutted[0])
        doc = cutted[1]

def is_split_point(element:BaseOxmlElement) -> bool:
    """Split criteria

    Args:
        element (BaseOxmlElement): oxml element

    Returns:
        bool: whether the *element* is the beginning of a new sub-document
    """
    if isinstance(element, CT_P):
        p = Paragraph(element, element.getparent())
        return p.text.startswith("Some text")
    return False

def split(doc:DocType, cut_idx:int) -> Tuple[DocType,DocType]:
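    """Split a document in two by copying it and cutting off the irrelevant
    data from each copy.

    Args:
        doc (DocType): source document; modified in place to become the
                       second part
        cut_idx (int): index of the first body element that belongs to the
                       second part

    Returns:
        Tuple[DocType,DocType]: (first_part, second_part)
    """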
    tmpdocfile = write_tmp_doc(doc)
    second_part = doc
    second_elems = list(second_part.element.body.iterchildren())
    for i in range(0, cut_idx):
        remove_element(second_elems[i])
    first_part = Document(tmpdocfile)
    first_elems = list(first_part.element.body.iterchildren())
    for i in range(cut_idx, len(first_elems)):
        remove_element(first_elems[i])
    tmpdocfile.close()
    return (first_part, second_part)

def remove_element(elem: Union[CT_P,CT_Tbl]):
    elem.getparent().remove(elem)

def write_tmp_doc(doc:DocType):
    tmp = tempfile.TemporaryFile()
    doc.save(tmp)
    return tmp

Note that you should define the is_split_point function according to your own split criteria.
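For the case in the question, where each article begins with a 'Heading 1' paragraph, the split criterion could be sketched as below. It inspects the element's XML directly instead of building a Paragraph object. It assumes the style ID stored in the file is "Heading1", which is typical for documents created with English-language templates; note that the style ID is not the same thing as the display name 'Heading 1'. The `style_id` parameter is my own addition.

```python
# WordprocessingML namespace used by .docx body elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def is_split_point(element, style_id="Heading1"):
    """Return True if *element* is a paragraph whose style ID is *style_id*.

    Assumption: "Heading1" is the style ID behind 'Heading 1'; check the
    document's styles if it was created from a localized template.
    """
    # Tables and other non-paragraph elements never start a sub-document.
    if element.tag != f"{{{W_NS}}}p":
        return False
    # The paragraph style, if any, is stored in <w:pPr><w:pStyle w:val="..."/>.
    p_style = element.find(f"{{{W_NS}}}pPr/{{{W_NS}}}pStyle")
    return p_style is not None and p_style.get(f"{{{W_NS}}}val") == style_id
```

Each yielded part is a regular python-docx document, so it can be written out with `part.save(...)`. When naming the output files, numbering them with `enumerate()` over `iterparts(...)` avoids relying on how the yielded counter is assigned.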

Kanarsky