I want to write a program that takes my .docx files, iterates through them, and splits each file into multiple separate files based on headings. Each .docx contains several articles, each with a 'Heading 1' and text underneath it.
So if my original file1.docx has 4 articles, I want it to be split into 4 separate files each with its heading and text.
I got as far as iterating through all of the files in the folder where I keep the .docx files, and I can read the headings and the text separately, but I can't figure out how to pair each heading with its text and write each pair to a separate file. I am using the python-docx library.
import glob
from docx import Document

headings = []
texts = []

def iter_headings(paragraphs):
    # Yield only paragraphs whose style is a heading style
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Heading'):
            yield paragraph

def iter_text(paragraphs):
    # Yield only body paragraphs (style 'Normal')
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Normal'):
            yield paragraph

for name in glob.glob('/*.docx'):
    document = Document(name)
    # Headings and body text end up in two unrelated lists
    for heading in iter_headings(document.paragraphs):
        headings.append(heading.text)
    for paragraph in iter_text(document.paragraphs):
        texts.append(paragraph.text)

print(texts)
How do I extract the text and heading for each article?
This is the XML that python-docx gives me when reading a file. The red braces mark what I want to extract from each file.
https://user-images.githubusercontent.com/17858776/51575980-4dcd0200-1eac-11e9-95a8-f643f87b1f40.png
I am open to any alternative suggestions for achieving this with a different method, or if there is an easier way to do it with PDF files.
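To make the goal concrete, here is a rough sketch of the kind of grouping I have in mind, not a tested solution. It assumes every article starts with a paragraph styled exactly 'Heading 1' and that copying plain text (without run formatting, images or tables) is acceptable. The helper names `group_by_heading` and `split_docx`, and the output naming scheme, are made up for illustration.

```python
from pathlib import Path


def group_by_heading(items, heading_style="Heading 1"):
    """Group (style_name, text) pairs into (heading, [body texts]) articles.

    Paragraphs that appear before the first heading are skipped.
    """
    articles = []
    for style_name, text in items:
        if style_name == heading_style:
            articles.append((text, []))       # a heading starts a new article
        elif articles:
            articles[-1][1].append(text)      # body text joins the latest article
    return articles


def split_docx(path, out_dir):
    """Write one new .docx per 'Heading 1' article found in `path`."""
    # python-docx is imported here so group_by_heading stays dependency-free
    from docx import Document

    source = Document(str(path))
    items = [(p.style.name, p.text) for p in source.paragraphs]

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    for index, (heading, body) in enumerate(group_by_heading(items), start=1):
        article = Document()
        article.add_heading(heading, level=1)
        for text in body:
            article.add_paragraph(text)       # plain text only, formatting is lost
        # Hypothetical naming scheme: file1_1.docx, file1_2.docx, ...
        article.save(str(out_dir / f"{Path(path).stem}_{index}.docx"))
```

Calling `split_docx(name, 'some_output_folder')` inside the existing `glob` loop would then produce one file per article. Note that in this sketch any paragraph that is not 'Heading 1' (including lower-level headings) is copied as ordinary body text.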