4

I have several .docx files that contain a number of similar blocks of text: docx files that contain 300+ press releases that are 1-2 pages each, that need to be separated into individual text files. The only consistent way to tell differences between articles is that there is always and only a page break between 2 articles.

However, I don't know how to find page breaks when converting the encompassing Word documents to text, and the page break information is lost after the conversion using my current script

I want to know how to preserve HARD page breaks when converting a .docx file to .txt. It doesn't matter to me what they look like in the text file, as long as they're uniquely identifiable when scanning the text file later

Here is the script I am using to convert the docx files to txt:

def docx2txt(file_path):
    document = opendocx(file_path)
    text_file = open("%s.txt" % file_path[:len(file_path)-5], "w")
    paratextlist = getdocumenttext(document)
    newparatextlist = []
    for paratext in paratextlist:
        newparatextlist.append(paratext.encode("utf-8"))
    text_file.write('\n\n'.join(newparatextlist))
    text_file.close()
bend87
  • 43
  • 1
  • 3
  • There is likely no way to do it with python-docx out of the box, but save a sample docx file as a .zip file (to expose the raw .xml files which the document is made of), and poke around til you see something that looks like a page break. You can then add a function to your python-docx install to handle these as it reads through the file. The python-docx source files are well documented, and its not too complicated after a bit of poking around – wnnmaw Jun 13 '14 at 16:50
  • Even if you want to make things more complicated you can save the docx as HTML then split everything on the page breaks. – Bioto Jun 13 '14 at 16:58

1 Answers1

4

A hard page break will appear as a <w:br> element within a run element (<w:r>), something like this:

<w:p>
  <w:r>
    <w:t>some text</w:t>
    <w:br w:type="page"/>
  </w:r>
</w:p>

So one approach would be to replace all those occurrences with a distinctive string of text, like maybe "{{foobar}}".

An implementation of that would be something like this:

from lxml import etree
from docx import nsprefixes

page_br_elements = document.xpath(
    "//w:p/w:r/w:br[@w:type='page']", namespaces={'w': nsprefixes['w']}
)
for br in page_br_elements:
    t = etree.Element('w:t', nsmap={'w': nsprefixes['w']})
    t.text = '{{foobar}}'
    br.addprevious(t)
    parent = br.getparent()
    parent.remove(br)

I don't have time to test this, so you might run into some missing imports or whatever, but everything you need should already be in the docx module. The rest is lxml method calls on _Element.

Let me know how you go and I can tweak this if needed.

scanny
  • 26,423
  • 5
  • 54
  • 80
  • Dear @scanny, this code seems to be using the older version of python-docx. Is there a way to do this with the new python-docx library? given that both versions do not seem to coexist well, I had to uninstall the old one. – XAnguera Jul 14 '15 at 16:04
  • Better ask that one as a new question. If you tag it with `python-docx` I'll see it. – scanny Jul 14 '15 at 23:19
  • Thanks @scanny, I managed to use this code with a simple trick. I just created a virtual environment with the old python docx library inside and called its python any time I need this script. – XAnguera Jul 16 '15 at 20:53
  • @XAnguera Can you please mention the version of `python-docx`, so that others can also achieve the same! – Asif Ali Mar 15 '19 at 08:51