I am currently working on docx files and I am using the w:lastRenderedPageBreak as a marker for every page's content. It is necessary that I determine if a page has already ended.

My current code is like this:

from docx import Document

document = Document(file)  # `file` is the path to the .docx file
for p in document.paragraphs:
    if 'lastRenderedPageBreak' in p._element.xml:
        pass  # do something
    # rest of code here

Now the problem I encountered is that a docx file that has 4 pages only has 2 w:lastRenderedPageBreak tags. I tried opening the docx file and saving it but the w:lastRenderedPageBreak tags do not increase.

The only time that the w:lastRenderedPageBreak would properly show the page breaks is when I open the docx file and save it as an XML file.

Is there any way to skip the save-as-XML step and still see every w:lastRenderedPageBreak while parsing the text and formatting with python-docx? I want to do it in Python, win32com, or VBA if possible.

Edit: The reason I want the w:lastRenderedPageBreak is that I had issues handling footnotes while parsing content, because they are formatted the same way as normal text (a problem with the source that can't be fixed). The only difference is that they have a superscript number at the beginning. This is why I need to determine whether a page has ended: if the script does not know that, it keeps appending text from the next page to the footnote until it finds a w:lastRenderedPageBreak.

Ex: I want the docx's XML to change from this:

Footnote 1: Text here. \p Additional text here that belongs to footnote 1. Footnote 2: Text here. new page text starts here...

into this:

Footnote 1: Text here. \p Additional text here that belongs to footnote 1. Footnote 2: Text here. <w:lastRenderedPageBreak> new page text starts here...

All text is contained in frames, so there is no need to worry about page size, orientation, or margins. It does not matter how the docx looks as long as the end of a page or the beginning of a new page can be marked in the content or the XML.

Frederick
  • There are no explicit pages in a document unless you explicitly enter a page break. If the file is printed on a different type of paper, eg A4 vs Letter, the pages will change. This is true for all word processors, not just Word. If you print or display a document, the pages and their contents will be calculated at runtime based on the medium's size, margins etc. – Panagiotis Kanavos Apr 05 '21 at 13:18
  • *PDF*, on the other hand, is neither editable nor a word processing format. It's essentially print commands (PostScript specifically). If you try to display or print a PDF file on a different medium you'll get the same number of pages, with stretching or cropping to fit the medium. That's why reading PDF documents on a phone is so painful – Panagiotis Kanavos Apr 05 '21 at 13:22
  • Forgot to add: all text in the file is contained in text boxes/frames. I checked the XML and the boxes already have a h:x and h:y, so it does not really matter whether my file is opened in A4 or Letter page format. I just want pagebreaks to be added and retained once loaded. – Frederick Apr 05 '21 at 13:46
  • `I just want pagebreaks to be added` you'll have to add them yourself then. Again, Word isn't PDF or a desktop publishing application. Just like HTML or LaTeX, pages are calculated dynamically based on the media and content, not the other way around – Panagiotis Kanavos Apr 05 '21 at 13:53
  • Why do you want hard-coded page numbers? What are you trying to do? Why not add page references or numbers? – Panagiotis Kanavos Apr 05 '21 at 13:55
  • I meant w:lastRenderedPageBreak, sorry. It's not about the page numbers; I am currently using python-docx to parse the contents of each page, including the text's formatting. I had issues handling footnotes because they are formatted the same way as the normal text. The only difference is that they have a superscript number at the beginning. This is why I need to determine whether a page has ended: if the script does not know that, it keeps appending text from the next page to the footnote until it finds a w:lastRenderedPageBreak. – Frederick Apr 05 '21 at 14:01
  • That's not how footnotes work. They're [a special element](http://officeopenxml.com/anatomyofOOXML.php), stored in a `FootNotes` part and always displayed in the footer *after* pagination. Again, Word is little different from LaTeX (except for the parts-as-XML files ... part). You don't need to know how a page is paginated to find its footnotes or their references. That's how Word's table of contents and lists of footnotes, images, etc. work. – Panagiotis Kanavos Apr 05 '21 at 14:08
  • `The only difference is that they have a superscript number at the beginning` do you mean that the document *doesn't* use footnotes, but someone tried to emulate them with superscripts? That's .... evil. And guaranteed to fail even if a printer with slightly different margins is selected. The way documents work this way is to *avoid* such fragility. Is that why text boxes were used in every page? To "fix" pagination and footnotes that weren't broken until someone tried to "fix" them? – Panagiotis Kanavos Apr 05 '21 at 14:11
  • Yeah, I know how normal footnotes work, but this is what makes the issue complex: the footnotes in this case are not inside footnote elements. I am not sure how they (courts) made this but I have to deal with it. I can accurately mark which content blocks are footnotes but I can't mark where each page ends/begins. – Frederick Apr 05 '21 at 14:13
  • Yes, you are correct. I have to deal with this evil, hence I ended up looking for a solution in which each end of page/start of page can be properly marked. The text boxes are another weirdness that the docs came with. I have no idea what monstrosity they used to make these doc files, but thankfully docx does not really care whether text is in frames or not. – Frederick Apr 05 '21 at 14:16
  • I had to work with such unfortunate documents (books actually) in the past, created by authors that didn't quite understand how Word works. What I did was fix them - convert the fake footnotes, TOCs etc to real ones. Otherwise even a slight change by the author could end up moving text and footnotes into different pages. Adding one paragraph or one line could mess up an entire chapter if not an entire book. The authors hadn't tried to invent their own pages though. Fortunately, the author was a retired general, not a "smart" programmer out to fix M$ – Panagiotis Kanavos Apr 05 '21 at 14:20
  • It may be easier in this case to use a VBA macro in the document to do whatever you need to do. The macro will see the pages *after* pagination. – Panagiotis Kanavos Apr 05 '21 at 14:23

1 Answer

w:lastRenderedPageBreak has too many limitations to be useful as an indicator of pagination:

  1. If a document has never been rendered, there will be no w:lastRenderedPageBreak elements.

  2. If a document has been changed since being rendered, existing w:lastRenderedPageBreak elements will be stale.

  3. Rendering can depend upon characteristics of the target media.

  4. Rendering can depend upon line- and page- breaking algorithms or details of their implementations.

  5. Even if one can live with limitations #1 through #4, w:lastRenderedPageBreak has historically had reliability issues.
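
If you still want to see what markers a given file actually contains, a minimal sketch along these lines may help. It uses only the standard library, reads `word/document.xml` directly, and counts real `w:lastRenderedPageBreak` elements and explicit `w:br w:type="page"` breaks per paragraph instead of searching the serialized XML for a substring. The function name `count_page_markers` is hypothetical, and the counts are subject to all of the limitations listed above.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's "{uri}tag" form
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def count_page_markers(docx_file):
    """Return (paragraph_index, rendered, explicit) for each paragraph
    that contains at least one page-break marker.

    rendered: number of w:lastRenderedPageBreak elements (possibly stale)
    explicit: number of w:br elements with w:type="page"
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    results = []
    # iter() walks all descendants, so paragraphs inside tables and text
    # boxes are visited too. Caveat: a paragraph that hosts a text box
    # also counts the markers of the paragraphs nested inside it.
    for index, para in enumerate(root.iter(W + "p")):
        rendered = sum(1 for _ in para.iter(W + "lastRenderedPageBreak"))
        explicit = sum(
            1 for br in para.iter(W + "br") if br.get(W + "type") == "page"
        )
        if rendered or explicit:
            results.append((index, rendered, explicit))
    return results
```

Calling `count_page_markers("some.docx")` (the path is a placeholder) on a file that has never been rendered should return an empty list unless the file contains explicit page breaks, which illustrates limitation #1.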

For further details, see:

kjhughes
  • Thanks for the summary. I can live with 1-4, and some of the issues from #5 are already handled by logic in my script. So far I have not seen tables or images in the docs I have tested. All the content is placed in frames, so the page size and margins do not really matter; I just want to know where a page ends so I can polish the current logic that handles content continuity. – Frederick Apr 05 '21 at 13:54
  • You accept that the ground is quicksand, yet you still wish to build on it. Good luck. – kjhughes Apr 05 '21 at 15:12