0

Is there a way to save specific pages of one big docx into smaller docx files using python?

I've tried using python-docx but it doesn't seem to want to cooperate. Currently, I am saving it as a pdf, selecting the right pages for each different word document, and then saving them. Although, this is very slow and inefficient and it also loses some interactive elements such as checkboxes. There is also aspose-words but it has a huge watermark over it since it isn't free I guess

Is there an easier way to save specific pages into their own documents?

So to make it more concrete: Say I have a document called template.docx that has 5 pages. I want to save pages 1 and 5 into newdoc1.docx I want to save pages 2,3,4 into newdoc2.docx

Is there a way to do this nicely with some python libraries?

Eugene Astafiev
  • 47,483
  • 3
  • 24
  • 45
danielliucs
  • 61
  • 2
  • 6
  • 1
    To get an idea of where a problem is: See "Word Doesn't Know What a Page Is!" by Word MVP Daiya Mitchell https://wordmvp.com/Mac/PagesInWord.html and my page on Moving/Reorganizing Pages in Word https://addbalance.com/word/MovePages.htm#PageStart Word knows paragraphs, sentences, and sections, but not pages. – Charles Kenyon Jun 19 '23 at 18:20
  • Consider using the Open XML SDK for processing open XML documents, see [Welcome to the Open XML SDK 2.5 for Office](https://learn.microsoft.com/en-us/office/open-xml/open-xml-sdk) for more information. – Eugene Astafiev Jun 19 '23 at 18:53

2 Answers2

0

Unfortunately there is no way to detect a pages inside a docx file unless they separeted with page breaks. If that's the case you need to do following steps:

  1. Get all paragraps from Document
  2. Iterate on them and get their runs
  3. Check every run whether it contains page break or not (like this or this)
from docx import Document
from docx.text.paragraph import Paragraph

doc = Document("file.docx")
paragraphs = doc.paragraphs

for paragraph in paragraphs:
    for run in paragraph.runs:
        if 'lastRenderedPageBreak' in run._element.xml:  
            print('soft page break found at run') 
            # Do something
        elif 'w:br' in run._element.xml and 'type="page"' in run._element.xml:
            print('hard page break found at run')
            # Do something

Then you can try create a separate documents by collecting paragraphs in between page breaks (be aware that run that contains page break could also contain a text). You can look at this question and related.

0

You can achieve this using Aspose.Words and Document.extract_pages method.

import aspose.words as aw

doc = aw.Document("C:\\Temp\\in.docx")

for page in range(0, doc.page_count) :
    pageDoc = doc.extract_pages(page, 1)
    pageDoc.save("C:\\Temp\\page_" + str(page) + ".docx")

This method uses Aspose.Words layout engine to detect pages in the document. The same layout engine is used for rendering document to fixed page formats, like PDF, XPS, images etc.

Alexey Noskov
  • 1,722
  • 1
  • 7
  • 13