0

I have a MS Word document that consists of several hundred pages.

Each page is identical apart from the name of a person which is unique across each page. (One page is one user).

I would like to take this word document and automate the process to save each page individually so I end up with several hundred word documents, one for each person, rather than one document that consists of everyone, that I can then distribute to the different people.

I have been using the module python-docx found here: https://python-docx.readthedocs.io/en/latest/

I am struggling on how to achieve this task.

As far as I have researched it is not possible to loop over each page as the pages are not determined in the .docx file itself but are generated by the program, i.e. Microsoft Word.

However python-docx can interpret the text and since each page is the same can I not say to python when you see this text (the last piece of text on a given page) consider this to be the end of a page and anything after this point is a new page.

Ideally if I could write a loop that would consider such a point and create a document up until that point, and repeating the over all pages that would be great. It would need to take all formatting/pictures as well.

I am not against other methods, such as converting to PDF first if that is an option.

Any thoughts?

stgy222
  • 27
  • 1
  • 4

3 Answers3

0

I had the exact same problem. Unfortunately I could not find a way to split .docx by page. The solution was to first use python-docx or docx2python (whatever you like) to iterate over each page and extract the unique (person) information and put it in a list so you end up with:

people = ['person_A', 'person_B', 'person_C', ....]

Then save the .docx as a pdf split the pdfs up by page and then save them as person_A.pdf etc like this:

from PyPDF2 import PdfFileWriter, PdfFileReader

inputpdf = PdfFileReader(open("document.pdf", "rb"))

for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    with open(f"{people[i]}.pdf", "wb") as outputStream:
        output.write(outputStream)

The result is a bunch of one page PDFs saved as Person_A.pdf, Person_B.pdf etc. Hope that helps.

Doug
  • 234
  • 4
  • 11
0

I would suggest another package aspose-words-cloud to split a word document into separate pages. Currently, it works with cloud storage(Aspose cloud storage, Amazon S3, DropBox, Google Drive Storage, Google Cloud Storage, Windows Azure Storage and FTP Storage). However, in near future, it will support process files from the request body(streams).

P.S: I am developer evangelist at Aspose.

# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
import os
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile


# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id='xxxxx-xxxxx-xxxx-xxxxx-xxxxxxxxxxx'
client_secret='xxxxxxxxxxxxxxxxxx'

words_api = asposewordscloud.WordsApi(client_id,client_secret)
words_api.api_client.configuration.host='https://api.aspose.cloud'

remoteFolder = 'Temp'
localFolder = 'C:/Temp'
localFileName = '02_pages.docx'
remoteFileName = '02_pages.docx'

#upload file
words_api.upload_file(asposewordscloud.models.requests.UploadFileRequest(open(localFolder + '/' + localFileName,'rb'),remoteFolder + '/' + remoteFileName))

#Split DOCX pages as a zip file
request = asposewordscloud.models.requests.SplitDocumentRequest(name=remoteFileName, format='docx', folder=remoteFolder, zip_output= 'true')
result = words_api.split_document(request)
print("Result {}".format(result.split_result.zipped_pages.href))

#download file
request_download=asposewordscloud.models.requests.DownloadFileRequest(result.split_result.zipped_pages.href)
response_download = words_api.download_file(request_download)
copyfile(response_download, 'C:/'+ result.split_result.zipped_pages.href)

Tilal Ahmad
  • 940
  • 5
  • 9
0

This is a DOCX Splitter Online, that to split the .docx file you uploaded into multiple .docx files (per page of the original), and to download them as a .zip file.

(tested 13th Apr 2023 with a 30-page .docx file)

https://products.groupdocs.app/splitter/docx

Mark K
  • 8,767
  • 14
  • 58
  • 118