Is there a feature in the python-docx library to compute the number of pages in a document?
-
http://stackoverflow.com/questions/12964580/number-of-pages-of-a-word-document-with-python – Al Lelopath Jul 22 '14 at 14:21
-
@mickey: that link did not help me. It shows an example using the win32com.client library, and I want to use the python-docx library. But thank you anyway :) – omri_saadon Jul 22 '14 at 14:48
2 Answers
Not at the moment. However, unlike a way to tell where the page breaks fall in the content, such a feature could be developed, at least if you were satisfied with whatever page count Word reported the last time it saved the document.
This statistic is saved in the app.xml properties "part" by Word on each save. So if you were confident the document you were inspecting had last been saved by Word (or LibreOffice I expect would work too), then that method should be pretty reliable. If the document were generated by, say, python-docx, that statistic would be unreliable.
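As a minimal sketch of reading that stored statistic without python-docx, the standard library can open the package and parse docProps/app.xml directly. The function name `stored_page_count` is hypothetical, and the sketch assumes the part uses the standard extended-properties namespace; the value is only as trustworthy as the application that last saved the file.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the extended-properties part (docProps/app.xml).
EXT_PROPS_NS = (
    "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"
)


def stored_page_count(path):
    """Return the page count recorded in docProps/app.xml, or None if absent.

    This is the number written at the last save, not a freshly computed
    layout. `path` may be a filename or a binary file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        try:
            xml_bytes = archive.read("docProps/app.xml")
        except KeyError:
            # The package has no app.xml part at all.
            return None

    root = ET.fromstring(xml_bytes)
    pages = root.find(f"{{{EXT_PROPS_NS}}}Pages")
    text = (pages.text or "").strip() if pages is not None else ""
    return int(text) if text.isdigit() else None
```

Calling `stored_page_count("report.docx")` (the filename is a placeholder) would return an integer for a file last saved by Word, and `None` when the part or the `Pages` element is missing.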
If this is a feature you're interested in, feel free to add it to the GitHub issues list: https://github.com/python-openxml/python-docx/issues

I came up with this. Works for both pptx and docx files:
import re
import zipfile

# A .docx or .pptx file is a zip archive; the document statistics
# live in the docProps/app.xml part.
with zipfile.ZipFile("myDocxOrPptxFile.docx", "r") as archive:
    app_xml = archive.read("docProps/app.xml").decode("utf-8")

# Use \d+ so that counts of ten or more match; \1 forces the closing
# tag to be the same as the opening one.
match = re.search(r"<(Pages|Slides)>(\d+)</\1>", app_xml)
page_count = int(match.group(2)) if match else 0
print(page_count)
Office formats are just zip files with XML contents inside. You can read the contents of those files and parse them as you please.
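To see that structure for yourself, a short sketch like the following lists the XML parts inside a package. The helper name `list_xml_parts` is hypothetical; it uses only the standard `zipfile` module.

```python
import zipfile


def list_xml_parts(path):
    """Return the sorted names of the .xml members in an Office package.

    `path` may be a filename or a binary file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        return sorted(
            name for name in archive.namelist() if name.endswith(".xml")
        )
```

For a typical Word file the result would include names such as `docProps/app.xml` and `word/document.xml`, each of which can then be read with `archive.read(name)` and parsed.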
